

# Build finite state automata python code

Ideas for future work:

* Large Unicode character classes can create a lot of additional work for both the determinizer and the minimizer, and I suspect this is the key thing we'll want to improve if we want to make DFA compile times faster. I believe it's possible to build minimal or nearly minimal NFAs for the special case of Unicode character classes by leveraging Daciuk's algorithms for building minimal automata in linear time for sets of strings. The key adaptation I think we need to make is to modify the algorithm to operate on byte ranges instead of enumerating every codepoint in the set (a byte-range sketch follows this list).
* Add support for regex sets. It should be possible to do this by "simply" introducing more match states, and I think we can also report the position of each match, similar to how Aho-Corasick works (a multi-pattern sketch follows this list). I think the long pole in the tent here is probably the API design work and arranging it so that we don't introduce extra overhead into the non-regex-set case without duplicating a lot of code.
* Stretch goal: support capturing groups by implementing "tagged" DFAs (transducers). Laurikari's paper is the usual reference here, but Trofimovich has a much more thorough treatment. I've only read the paper once, and I suspect it will require at least a few more read-throughs before I understand it.
* Possibly less ambitious goal: can we select a portion of Trofimovich's work to make small fixed-length look-around work? It would be really nice to support ^, $ and \b, especially the Unicode variant of \b and a CRLF-aware $.
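To make the byte-range idea concrete, here is a minimal sketch; the Transition type is made up for illustration and is not the crate's actual representation. The point is that a character class stored as a few sorted byte ranges stays small even when the class covers many codepoints.

```rust
// Hypothetical sketch: a transition that covers an entire range of bytes at
// once, instead of one transition per codepoint in a character class.
struct Transition {
    start: u8,   // inclusive lower bound of the byte range
    end: u8,     // inclusive upper bound of the byte range
    next: usize, // target state id
}

fn main() {
    // Enumerated codepoint-by-codepoint, the class [0-9a-z] needs 36
    // transitions; expressed as byte ranges it needs only two.
    let class = vec![
        Transition { start: b'0', end: b'9', next: 1 },
        Transition { start: b'a', end: b'z', next: 1 },
    ];

    // Testing a byte against the class is a scan (or a binary search, since
    // the ranges are sorted and non-overlapping) over the ranges.
    let byte = b'q';
    let hit = class.iter().find(|t| t.start <= byte && byte <= t.end);
    println!("byte {byte} -> next state: {:?}", hit.map(|t| t.next));
}
```

Multi-byte UTF-8 sequences would add intermediate states, but the same range compression applies at each byte position.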

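For the regex-set item, here is a minimal sketch of what "more match states" could look like; SetDfa and find_all are made-up names, not anything from the crate. Each state carries the pattern IDs that are complete there, so a single scan can report which pattern matched and where it ended, much like Aho-Corasick.

```rust
// Hypothetical sketch of a regex-set style DFA: instead of a single
// "is this a match state?" bit, each state records which pattern IDs match.
struct SetDfa {
    matches: Vec<Vec<usize>>,       // matches[state] = pattern IDs matched in that state
    transitions: Vec<[usize; 256]>, // transitions[state][byte] = next state
}

impl SetDfa {
    // Report (end_offset, pattern_id) pairs as the DFA scans the haystack.
    fn find_all(&self, haystack: &[u8]) -> Vec<(usize, usize)> {
        let mut out = Vec::new();
        let mut state = 0;
        for (i, &b) in haystack.iter().enumerate() {
            state = self.transitions[state][b as usize];
            for &pid in &self.matches[state] {
                out.push((i + 1, pid));
            }
        }
        out
    }
}

fn main() {
    // Hand-built, anchored DFA for the two-pattern set {"ab", "b"}:
    // state 0 = start, 1 = saw 'a', 2 = matched "b", 3 = dead, 4 = matched "ab".
    let mut t0 = [3usize; 256];
    t0[b'a' as usize] = 1;
    t0[b'b' as usize] = 2;
    let mut t1 = [3usize; 256];
    t1[b'b' as usize] = 4;
    let dead = [3usize; 256];

    let dfa = SetDfa {
        matches: vec![vec![], vec![], vec![1], vec![], vec![0]],
        transitions: vec![t0, t1, dead, dead, dead],
    };
    println!("{:?}", dfa.find_all(b"ab")); // [(2, 0)]: pattern 0 ("ab") ends at offset 2
    println!("{:?}", dfa.find_all(b"b"));  // [(1, 1)]: pattern 1 ("b") ends at offset 1
}
```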
Here are some key differences from the regex crate, downsides first:

* This crate does not support regex sets.
* This crate does not support zero-width assertions such as ^, $, \b or \B.
* As a lower level crate, this library does not do literal optimizations. In exchange, you get predictable performance regardless of input. The philosophy here is that literal optimizations should be applied at a higher level, although there is no easy support for this in the ecosystem yet.
* There is no &str API like in the regex crate. By default, match indices are guaranteed to fall on UTF-8 boundaries, unless RegexBuilder::allow_invalid_utf8 is enabled.

With some of the downsides out of the way, here are some positive differences:

* Both dense and sparse DFAs can be serialized to raw bytes, and then cheaply deserialized. Deserialization always takes constant time since searching can be performed directly on the raw serialized bytes of a DFA (a sketch of this idea appears at the end of this section).
* This crate was specifically designed so that the searching phase of a DFA has minimal runtime requirements, and can therefore be used in no_std environments. While no_std environments cannot compile regexes, they can deserialize pre-compiled regexes.
* Since this crate builds DFAs ahead of time, it will generally out-perform the regex crate on equivalent tasks. The performance difference is likely not large, however, and because of a complex set of optimizations in the regex crate (like literal optimizations), an accurate performance comparison may be difficult to do.

Finally, building full DFAs for large Unicode character classes is expensive, so you should only use Unicode character classes if you absolutely need them! For example, a pattern built with let re = Regex::new(r" takes under 1 millisecond to compile and less than 5KB of memory.
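To give a rough sense of scale for that advice, the sketch below counts how many codepoints an ASCII-only word class covers versus a Unicode-aware one. It uses char::is_alphanumeric as a stand-in for the Unicode word property, so it is not the exact definition of \w; treat the numbers as illustrative only.

```rust
// Rough illustration of why Unicode-aware classes are much more expensive to
// turn into automata: they cover vastly more codepoints, each of which maps
// to UTF-8 byte sequences the DFA must account for.
fn main() {
    let ascii_word = (0u32..128)
        .filter_map(char::from_u32)
        .filter(|c| c.is_ascii_alphanumeric() || *c == '_')
        .count();
    // is_alphanumeric is only an approximation of the Unicode `word` property.
    let unicode_word_ish = (0u32..=0x10FFFF)
        .filter_map(char::from_u32)
        .filter(|c| c.is_alphanumeric() || *c == '_')
        .count();
    println!("ASCII-only \\w-like class covers {ascii_word} codepoints");        // 63
    println!("Unicode-aware \\w-like class covers {unicode_word_ish} codepoints"); // vastly more
}
```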

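Finally, to make the constant-time deserialization and no_std claims above concrete, here is a minimal sketch of my own (BorrowedDfa is a made-up type, not the crate's serialization format or API): a dense DFA whose transition table is a flat byte slice. "Deserializing" it is just borrowing the bytes, and the search loop indexes into them directly without allocating or rebuilding anything.

```rust
// Hypothetical sketch: a dense DFA whose transition table is a flat byte
// slice. Searching runs directly on those bytes, which is why deserialization
// is constant time and why searching can work in no_std environments.
struct BorrowedDfa<'a> {
    table: &'a [u8], // table[state * 256 + byte] = next state
    match_state: u8,
}

impl<'a> BorrowedDfa<'a> {
    // Constant-time "deserialization": just validate the length and borrow.
    fn from_bytes(table: &'a [u8], match_state: u8) -> BorrowedDfa<'a> {
        assert_eq!(table.len() % 256, 0);
        BorrowedDfa { table, match_state }
    }

    fn is_match(&self, haystack: &[u8]) -> bool {
        let mut state = 0u8;
        for &b in haystack {
            state = self.table[state as usize * 256 + b as usize];
            if state == self.match_state {
                return true;
            }
        }
        false
    }
}

fn main() {
    // Two-state DFA that matches any input containing the byte b'z'.
    let mut table = vec![0u8; 2 * 256];
    table[b'z' as usize] = 1; // state 0 --'z'--> state 1
    for b in 0..256usize {
        table[256 + b] = 1; // state 1 loops on every byte (stays a match)
    }
    // Pretend `table` was read back from disk: searching needs no decoding.
    let dfa = BorrowedDfa::from_bytes(&table, 1);
    assert!(dfa.is_match(b"puzzle"));
    assert!(!dfa.is_match(b"apple"));
    println!("ok");
}
```

A real serialization format also needs endianness and validation checks, but the core property is the same one described above: the search runs directly over the serialized representation.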