I've consistently found that the best way to kill a thread is to rename it accurately <wink>. Agree w/ Guido that few people really care about the differing semantics. Agree w/ Andrew that it's bad to pull a semantic switcheroo at this stage anyway: code will definitely break. Like \b(?: (?P<keyword>and|if|else|...) | (?P<identifier>[a-zA-Z_]\w*) )\b The (special)|(general) idiom relies on left-to-right match-and-out searching of alternatives to do its job correctly. Not to mention that \b is not a word-boundary assertion in the new pkg (talk about pointlessly irritating differences! at least this one could be easily hidden via brainless preprocessing). Over the long run, moving to a DFA locks Python out of the directions Perl is *moving*, namely embedding all sorts of runtime gimmicks in regexps that exploit knowing the "state of the match so far". DFAs don't work that way. I don't mind losing those possibilities, because I think the regexp sublanguage is strained beyond its limits already. But that's a decision with Big Consequences, so deserves some thought. I'd definitely like the (sometimes dramatically) increased speed a DFA can offer (btw, this code appears to use a lazily-generated DFA, to avoid the exponential *compile*-time a straightforward DFA implementation can suffer -- the code is very complex and lacks any high-level internal docs, so we better hope Henry stays in love with it <0.5 wink>). > ... > My guess is that people don't so much translate long Perl regexp's > to Python but simply transport their (always incomplete -- Larry Wall > *wants* it that way :-) knowledge of Perl regexps to Python. This is directly proportional to the number of feeble CGI programmers Python attracts <wink>. The good news is that they wouldn't know an NFA from a DFA if Larry bit Henry on the ass ... > My meta-guess is that this is also Henry Spencer's and John > Ousterhout's guess. I think Spencer strongly favors DFA semantics regardless of fashion, and Ousterhout is a pragmatist. So I trust JO's judgment more <0.9 wink>. > As for Larry Wall, I guess he really doesn't care :-) I expect he cares a lot! Because a DFA would prevent Perl from going even more insane in its present direction. About the age of the code, postings to comp.lang.tcl have Henry saying he was working on the alpha version intensely as recently as Decemeber ('98). A few complaints about the alpha release trickled in, about regexp compile speed and regexp matching speed in specific cases. Perhaps paradoxically, the latter were about especially simple regexps with long fixed substrings (where this mountain of sophisticated machinery is likely to get beat cold by an NFA with some fixed-substring lookahead smarts -- which latter Henry intended to graft into this pkg too). [Andrew] > BTW, there's an interesting reference, I assume to this code, in > _Mastering Regular Expressions_; Spencer is quoted on page 121 as > saying it's "at worst quadratic in text size.". [Guido] > Not sure if that was the same code -- this is *new* code, not > Spencer's old code. I think Friedl's book is older than the current > code. I expect this is an invariant, though: it's not natural for a DFA to know where subexpression matches begin and end, and there's a pile of xxx_dissect functions in regexec.c that use what strongly appear to be worst-case quadratic-time algorithms for figuring that out after it's known that the overall expression has *a* match. Expect too, but don't know, that only pathological cases are actually expensive. Question: has this package been released in any other context, or is it unique to Tcl? I searched in vain for an announcement (let alone code) from Henry, or any discussion of this code outside the Tcl world. whatever-happens-i-vote-we-let-them-debug-it<wink>-ly y'rs - tim
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4