The current proposal is to use deep case closure: #30 (comment) --> #30 (comment) (limited to only simple case folding)
More specifically:
Define an abstract operation SimpleCaseClosure(A) where A is a CharSet.
(Note: scf = Unicode Simple_Case_Folding: the simple mappings in CaseFolding.txt, as in the Canonicalize(ch) operation)
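A minimal sketch of what SimpleCaseClosure(A) could compute, assuming the closure contains every code point whose scf equals the scf of some member of A. The function name `simpleCaseClosure`, the helper `rangeSet`, and the modeling of a CharSet as a `Set` of code point numbers are all hypothetical. Rather than parsing CaseFolding.txt, the sketch leans on the engine's existing `iu` matching, which compares Canonicalize(ch) values; it is a brute-force illustration, not proposed spec text.

```javascript
// Hypothetical model: a CharSet is a Set of code points (numbers).
function simpleCaseClosure(charSet) {
  // Build a class such as [\u{61}\u{62}] from the members of the CharSet.
  const body = [...charSet].map((cp) => `\\u{${cp.toString(16)}}`).join("");
  if (body === "") return new Set();
  // With the i and u flags, a character matches the class when its simple
  // case folding equals the simple case folding of some class member.
  const re = new RegExp(`^[${body}]$`, "iu");
  const closure = new Set();
  // Brute force over all code points: slow, but fine for a sketch.
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    if (re.test(String.fromCodePoint(cp))) closure.add(cp);
  }
  return closure;
}

// Hypothetical helper: the contiguous range of code points from lo to hi.
function rangeSet(lo, hi) {
  const set = new Set();
  for (let cp = lo; cp <= hi; cp++) set.add(cp);
  return set;
}

// SimpleCaseClosure([a-z]) picks up [A-Z] plus non-adjacent code points
// such as U+017F (long s) and U+212A (Kelvin sign).
const closed = simpleCaseClosure(rangeSet(0x61, 0x7a));
console.log(closed.has(0x41), closed.has(0x17f), closed.has(0x212a));
```

The scan visits every code point once, so the cost does not depend on the size of A beyond the regular expression's own class lookup.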
When building a CharSet from a character class or from a CharacterClassEscape, if IgnoreCase==true and /v is specified:
- c: create a CharSet A with just c and return SimpleCaseClosure(A)
- a-b: create a CharSet A with the one contiguous range of code points from a to b and return SimpleCaseClosure(A)
  - SimpleCaseClosure([a-z]) will include the separate range [A-Z] and several other non-adjacent code points.
- \p{X}: resolve the property expression into a CharSet A and return SimpleCaseClosure(A)
- \P{X}: resolve the property expression into a CharSet A, compute CharSet B = SimpleCaseClosure(A), return the code point complement of B
- \w \W \s \S etc.: look up the property CharSet, compute the case closure, return the code point complement for backslash-uppercase escapes
- [^...]: compute the inner character class expression into CharSet A and return the code point complement of A; if this is a top-level CharacterClass, then return with invert=false

The current draft spec text includes a TODO to discuss whether to do something special about IgnoreCase.
This becomes interesting when looking at IgnoreCase + complement + nested classes.
Notation: In Unicode=true mode, ES regular expressions apply Unicode Simple_Case_Folding, which has the short name scf.
In our little working group, we had been chewing on this question on and off without coming to a conclusion.
We had been trying to rationalize and match the existing behavior, and discussed doing an early "case closure" when IgnoreCase=true (for each c in the CharSet, add any c2 if scf(c2)=c), at least when a nested class is complemented (and maybe for any nested class regardless), with the goal of being consistent with existing matching behavior.
Then we realized that the existing matching behavior is inconsistent with itself (or at least unintuitive).
Looking at the existing spec:
- The complement of a CharacterClassEscape (\ plus uppercase letter) is resolved immediately, computing the code point complement of its CharSet.
- The complement of a CharacterClass ([^ ClassRanges ]) is deferred via the "invert" boolean until the CharacterSetMatcher operates on the CharSet, rather than complementing the CharSet itself.

In other words, with IgnoreCase=true, the matching behavior for a CharacterClass is very different from that for a CharacterClassEscape, and using a complemented CharacterClassEscape inside a CharacterClass is different from using a normal CharacterClassEscape inside a complemented CharacterClass.
Example:
re1=/\p{Ll}/giu
re2=/[^\P{Ll}]/giu
Naïvely, I expected these to behave the same. Actual results:
"aAbBcC4#".replaceAll(re1, 'X')
outputs "XXXXXX4#"
"aAbBcC4#".replaceAll(re2, 'X')
outputs "aAbBcC4#"
With IgnoreCase, \P{Ll} matches everything (with possible exceptions if there are unusual character properties) and so its CharacterClass complement matches nothing.

In our proposed spec text, we currently do nothing special. Just like in the current spec text, a CharacterClassEscape simply always evaluates to a CharSet, with a code point complement as appropriate. And a nested class with brackets (which does not use the CharacterClass production, to avoid having to return a (CharSet, invert) pair) also simply evaluates to a CharSet, with a code point complement for [^ ClassRanges ].
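The difference between re1 and re2 can be reproduced with a toy model of the existing matching steps. This is only a sketch under loud assumptions: the "universe" of code points is restricted to the characters of the sample string, `canon` is a stand-in for Canonicalize(ch) that uses `toLowerCase` (which agrees with scf for these ASCII characters only), and `matches` and `replaceModel` are hypothetical names, not spec operations.

```javascript
// Toy universe: only the characters of the sample string (an assumption).
const universe = [..."aAbBcC4#"];
// Stand-in for Canonicalize(ch); valid for these ASCII characters only.
const canon = (ch) => ch.toLowerCase();

// CharSet for \p{Ll}, and the immediate code point complement for \P{Ll}.
const Ll = universe.filter((ch) => /\p{Ll}/u.test(ch));
const notLl = universe.filter((ch) => !Ll.includes(ch));

// Model of the CharacterSetMatcher: compare canonicalized values first,
// then apply the deferred "invert" boolean.
function matches(charSet, invert, ch) {
  const found = charSet.some((a) => canon(a) === canon(ch));
  return invert ? !found : found;
}

const replaceModel = (charSet, invert) =>
  universe.map((ch) => (matches(charSet, invert, ch) ? "X" : ch)).join("");

const model1 = replaceModel(Ll, false);   // models /\p{Ll}/giu
const model2 = replaceModel(notLl, true); // models /[^\P{Ll}]/giu
console.log(model1); // "XXXXXX4#"
console.log(model2); // "aAbBcC4#"
```

In the second case every character of the universe has a case-insensitive partner in the complemented set, so "found" is always true and the deferred inversion rejects everything.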
We could consider doing early case closure of nested classes and properties, and/or doing a code point complement of the CharacterClass CharSet and removing the "invert" boolean, and/or something else. We need to weigh "improving" behavior vs. making it different from existing behavior for the same or similar patterns.
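To make one of these options concrete, here is a sketch of early case closure combined with a plain code point complement, again in a toy universe limited to the sample characters. The helpers `toyClosure` and `toyComplement` are hypothetical, and `fold` approximates scf with `toLowerCase`, which only holds for these ASCII characters. Under those assumptions, \p{Ll} and [^\P{Ll}] end up as the same CharSet.

```javascript
// Toy universe and a stand-in for scf (both are assumptions for brevity).
const chars = [..."aAbBcC4#"];
const fold = (ch) => ch.toLowerCase();

// Early case closure: add every character that folds like some member.
const toyClosure = (set) =>
  chars.filter((ch) => set.some((a) => fold(a) === fold(ch)));
// Plain code point complement within the toy universe.
const toyComplement = (set) => chars.filter((ch) => !set.includes(ch));

const lower = chars.filter((ch) => /\p{Ll}/u.test(ch));

const direct = toyClosure(lower);                   // \p{Ll}
const negated = toyComplement(toyClosure(lower));   // \P{Ll}
const doubleNegated = toyComplement(negated);       // [^\P{Ll}]

console.log(direct.join(""));        // "aAbBcC"
console.log(negated.join(""));       // "4#"
console.log(doubleNegated.join("")); // "aAbBcC"
```

Because the closure is taken before the complement, the complement no longer contains case partners of the original members, and no "invert" boolean is needed at match time.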