Showing content from https://github.com/tc39/proposal-regexp-set-notation/issues/30 below:

Proposal

The current proposal is to use deep case closure: #30 (comment) --> #30 (comment) (limited to only simple case folding)

More specifically:

Define an abstract operation SimpleCaseClosure(A) where A is a CharSet.
(Note: scf = Unicode Simple_Case_Folding: the simple mappings in CaseFolding.txt, as in the Canonicalize(ch) operation)

For each single code point c in A, add every other code point d where scf(d)==c or where scf(c)==d
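
The definition above can be sketched in a few lines. This is only an illustration under stated assumptions: `scf`, `EXTRA_FOLDS`, and `UNIVERSE` are hypothetical stand-ins that cover ASCII plus two extra code points, whereas a real implementation would use the full simple (C and S) mappings from CaseFolding.txt and range over all code points.

```javascript
// Toy stand-in for Unicode Simple_Case_Folding (scf). Assumption: only ASCII
// letters and two extra code points are covered.
const EXTRA_FOLDS = new Map([
  [0x017f, 0x73], // LATIN SMALL LETTER LONG S folds to "s"
  [0x212a, 0x6b], // KELVIN SIGN folds to "k"
]);

function scf(cp) {
  if (cp >= 0x41 && cp <= 0x5a) return cp + 0x20; // A-Z fold to a-z
  return EXTRA_FOLDS.get(cp) ?? cp;
}

// Code points known to this toy model (ASCII plus the two extras).
const UNIVERSE = [...Array(0x80).keys(), 0x017f, 0x212a];

// SimpleCaseClosure(A): for each c in A, add every d with scf(d) == c or scf(c) == d.
function simpleCaseClosure(charSet) {
  const result = new Set(charSet);
  for (const c of charSet) {
    result.add(scf(c)); // the d for which scf(c) == d
    for (const d of UNIVERSE) {
      if (scf(d) === c) result.add(d); // every d for which scf(d) == c
    }
  }
  return result;
}

// Closure of the single code point "k".
const closure = simpleCaseClosure(new Set([0x6b]));
console.log([...closure].map((cp) => cp.toString(16)));
```

In this toy model, the closure of "k" picks up both "K" and the Kelvin sign, because both fold to "k".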

When building a CharSet from a character class or from a CharacterClassEscape, if IgnoreCase==true and /v is specified:

literal character c: create a CharSet A with just c and return SimpleCaseClosure(A)
range a-b: create a CharSet A with the one contiguous range of code points from a to b and return SimpleCaseClosure(A)
- Note: The result will often consist of two or more ranges. SimpleCaseClosure([a-z]) will include the separate range [A-Z] and several other non-adjacent code points.
\p{X}: resolve the property expression into a CharSet A and return SimpleCaseClosure(A)
\P{X}: resolve the property expression into a CharSet A, compute CharSet B = SimpleCaseClosure(A), return the code point complement of B
\w, \W, \s, \S, etc. are handled the same way: look up the property CharSet, compute the case closure, and, for the backslash-uppercase escapes, return the code point complement
set operations work as usual
[^...]: compute the inner character class expression into CharSet A and return the code point complement of A; if this is a top-level CharacterClass, then return with invert=false
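
As a rough sketch of the rules above, the following models \p{Ll}, \P{Ll}, and [^\P{Ll}] as plain sets, with the closure computed before any complement. Everything here is an assumption made for illustration: the universe is ASCII only, `scf` is an ASCII-only stand-in for Simple_Case_Folding, and `lowercaseLetters` is a hypothetical replacement for resolving the real Ll property.

```javascript
// Toy universe: ASCII only (assumption for illustration).
const UNIVERSE = [...Array(0x80).keys()];
// ASCII-only stand-in for Simple_Case_Folding.
const scf = (cp) => (cp >= 0x41 && cp <= 0x5a ? cp + 0x20 : cp);

function simpleCaseClosure(charSet) {
  const result = new Set(charSet);
  for (const c of charSet) {
    result.add(scf(c));
    for (const d of UNIVERSE) if (scf(d) === c) result.add(d);
  }
  return result;
}

// Code point complement within the toy universe.
const complement = (charSet) => new Set(UNIVERSE.filter((cp) => !charSet.has(cp)));

// Hypothetical stand-in for resolving \p{Ll}: lowercase ASCII letters only.
const lowercaseLetters = new Set(UNIVERSE.filter((cp) => cp >= 0x61 && cp <= 0x7a));

// \p{Ll}: case closure of the property set.
const pLl = simpleCaseClosure(lowercaseLetters);
// \P{Ll}: case closure first, then code point complement.
const notLl = complement(simpleCaseClosure(lowercaseLetters));
// [^\P{Ll}]: code point complement of the inner set, with no deferred "invert".
const doubleNegation = complement(notLl);

const has = (set, ch) => set.has(ch.codePointAt(0));
console.log(has(pLl, "A"), has(notLl, "A"), has(notLl, "4"), has(doubleNegation, "a"));
```

In this model the double negation ends up with the same members as the closure of \p{Ll}, which is the consistency the proposal is after.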

Problem description

The current draft spec text includes a TODO to discuss whether to do something special about IgnoreCase.
This becomes interesting when looking at IgnoreCase + complement + nested classes.

Notation: In Unicode=true mode, ES regular expressions apply Unicode Simple_Case_Folding, which has the short name scf.

In our little working group, we had been chewing on this question on and off without coming to a conclusion.
We had been trying to rationalize and match the existing behavior, with the goal of being consistent with existing matching. One idea we discussed was doing an early "case closure" when IgnoreCase=true (for each c in the CharSet, add any c2 where scf(c2)==c), at least when a nested class is complemented, and maybe for any nested class regardless.

Then we realized that the existing matching behavior is inconsistent with itself (or at least unintuitive).

Looking at the existing spec:

The spec already allows one level of nested classes in the form of CharacterClassEscapes inside CharacterClasses.
The complement of a CharacterClassEscape (\ plus uppercase letter) is resolved immediately, computing the code point complement of its CharSet.
The complement of a CharacterClass ([^ ClassRanges ]) is deferred via the "invert" boolean until the CharacterSetMatcher operates on the CharSet, rather than complementing the CharSet itself.
When IgnoreCase=true, Canonicalize (which the CharacterSetMatcher invokes) applies the Unicode Simple_Case_Folding (scf) before the "invert" complement is applied.

In other words, with IgnoreCase=true, the matching behavior for a CharacterClass is very different from that for a CharacterClassEscape, and using a complemented CharacterClassEscape inside a CharacterClass is different from a normal CharacterClassEscape inside a complemented CharacterClass.

Example:

re1=/\p{Ll}/giu
re2=/[^\P{Ll}]/giu

Naïvely, I expected these to behave the same. Actual results:

"aAbBcC4#".replaceAll(re1, 'X') outputs "XXXXXX4#"
- The CharSet contains all lowercase letters.
- invert=false
- For any letter m in the text, there is a letter n in the set such that scf(m)==scf(n).
- No match for digits, punctuation, etc.
"aAbBcC4#".replaceAll(re2, 'X') outputs "aAbBcC4#"
- The CharSet contains all Unicode code points other than lowercase letters: Uppercase/titlecase/other letters, digits, symbols, punctuation, unassigned, etc.
- invert=true
- For any letter m in the text, there is a letter n in the set such that scf(m)==scf(n), but "invert" negates "found".
- No match for any letters.
- No match for digits/punctuation/etc. either because they are in the CharSet directly and "invert" negates "found".
- In other words, under IgnoreCase \P{Ll} matches everything (with possible exceptions if there are unusual character properties) and so its CharacterClass complement matches nothing.
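
The two results can be reproduced with a small model of the matching steps described above. This is a sketch, not the spec algorithm itself: `canonicalize` is an ASCII-only stand-in for scf, the universe is limited to ASCII, and `matches` and `replaceMatches` are hypothetical helper names.

```javascript
// Toy universe: ASCII only (assumption for illustration).
const UNIVERSE = [...Array(0x80).keys()];
// ASCII-only stand-in for Canonicalize with IgnoreCase=true.
const canonicalize = (cp) => (cp >= 0x41 && cp <= 0x5a ? cp + 0x20 : cp);

// Model of the matcher: "found" means some member of the CharSet canonicalizes
// to the same code point as the input; "invert" then negates "found".
function matches(charSet, invert, ch) {
  const target = canonicalize(ch.codePointAt(0));
  let found = false;
  for (const member of charSet) {
    if (canonicalize(member) === target) {
      found = true;
      break;
    }
  }
  return invert ? !found : found;
}

// Stand-ins for the CharSets of \p{Ll} and \P{Ll}, restricted to ASCII.
const lowercaseLetters = new Set(UNIVERSE.filter((cp) => cp >= 0x61 && cp <= 0x7a));
const notLowercase = new Set(UNIVERSE.filter((cp) => !lowercaseLetters.has(cp)));

// Replace every matching character with "X", like replaceAll with a global regex.
const replaceMatches = (text, charSet, invert) =>
  [...text].map((ch) => (matches(charSet, invert, ch) ? "X" : ch)).join("");

// Models re1: CharSet of lowercase letters, invert=false.
const re1Result = replaceMatches("aAbBcC4#", lowercaseLetters, false);
// Models re2: CharSet of everything except lowercase letters, invert=true.
const re2Result = replaceMatches("aAbBcC4#", notLowercase, true);
console.log(re1Result, re2Result);
```

In the second call every input character is "found" (letters through their uppercase counterparts, the rest directly), so "invert" rejects all of them.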

In our proposed spec text, we currently do nothing special. Just like in the current spec text, a CharacterClassEscape simply always evaluates to a CharSet, with a code point complement as appropriate. A nested class with brackets also simply evaluates to a CharSet, with a code point complement for [^ ClassRanges ]; it does not use the CharacterClass production, which avoids having to return a (CharSet, invert) pair.

We could consider doing early case closure of nested classes and properties, and/or doing a code point complement of the CharacterClass CharSet and removing the "invert" boolean, and/or something else. We need to weigh "improving" behavior vs. making it different from existing behavior for the same or similar patterns.
