Showing content from https://github.com/tc39/proposal-regexp-set-notation/issues/30 below:

Proposal

The current proposal is to use deep case closure: #30 (comment) --> #30 (comment) (limited to only simple case folding)

More specifically:

Define an abstract operation SimpleCaseClosure(A) where A is a CharSet.
(Note: scf = Unicode Simple_Case_Folding: the simple mappings in CaseFolding.txt, as in the Canonicalize(ch) operation)

When building a CharSet from a character class or from a CharacterClassEscape, if IgnoreCase==true and /v is specified:

Problem description

The current draft spec text includes a TODO to discuss whether to do something special about IgnoreCase.
This becomes interesting when looking at IgnoreCase + complement + nested classes.

Notation: In Unicode=true mode, ES regular expressions apply Unicode Simple_Case_Folding, which has the short name scf.

In our little working group, we had been chewing on this question on and off without coming to a conclusion.
We had been trying to rationalize and match the existing behavior. One idea we discussed was doing an early "case closure" when IgnoreCase=true (for each c in the CharSet, add any c2 if scf(c2)=c), at least when a nested class is complemented (and maybe for any nested class regardless), with the goal of being consistent with existing matching behavior.
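To make the early "case closure" idea concrete, here is a minimal sketch of the rule as worded above. It is not spec text. `toyScf` is a hypothetical stand-in for scf that folds only ASCII letters, and `UNIVERSE` is a hypothetical eight-character stand-in for the set of all code points.

```javascript
// Toy stand-in for scf (Simple_Case_Folding): ASCII letters only.
// Assumption for illustration; real scf comes from Unicode's CaseFolding.txt.
function toyScf(ch) {
  return /^[A-Z]$/.test(ch) ? ch.toLowerCase() : ch;
}

// Hypothetical finite universe standing in for all code points.
const UNIVERSE = [..."aAbBcC4#"];

// Early case closure as worded above: for each c in the CharSet,
// add any c2 from the universe with scf(c2) === c.
function earlyCaseClosure(charSet) {
  const result = new Set(charSet);
  for (const c of charSet) {
    for (const c2 of UNIVERSE) {
      if (toyScf(c2) === c) result.add(c2);
    }
  }
  return result;
}

// Join the members in code unit order so the result is easy to compare.
const closedLower = [...earlyCaseClosure(new Set(["a", "4"]))].sort().join("");
const closedUpper = [...earlyCaseClosure(new Set(["A"]))].sort().join("");

console.log(closedLower);
console.log(closedUpper);
```

In this toy model, a set containing "a" gains "A", while a set containing only "A" gains nothing, because no character folds to "A". That asymmetry comes from reading the rule literally as written here.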

Then we realized that the existing matching behavior is inconsistent with itself (or at least unintuitive).

Looking at the existing spec:

In other words, with IgnoreCase=true, the matching behavior for a CharacterClass is very different from that for a CharacterClassEscape, and using a complemented CharacterClassEscape inside a CharacterClass is different from a normal CharacterClassEscape inside a complemented CharacterClass.

Example:

re1=/\p{Ll}/giu
re2=/[^\P{Ll}]/giu

Naïvely, I expected these to behave the same. Actual results:

  1. "aAbBcC4#".replaceAll(re1, 'X') outputs "XXXXXX4#"
  2. "aAbBcC4#".replaceAll(re2, 'X') outputs "aAbBcC4#"
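For anyone who wants to reproduce this, the two calls can be run as one snippet. It assumes an engine that supports Unicode property escapes and `String.prototype.replaceAll`; the variable names `input`, `out1` and `out2` are only for illustration.

```javascript
const input = "aAbBcC4#";

// A CharacterClassEscape on its own.
const re1 = /\p{Ll}/giu;
// The complemented escape inside a complemented CharacterClass.
const re2 = /[^\P{Ll}]/giu;

const out1 = input.replaceAll(re1, "X");
const out2 = input.replaceAll(re2, "X");

console.log(out1);
console.log(out2);
```

On a conforming engine the two printed strings should match results 1 and 2 above.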

In our proposed spec text, we currently do nothing special. Just like in the current spec text, a CharacterClassEscape simply always evaluates to a CharSet, with a code point complement as appropriate. And a nested class with brackets (which does not use the CharacterClass production to avoid having to return a (CharSet, invert) pair) also simply evaluates to a CharSet, with a code point complement for [^ ClassRanges ].

We could consider doing early case closure of nested classes and properties, and/or doing a code point complement of the CharacterClass CharSet and removing the "invert" boolean, and/or something else. We need to weigh "improving" behavior vs. making it different from existing behavior for the same or similar patterns.
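A toy model may help when comparing these options. It is a rough sketch, not the spec algorithm: `universe` is a hypothetical eight-character stand-in for all code points, `canonicalize` folds only ASCII letters, and `isLl` approximates \p{Ll} with a-z. It compares the existing (CharSet, invert) matching with the "code point complement, no invert" alternative for the re2 pattern.

```javascript
// Toy universe and toy canonicalization (ASCII only); both are assumptions.
const universe = [..."aAbBcC4#"];
const canonicalize = (ch) => (/^[A-Z]$/.test(ch) ? ch.toLowerCase() : ch);
const isLl = (ch) => /^[a-z]$/.test(ch);

// Code point complement relative to the toy universe.
const complement = (set) => new Set(universe.filter((ch) => !set.has(ch)));

// Case-insensitive matcher modeled on a (CharSet, invert) pair:
// "found" means some member of charSet canonicalizes to the same value as ch.
function matches(charSet, invert, ch) {
  const found = [...charSet].some((a) => canonicalize(a) === canonicalize(ch));
  return invert ? !found : found;
}

// Replace every matching character of the universe with "X".
const replaceWith = (charSet, invert) =>
  universe.map((ch) => (matches(charSet, invert, ch) ? "X" : ch)).join("");

const lower = new Set(universe.filter(isLl)); // models \p{Ll}
const notLower = complement(lower);           // models \P{Ll}

const existingRe1 = replaceWith(lower, false);                   // /\p{Ll}/iu
const existingRe2 = replaceWith(notLower, true);                 // /[^\P{Ll}]/iu, invert flag
const alternativeRe2 = replaceWith(complement(notLower), false); // complement, no invert

console.log(existingRe1, existingRe2, alternativeRe2);
```

In this model the invert flag makes re2 match nothing: every character has a case variant, or is itself, in \P{Ll}. Taking the code point complement first makes re2 agree with re1. This illustrates only one of the options listed; it does not weigh in on which the proposal should adopt.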

