In the TC39 meeting today (2021-May-26) there was some discussion of how to match character classes that contain multi-character strings, inspired by the slide that showed these examples:
[\p{RGI_Emoji}--(🇧🇪)]
[a-zA-Z(ch)(mΜ)(γγ)(🇦🇺)(🇧🇪)(🇫🇷)] → [a-zA-Z(ch|mΜ|γγ|🇦🇺|🇧🇪|🇫🇷)]
The proposal is to match the longest strings first, so that a prefix string does not pre-empt matching a longer string. This needs to be done in the runtime semantics, after evaluating a set of strings (as a modified CharSet or as a StringSet, whichever way that goes).
In particular, we do not want to match strings in the order that they are written in the regular expression.
Reasons:
[\p{RGI_Emoji}--(🇧🇪)]
that we could preserve.
As for the longest match specifically, note that users may have no idea how many Unicode code points it takes to write a “character” like mΜ, γγ, π§πΏ, or 🇧🇪; they just want it to “work”. (I even had a discussion this week with a Slovak colleague who expected there to exist a single-code-point way to write "ch".)
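To illustrate the longest-first idea, here is a minimal sketch that emulates a set of strings with an ordinary alternation in today's regular expressions. The helper names `escapeRegExp` and `stringSetToPattern` are hypothetical and not part of any proposal, and measuring length in code points is an assumption made for this example. It only shows why ordering by length matters, not how an engine would implement it.

```javascript
// Hypothetical helper: escape regex syntax characters in a literal string.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: turn a set of strings into an alternation with
// longer strings first, so a prefix cannot pre-empt a longer match.
function stringSetToPattern(strings) {
  const sorted = [...new Set(strings)].sort(
    // Assumption: compare by number of code points, longest first.
    (a, b) => [...b].length - [...a].length
  );
  return sorted.map(escapeRegExp).join("|");
}

// Alternation in the order written: "c" wins before "ch" is ever tried.
const inWrittenOrder = new RegExp("^(?:c|ch)", "u");

// Alternation sorted longest first: "ch" is tried before "c".
const longestFirst = new RegExp(
  "^(?:" + stringSetToPattern(["c", "ch"]) + ")",
  "u"
);

console.log("chata".match(inWrittenOrder)[0]); // "c"
console.log("chata".match(longestFirst)[0]);   // "ch"
```

With the written order, the one-letter alternative matches first and the longer "ch" is never tried at that position. Sorting the evaluated set by length gives the same result no matter how the strings were listed in the pattern.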