ECMAScript 2015 introduces two new flags for regular expressions:
y enables ‘sticky’ matching.
u enables various Unicode-related features.
This article explains the effects of the u flag. It helps if you’ve read JavaScript has a Unicode problem first.
Setting the u flag on a regular expression enables the use of ES2015 Unicode code point escapes (\u{…}) in the pattern.
// Note: `a` is U+0061 LATIN SMALL LETTER A, a BMP symbol.
console.log(/\u{61}/u.test('a'));
// → true
// Note: `𝌆` is U+1D306 TETRAGRAM FOR CENTRE, an astral symbol.
console.log(/\u{1D306}/u.test('𝌆'));
// → true
Without the flag, things like \u{1234} can technically still occur in patterns, but they won’t be interpreted as Unicode code point escapes. /\u{1234}/ is equivalent to /u{1234}/, which matches 1234 consecutive u symbols rather than the symbol with code point U+1234.
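A quick check makes the difference visible. This is a minimal sketch: the variable names are illustrative, and the quantifier `{3}` is chosen only to keep the test string short.

```javascript
// Without the `u` flag, `\u{3}` is read as the letter `u` followed by the
// quantifier `{3}`, so it matches exactly three consecutive `u` symbols.
const legacyMatchesThreeUs = /^\u{3}$/.test('uuu');
console.log(legacyMatchesThreeUs);
// → true

// With the `u` flag, `\u{3}` is a code point escape for U+0003.
const unicodeMatchesU0003 = /^\u{3}$/u.test('\u0003');
console.log(unicodeMatchesU0003);
// → true

// With the `u` flag, the same pattern no longer matches three `u` symbols.
const unicodeMatchesThreeUs = /^\u{3}$/u.test('uuu');
console.log(unicodeMatchesThreeUs);
// → false
```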
Engines do this for compatibility reasons. But with the u flag set, this changes too: things like \a (where a is not an escape character) won’t be equivalent to a anymore. So even though /\a/ is treated as /a/, /\a/u throws an error, because \a is not a reserved escape sequence. This makes it possible to extend u regular expressions in a future version of ECMAScript. For example, /\p{Script=Greek}/u throws an exception per ES2015, but could become a regular expression that matches all symbols in the Greek script according to the Unicode database once syntax for Unicode property escapes is added to the spec.
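The stricter parsing can be observed by compiling the same pattern with and without the flag. The `compiles` helper below is hypothetical and exists only for this illustration. It uses the `RegExp` constructor so that the error can be caught at run time instead of failing when the script is parsed.

```javascript
// Hypothetical helper: returns `true` if the pattern compiles with the
// given flags, and `false` if the engine throws a `SyntaxError`.
function compiles(pattern, flags) {
  try {
    new RegExp(pattern, flags);
    return true;
  } catch (error) {
    if (error instanceof SyntaxError) {
      return false;
    }
    throw error;
  }
}

// Without the `u` flag, `\a` is silently treated as `a`.
const legacyCompiles = compiles('\\a', '');
console.log(legacyCompiles);
// → true
console.log(/\a/.test('a'));
// → true

// With the `u` flag, the unknown escape `\a` is rejected.
const unicodeCompiles = compiles('\\a', 'u');
console.log(unicodeCompiles);
// → false
```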
Impact on the . operator
Without the u flag, . matches any BMP symbol except line terminators. When the ES2015 u flag is set, . matches astral symbols too.
// Note: `𝌆` is U+1D306 TETRAGRAM FOR CENTRE, an astral symbol.
const string = 'a𝌆b';
console.log(/a.b/.test(string));
// → false
console.log(/a.b/u.test(string));
// → true
const match = string.match(/a(.)b/u);
console.log(match[1]);
// → '𝌆'
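One practical consequence, sketched here with illustrative variable names: a global match on . splits a string into code units without the flag, and into whole symbols with it.

```javascript
// Note: `𝌆` is an astral symbol, stored as two surrogate halves.
const text = 'a𝌆b';

// Without the `u` flag, each surrogate half is matched separately.
const codeUnits = text.match(/./g);
console.log(codeUnits.length);
// → 4

// With the `u` flag, the astral symbol is matched as a single symbol.
const symbols = text.match(/./gu);
console.log(symbols.length);
// → 3
console.log(symbols[1] === '𝌆');
// → true
```

Keep in mind that . still skips line terminators in both modes, so this only counts symbols in strings without line breaks.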
Impact on quantifiers
The available quantifiers in JavaScript regular expressions are *, +, ?, and {2}, {2,}, {2,4}, and variations of those. Without the u flag, if a quantifier follows an atom that consists of an astral symbol, it applies only to the low surrogate of that symbol.
// Note: `a` is a BMP symbol.
console.log(/a{2}/.test('aa'));
// → true
// Note: `𝌆` is an astral symbol.
console.log(/𝌆{2}/.test('𝌆𝌆'));
// → false
// Explanation: the previous example is equivalent to the following.
console.log(/\uD834\uDF06{2}/.test('\uD834\uDF06\uD834\uDF06'));
// → false
With the ES2015 u flag, quantifiers apply to whole symbols, even for astral symbols.
// Note: `a` is a BMP symbol.
console.log(/a{2}/u.test('aa'));
// → true
// Note: `𝌆` is an astral symbol.
console.log(/𝌆{2}/u.test('𝌆𝌆'));
// → true
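For comparison, code that cannot use the u flag has to group both surrogate halves explicitly so that the quantifier covers the pair. The sketch below shows that workaround next to the u version; the variable names are illustrative.

```javascript
// Note: `𝌆` is U+1D306, stored as the surrogate pair `\uD834\uDF06`.

// Without grouping, `{2}` applies to the low surrogate `\uDF06` only.
const withoutGroup = /^\uD834\uDF06{2}$/.test('𝌆𝌆');
console.log(withoutGroup);
// → false

// A non-capturing group makes the quantifier apply to both halves.
const withGroup = /^(?:\uD834\uDF06){2}$/.test('𝌆𝌆');
console.log(withGroup);
// → true

// With the `u` flag, no grouping is needed.
const withFlag = /^𝌆{2}$/u.test('𝌆𝌆');
console.log(withFlag);
// → true
```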
Impact on character classes
Without the u flag, any given character class can only match BMP symbols. Things like [bcd] work as expected:
const regex = /^[bcd]$/;
console.log(
regex.test('a'), // false
regex.test('b'), // true
regex.test('c'), // true
regex.test('d'), // true
regex.test('e') // false
);
However, when an astral symbol is used in a character class, the JavaScript engine treats it as two separate ‘characters’: one for each of its surrogate halves.
// Note: `𝌆` is an astral symbol.
const regex = /^[bc𝌆]$/;
console.log(
regex.test('a'), // false
regex.test('b'), // true
regex.test('c'), // true
regex.test('𝌆') // false
);
// Explanation: the regular expression is equivalent to the following.
// const regex = /^[bc\uD834\uDF06]$/;
The ES2015 u flag enables the use of whole astral symbols in character classes.
// Note: `𝌆` is an astral symbol.
const regex = /^[bc𝌆]$/u; // Or, `/^[bc\u{1D306}]$/u`.
console.log(
regex.test('a'), // false
regex.test('b'), // true
regex.test('c'), // true
regex.test('𝌆') // true
);
Consequently, whole astral symbols can also be used in character class ranges, and everything will work as expected as long as the u flag is set.
// Match any symbol from U+1F4A9 PILE OF POO to U+1F4AB DIZZY SYMBOL.
const regex = /[💩-💫]/u; // Or, `/[\u{1F4A9}-\u{1F4AB}]/u`.
console.log(
regex.test('💨'), // false
regex.test('💩'), // true
regex.test('💪'), // true
regex.test('💫'), // true
regex.test('💬') // false
);
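Leaving out the flag here does more than change the match results. Without u, the range is read in terms of surrogate halves, `\uDCA9-\uD83D`, and because its start is greater than its end the engine rejects the pattern. A minimal sketch, using the `RegExp` constructor so the error can be caught:

```javascript
// Without the `u` flag, the class is parsed as
// `[\uD83D\uDCA9-\uD83D\uDCAB]`, which contains an out-of-order range.
let errorName = null;
try {
  new RegExp('[💩-💫]');
} catch (error) {
  errorName = error.name;
}
console.log(errorName);
// → 'SyntaxError'

// With the `u` flag, the same source compiles and behaves as intended.
const unicodeRange = new RegExp('[💩-💫]', 'u');
console.log(unicodeRange.test('💪'));
// → true
```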
The u flag also affects negated character classes. For example, /[^a]/ is equivalent to /[\0-\x60\x62-\uFFFF]/, which would match any BMP symbol except a. But with the u flag, /[^a]/u matches the much bigger set of all Unicode symbols except a.
const regex = /^[^a]$/u;
console.log(
regex.test('a'), // false
regex.test('b'), // true
regex.test('☃'), // true
regex.test('𝌆'), // true
regex.test('💩') // true
);
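Running the same anchored pattern without the flag shows the contrast. This is a small sketch with illustrative names.

```javascript
const legacyRegex = /^[^a]$/;
const unicodeRegex = /^[^a]$/u;

// BMP symbols behave the same in both modes.
const legacySnowman = legacyRegex.test('☃');
const unicodeSnowman = unicodeRegex.test('☃');
console.log(legacySnowman, unicodeSnowman);
// → true true

// An astral symbol is two code units, so the legacy class cannot match
// it as a single symbol.
const legacyAstral = legacyRegex.test('𝌆');
const unicodeAstral = unicodeRegex.test('𝌆');
console.log(legacyAstral, unicodeAstral);
// → false true
```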
Impact on character class escapes
The u flag affects the meaning of the character class escapes \D, \S, and \W. Without the u flag, \D, \S, and \W match any BMP symbols that are not matched by \d, \s, and \w, respectively.
const regex = /^\S$/;
console.log(
regex.test(' '), // false
regex.test('a'), // true
// Note: `𝌆` is an astral symbol.
regex.test('𝌆') // false
);
With the u flag, \D, \S, and \W match astral symbols too.
const regex = /^\S$/u;
console.log(
regex.test(' '), // false
regex.test('a'), // true
// Note: `𝌆` is an astral symbol.
regex.test('𝌆') // true
);
Their inverse counterparts \d, \s, and \w are not affected by the u flag. There was a proposal to make \d and \w (and \b) more Unicode-aware, but it was rejected.
Impact on the i flag
When both the i and u flags are set, all symbols are implicitly case-folded using the simple mapping provided by the Unicode standard immediately before they are compared.
const es5regex = /[a-z]/i;
const es6regex = /[a-z]/iu;
console.log(
es5regex.test('s'), es6regex.test('s'), // true true
es5regex.test('S'), es6regex.test('S'), // true true
// Note: U+017F canonicalizes to `S`.
es5regex.test('\u017F'), es6regex.test('\u017F'), // false true
// Note: U+212A canonicalizes to `K`.
es5regex.test('\u212A'), es6regex.test('\u212A') // false true
);
The case folding applies to the symbols in the regular expression pattern as well as the symbols in the string to be matched.
console.log(
/\u212A/iu.test('K'), // true
/\u212A/iu.test('k'), // true
/\u017F/iu.test('S'), // true
/\u017F/iu.test('s') // true
);
This case-folding logic applies to the \w and \W character escapes as well, which also affects \b and \B. /\w/iu matches [0-9A-Z_a-z] but also U+017F because U+017F canonicalizes to S which is in the match set. The same goes for U+212A and K.
console.log(
/\w/iu.test('\u017F'), // true
/\w/iu.test('\u212A'), // true
/\W/iu.test('\u017F'), // false
/\W/iu.test('\u212A'), // false
/\W/iu.test('s'), // false
/\W/iu.test('S'), // false
/\W/iu.test('K'), // false
/\W/iu.test('k'), // false
/\b/iu.test('\u017F'), // true
/\b/iu.test('\u212A'), // true
/\b/iu.test('s'), // true
/\b/iu.test('S'), // true
/\B/iu.test('\u017F'), // false
/\B/iu.test('\u212A'), // false
/\B/iu.test('s'), // false
/\B/iu.test('S'), // false
/\B/iu.test('K'), // false
/\B/iu.test('k') // false
);
Note: An annoying result of this case-folding logic is that, per the original ES2015 spec, /\w/iu was no longer the inverse of /\W/iu. Remember how /\w/iu matches [0-9A-Z_a-z] but also U+017F and U+212A? This makes sense. However, in ES2015, /\W/iu also matched U+017F, and strangely, S, because \W includes U+017F which matches either the U+017F symbol itself or its canonicalized version S. The same applied for U+212A and K. In other words, /\W/iu was equivalent to /[^0-9a-jl-rt-zA-JL-RT-Z_]/u. 😕 This was rectified in June 2016. Now, /\W/iu doesn’t match S, K, U+017F, or U+212A anymore, making /\W/iu the inverse of /\w/iu again. /\W/iu is now equivalent to /[^0-9a-zA-Z_\u{017F}\u{212A}]/u. Whew.
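Whether a given engine implements the rectified behavior can be checked by confirming that, for each of the affected symbols, exactly one of \w and \W matches. This is a rough sketch; the sample list is illustrative and not exhaustive.

```javascript
// Symbols affected by the case-folding issue, plus two plain non-word
// symbols as a sanity check.
const samples = ['S', 's', 'K', 'k', '\u017F', '\u212A', '-', ' '];

// For each symbol, exactly one of `\w` and `\W` should match.
const isComplementary = samples.every((symbol) => {
  const isWord = /^\w$/iu.test(symbol);
  const isNonWord = /^\W$/iu.test(symbol);
  return isWord !== isNonWord;
});

console.log(isComplementary);
// → true in engines that implement the June 2016 fix
```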
Believe it or not, the existence of the u flag has its effect on HTML documents as well.
The pattern attribute for input elements allows you to specify a regular expression to validate the user’s input against. The browser then provides you with styling and scripting hooks to make stuff happen based on the input’s validity.
<style>
:invalid { background: red; }
:valid { background: green; }
</style>
<input pattern="a.b" value="aXXb"><!-- gets a red background -->
<input pattern="a.b" value="a𝌆b"><!-- gets a green background -->
The u flag is always enabled for regular expressions compiled through the HTML pattern attribute. Here’s a demo / test case.
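In script terms, the validation above behaves roughly like the following sketch. `matchesPatternAttribute` is a hypothetical helper, not a browser API. It assumes that the pattern must match the entire value, which is why the source is wrapped in `^(?:…)$`, and that the expression is compiled with the u flag as described above.

```javascript
// Hypothetical approximation of how a `pattern` attribute value is
// checked against an input's value. Not an actual browser API.
function matchesPatternAttribute(pattern, value) {
  // Assumption: the whole value must match, and the `u` flag is set.
  const regex = new RegExp('^(?:' + pattern + ')$', 'u');
  return regex.test(value);
}

console.log(matchesPatternAttribute('a.b', 'aXXb'));
// → false (the input would be `:invalid`)
console.log(matchesPatternAttribute('a.b', 'a𝌆b'));
// → true (the input would be `:valid`)
```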
At the moment, the ES2015 u flag for regular expressions is available in stable releases of every major browser. Browsers are slowly starting to enable this functionality for the HTML pattern attribute.
Use the u flag for every regular expression you write from now on.
Don’t blindly add the u flag to existing regular expressions, as it might change their meaning in subtle ways.
Avoid combining the u and i flags. It’s better to be explicit and include all letter cases in your regular expression itself than to be surprised by implicit case folding.
I created regexpu, a transpiler that rewrites ES6 Unicode regular expressions into equivalent ES5 code that works today. This enables you to play around with these upcoming new features. Try it out now!
Full-blown ES6/ES7 transpilers like Traceur and Babel depend on regexpu for their u transpilation. Let me know if you manage to break it.
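To give a feel for what such a rewrite involves, here is a hand-written ES5-compatible approximation of /^a.b$/u. It is not regexpu’s actual output, only a sketch of the general idea: the astral case is spelled out as a surrogate pair, followed by a fallback for any other symbol that is not a line terminator.

```javascript
// Hand-written approximation (an assumption, not generated by regexpu).
// A high surrogate followed by a low surrogate counts as one symbol;
// otherwise any single code unit except a line terminator is accepted.
const es5Equivalent = /^a(?:[\uD800-\uDBFF][\uDC00-\uDFFF]|[^\n\r\u2028\u2029])b$/;
const es2015Original = /^a.b$/u;

// Each result pairs the string with both outcomes.
const results = ['a𝌆b', 'aXb', 'aXXb', 'a\nb'].map((input) => [
  input,
  es5Equivalent.test(input),
  es2015Original.test(input)
]);

console.log(results.every((entry) => entry[1] === entry[2]));
// → true
```

A real transpiler has to handle far more than this one case, such as character classes, ranges, and case folding, which is why using an existing tool is preferable to rewriting patterns by hand.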