The lexer fails on input strings containing a Unicode escape sequence such as 'Fran\u00E7ois', reporting a token recognition error. Wrapping the input stream in a CaseInsensitiveInputStream makes it work, though.
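For reference, the usual ANTLR case-folding wrapper (the CaseChangingCharStream pattern from the ANTLR docs) lowercases every code point the lexer looks at, which would also fold the 'E' in '\u00E7' to 'e'. A minimal sketch of that pattern, assuming CaseInsensitiveInputStream works the same way (the class name below is hypothetical):

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.misc.Interval;

// Minimal case-folding wrapper, after the CaseChangingCharStream pattern
// from the ANTLR docs; assumed (not verified) to match what
// CaseInsensitiveInputStream does in this project.
public class LowerCasingCharStream implements CharStream {
    private final CharStream inner;

    public LowerCasingCharStream(CharStream inner) { this.inner = inner; }

    @Override
    public int LA(int i) {
        int c = inner.LA(i);
        // Fold only what the lexer *sees*; getText() below still returns
        // the original text, so token values keep their original case.
        return c <= 0 ? c : Character.toLowerCase(c);
    }

    // Everything else delegates to the wrapped stream.
    @Override public String getText(Interval interval) { return inner.getText(interval); }
    @Override public void consume() { inner.consume(); }
    @Override public int mark() { return inner.mark(); }
    @Override public void release(int marker) { inner.release(marker); }
    @Override public int index() { return inner.index(); }
    @Override public void seek(int index) { inner.seek(index); }
    @Override public int size() { return inner.size(); }
    @Override public String getSourceName() { return inner.getSourceName(); }
}

If the grammar's hex-digit rule only matches lowercase [0-9a-f] (an assumption, I have not checked the grammar), that folding would explain why the escape only lexes through the wrapper.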
Here is a unit test demo:
import org.antlr.v4.runtime.BaseErrorListener;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

// plus ApexLexer and CaseInsensitiveInputStream from this project

@Test
void testLexerUnicodeEscapes() {
    String s = "'Fran\\u00E7ois'"; // the lexed source text is 'Fran\u00E7ois'

    // Using a plain CodePointCharStream fails
    IllegalStateException exc = assertThrows(IllegalStateException.class,
            () -> tryLexing(CharStreams.fromString(s)));
    assertEquals("Syntax error on line 1:0: token recognition error at: ''Fran\\u00E'.",
            exc.getMessage());

    // Wrapping it in a CaseInsensitiveInputStream makes it work. Why?
    CommonTokenStream tokens =
            tryLexing(new CaseInsensitiveInputStream(CharStreams.fromString(s)));
    assertEquals(2, tokens.size()); // the string literal token plus EOF
}

private CommonTokenStream tryLexing(CharStream stream) {
    ApexLexer lexer = new ApexLexer(stream);
    lexer.removeErrorListeners(); // avoid distracting "token recognition error" stderr output
    lexer.addErrorListener(new BaseErrorListener() {
        @Override
        public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                int line, int charPositionInLine, String msg, RecognitionException e) {
            throw new IllegalStateException(String.format(
                    "Syntax error on line %d:%d: %s.", line, charPositionInLine, msg));
        }
    });
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    tokens.fill();
    return tokens;
}
Is this by design or a bug? The Apex language is case-insensitive, but that shouldn't affect the contents of string literals.
Notes: lexing via CommonTokenStream works correctly for literal non-ASCII Unicode characters like 'François'; only the escape-sequence form fails.
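For comparison, a minimal sketch of that passing case, reusing the tryLexing helper above (the test name is hypothetical):

@Test
void testLexerLiteralNonAscii() {
    // A literal ç rather than a \uXXXX escape: per the note above, this
    // lexes fine even on a plain CodePointCharStream.
    CommonTokenStream tokens = tryLexing(CharStreams.fromString("'François'"));
    assertEquals(2, tokens.size()); // string literal token plus EOF
}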