The lexer fails on input strings containing a Unicode escape sequence such as 'Fran\u00E7ois', reporting a token recognition error. Wrapping the input stream in a CaseInsensitiveInputStream makes it work, though.
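For reference, the usual ANTLR case-folding wrapper (the CaseChangingCharStream pattern from the ANTLR docs) lowercases every code point the lexer looks at, which would also fold the 'E' in '\u00E7' to 'e'. A minimal sketch of that pattern, assuming CaseInsensitiveInputStream works the same way (the class name below is hypothetical):

import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.misc.Interval;

// Minimal case-folding wrapper, after the CaseChangingCharStream pattern
// from the ANTLR docs; assumed (not verified) to match what
// CaseInsensitiveInputStream does in this project.
public class LowerCasingCharStream implements CharStream {
    private final CharStream inner;

    public LowerCasingCharStream(CharStream inner) { this.inner = inner; }

    @Override
    public int LA(int i) {
        int c = inner.LA(i);
        // Fold only what the lexer *sees*; getText() below still returns
        // the original text, so token values keep their original case.
        return c <= 0 ? c : Character.toLowerCase(c);
    }

    // Everything else delegates to the wrapped stream.
    @Override public String getText(Interval interval) { return inner.getText(interval); }
    @Override public void consume() { inner.consume(); }
    @Override public int mark() { return inner.mark(); }
    @Override public void release(int marker) { inner.release(marker); }
    @Override public int index() { return inner.index(); }
    @Override public void seek(int index) { inner.seek(index); }
    @Override public int size() { return inner.size(); }
    @Override public String getSourceName() { return inner.getSourceName(); }
}

If the grammar's hex-digit rule only matches lowercase [0-9a-f] (an assumption, I have not checked the grammar), that folding would explain why the escape only lexes through the wrapper.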
Here is a unit test demo:
import org.antlr.v4.runtime.BaseErrorListener;
import org.antlr.v4.runtime.CharStream;
import org.antlr.v4.runtime.CharStreams;
import org.antlr.v4.runtime.CommonTokenStream;
import org.antlr.v4.runtime.RecognitionException;
import org.antlr.v4.runtime.Recognizer;
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertThrows;

// plus ApexLexer and CaseInsensitiveInputStream from this project

@Test
void testLexerUnicodeEscapes() {
    String s = "'Fran\\u00E7ois'"; // the lexed source text is 'Fran\u00E7ois'

    // Using a plain CodePointCharStream fails
    IllegalStateException exc = assertThrows(IllegalStateException.class,
            () -> tryLexing(CharStreams.fromString(s)));
    assertEquals("Syntax error on line 1:0: token recognition error at: ''Fran\\u00E'.",
            exc.getMessage());

    // Wrapping it in a CaseInsensitiveInputStream makes it work. Why?
    CommonTokenStream tokens =
            tryLexing(new CaseInsensitiveInputStream(CharStreams.fromString(s)));
    assertEquals(2, tokens.size()); // the string literal token plus EOF
}

private CommonTokenStream tryLexing(CharStream stream) {
    ApexLexer lexer = new ApexLexer(stream);
    lexer.removeErrorListeners(); // avoid distracting "token recognition error" stderr output
    lexer.addErrorListener(new BaseErrorListener() {
        @Override
        public void syntaxError(Recognizer<?, ?> recognizer, Object offendingSymbol,
                int line, int charPositionInLine, String msg, RecognitionException e) {
            throw new IllegalStateException(String.format(
                    "Syntax error on line %d:%d: %s.", line, charPositionInLine, msg));
        }
    });
    CommonTokenStream tokens = new CommonTokenStream(lexer);
    tokens.fill();
    return tokens;
}
Is this by design or a bug? The Apex language is case-insensitive, but that shouldn't affect the contents of string literals.
Notes: lexing via CommonTokenStream works correctly for literal non-ASCII Unicode characters like 'François'; only the escape-sequence form fails.
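For comparison, a minimal sketch of that passing case, reusing the tryLexing helper above (the test name is hypothetical):

@Test
void testLexerLiteralNonAscii() {
    // A literal ç rather than a \uXXXX escape: per the note above, this
    // lexes fine even on a plain CodePointCharStream.
    CommonTokenStream tokens = tryLexing(CharStreams.fromString("'François'"));
    assertEquals(2, tokens.size()); // string literal token plus EOF
}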