On 5/17/2018 3:01 PM, Larry Hastings wrote:
>
> I fed this into tokenize.tokenize():
>
> b''' x = "\u1234" '''
>
> I was a bit surprised to see \Uxxxx in the output. Particularly because
> the output (t.string) was a *string* and not *bytes*.

For those (like me) who have no idea how to use tokenize.tokenize's wacky
interface, the test code is:

list(tokenize.tokenize(io.BytesIO(b''' x = "\u1234" ''').readline))

> Maybe I'm making a parade of my ignorance, but I assumed that string
> literals were parsed by the parser--just like everything else is parsed
> by the parser, hey it seems like a good place for it--and in particular
> that the escape sequence substitutions would be done in the tokenizer.
> Having stared at it a little, I now detect a whiff of "this design
> solved a real problem". So... what was the problem, and how does this
> design solve it?

I assume the intent is to not throw away any information in the lexer, and
give the parser full access to the original string. But that's just a guess.

> BTW, my use case is that I hoped to use CPython's tokenizer to parse
> some Python-ish-looking text and handle double-quoted strings for me.
> *Especially* all the escape sequences--leveraging all CPython's support
> for funny things like \U{penguin}. The current behavior of the
> tokenizer makes me think it'd be easier to roll my own!

Can you feed the token text to the ast?

>>> ast.literal_eval('"\u1234"')
'ሴ'

Eric
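A minimal sketch of the approach suggested above, combining tokenize with
ast.literal_eval so that string tokens come back with their escape sequences
applied. This assumes Python 3; the helper name decode_string_tokens is only
illustrative, and the double backslash in the sample bytes stands in for the
literal backslash-u sequence from the original test input:

import ast
import io
import tokenize

def decode_string_tokens(source_bytes):
    # tokenize.tokenize() wants a readline callable over bytes,
    # hence the io.BytesIO wrapper.
    tokens = tokenize.tokenize(io.BytesIO(source_bytes).readline)
    for tok in tokens:
        if tok.type == tokenize.STRING:
            # tok.string is a str that still contains the raw escape
            # sequences; ast.literal_eval evaluates the literal and
            # applies them.
            yield tok.string, ast.literal_eval(tok.string)

# Roughly the test input from the thread (escape left unprocessed in the bytes).
for raw, decoded in decode_string_tokens(b''' x = "\\u1234" '''):
    print(raw, decoded)   # prints: "\u1234" ሴ

Whether this is good enough depends on how far the Python-ish text strays
from real Python literals, since ast.literal_eval only accepts valid
Python syntax.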