"Stephen J. Turnbull" wrote:
>
> ...
>
> I don't see any need for a deviation of the implementation from the
> spec. Just slurp in the whole file in the specified encoding.

That's phase 2. It's harder to implement, so it won't be in Python 2.3.
They are trying to get away with changing the *output* of the
lexer/parser rather than the *input*, because the lexer/parser code
probably predates Unicode and certainly predates Guido's thinking about
internationalization issues. We're moving in baby steps.

> ... Then
> cast the Unicode characters in ordinary literal strings down to
> bytesize (my preference, probably with errors on Latin-1 <0.5 wink>) or
> reencode them (Guido's and your suggestion). People who don't like
> the results in their non-Unicode literal strings (probably few) should
> use hex escapes. Sure, you'll have to rewrite the parser in terms of
> UTF-16. But I thought that was where you were going anyway.

Sure, but a partial implementation now is better than a perfect
implementation at some unspecified time in the future.

> If not, it should be nearly trivial to rewrite the parser in terms of
> UTF-8 (since it is a superset of ASCII and non-ASCII is currently only
> allowed in comments or guarded by a (Unicode)? string literal AFAIK).
> The main issue would be anything that involves counting characters
> (not bytes!), I think.

That would be an issue. Plus it would be the first place that the Python
interpreter used UTF-8 as an internal representation. So it would also
be a half-step, and it might involve more redoing later.

Paul Prescod
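[Editor's note: the characters-versus-bytes concern above, and the phase-2 idea of decoding the whole file in its declared encoding before tokenizing, can be sketched as follows. This is a minimal illustration in modern Python, not the actual tokenizer code; the `read_source` helper and the sample literal are hypothetical.]

```python
# In UTF-8, non-ASCII characters occupy multiple bytes, so byte
# offsets and character (code point) offsets diverge -- the problem
# a UTF-8-based parser would have to handle everywhere it counts.

text = "naïve"                  # 5 characters
utf8 = text.encode("utf-8")     # "ï" encodes to two bytes

assert len(text) == 5           # character count
assert len(utf8) == 6           # byte count

# Phase 2 of the proposal, sketched: decode the entire source file
# in the declared encoding up front, then hand Unicode text to the
# tokenizer, so all counting happens in characters.
def read_source(raw: bytes, declared_encoding: str) -> str:
    return raw.decode(declared_encoding)

source = read_source(b"s = 'na\xc3\xafve'\n", "utf-8")
assert "naïve" in source
```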