"M.-A. Lemburg" wrote: > >... > > Hmm, I guess you have something like this in mind... > > 1. read the file > 2. decode it into Unicode assuming some fixed per-file encoding > 3. tokenize the Unicode content > 4. compile it, Right. This is how XML, Java, Perl etc. work. XML and Python would be the only languages to actually declare the encoding in use (in ASCII). I think that the declaration way is clearly superior to depending on command line arguments or BOMs. But this is just how it has to *look* to the user. If there is an implementation that behind the scenes only decodes Unicode literals, that would be fine. > ... creating Unicode objects from the given Unicode data > and creating string objects from the Unicode literal data > by first reencoding the Unicode data into 8-bit string data Or we could just disallow non-ASCII 8-bit strings literals in files that use the declaration. That was never a feature Guido really intended to support (as I understand it!) and I don't see a need to carry it forward. If you are in the Unicode universe then the need to put binary data in 8-bit string literals is massively reduced. > To make this backwards compatible, the implementation would have to > assume Latin-1 as the original file encoding if not given (otherwise, > binary data currently stored in 8-bit strings wouldn't make the > roundtrip). Another way to think about it is that files without the declaration skip directly to the tokenize step and skip the decoding step. -- Take a recipe. Leave a recipe. Python Cookbook! http://www.ActiveState.com/pythoncookbook