Roman Suzi wrote:
>
> On Tue, 17 Jul 2001, M.-A. Lemburg wrote:
>
> > After having been through two rounds of comments with the "Unicode
> > Literal Encoding" pre-PEP, it has turned out that people actually
> > prefer to go for the full Monty meaning that the PEP should handle
> > the complete Python source code encoding and not just the encoding
> > of the Unicode literals (which are currently the only parts in a
> > Python source code file for which Python assumes a fixed encoding).
> >
> > Here's a summary of what I've learned from the comments:
> >
> > 1. The complete Python source file should use a single encoding.
>
> Yes, certainly
>
> > 2. Handling of escape sequences should continue to work as it does
> >    now, but with all possible source code encodings, that is
> >    standard string literals (both 8-bit and Unicode) are subject to
> >    escape sequence expansion while raw string literals only expand
> >    a very small subset of escape sequences.
> >
> > 3. Python's tokenizer/compiler combo will need to be updated to
> >    work as follows:
> >
> >    1. read the file
> >    2. decode it into Unicode assuming a fixed per-file encoding
> >    3. tokenize the Unicode content
> >    4. compile it, creating Unicode objects from the given Unicode data
> >       and creating string objects from the Unicode literal data
> >       by first reencoding the Unicode data into 8-bit string data
> >       using the given file encoding
>
> I think, that if encoding is not given, it must silently assume "UNKNOWN"
> encoding and do nothing, that is be 8-bit clean (as it is now).

To be 8-bit clean it will have to use Latin-1 as fallback encoding
since this encoding assures the roundtrip safety (decode to Unicode,
then reencode).

> Otherwise, it will slow down parser considerably.

Yes, that could be an issue (I don't think it matters much though,
since parsing is usually only done during byte-code compilation and
the results are buffered in .pyc files).
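
The round-trip property claimed for Latin-1 above can be checked with a
small sketch. It is written in present-day Python 3 syntax (an assumption;
it is not the 2001 interpreter under discussion), and the names `raw`,
`roundtrip` and `utf8_ok` are made up for illustration only:

```python
# Every possible byte value, standing in for arbitrary binary data
# stored in an 8-bit string literal.
raw = bytes(range(256))

# Latin-1 maps each byte 0x00-0xFF to the code point with the same
# number, so decoding cannot fail and re-encoding restores the bytes.
roundtrip = raw.decode("latin-1").encode("latin-1")
assert roundtrip == raw

# For contrast: UTF-8 would not work as a silent fallback, because
# arbitrary byte sequences are not necessarily valid UTF-8.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
```

This only illustrates the decode/re-encode step; it says nothing about how
the tokenizer itself would be changed.
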

> I also think that if encoding is chosen, there is no need to reencode it
> back to literal strings: let them be in Unicode.

That would be nice, but is not feasible at the moment (just try
to run Python with -U option and see what happens...).

> Or the encoding must _always_ be ASCII+something, as utf-8 for example.
> Eliminating the need to bother with tokenizer (Because only docstrings,
> comments and string-literals are entities which require encoding /
> decoding).
>
> If I understood correctly, Python will soon switch to "unicode-only"
> strings, as Java and Tcl did. (This is of course disaster for some Python
> usage areas such as fast text-processing, but...)
>
> Or am I missing something?

It won't switch any time soon... there's still too much work ahead
and I'm also pretty sure that the 8-bit string type won't go away
for backward compatibility reasons.

> > To make this backwards compatible, the implementation would have to
> > assume Latin-1 as the original file encoding if not given (otherwise,
> > binary data currently stored in 8-bit strings wouldn't make the
> > roundtrip).
>
> ...as I said, there must be no assumed charset. Things must
> be left as is now when no explicit encoding given.

This is what the Latin-1 encoding assures.

> > 4. The encoding used in a Python source file should be easily
> >    parseable for an editor; a magic comment at the top of the
> >    file seems to be what people want to see, so I'll drop the
> >    directive (PEP 244) requirement in the PEP.
> >
> > Issues that still need to be resolved:
> >
> > - how to enable embedding of differently encoded data in Python
> >   source code (e.g. UTF-8 encoded XML data in a Latin-1
> >   source file)
>
> Probably, adding explicit conversions.

Yes, but there are cases where the source file having the embedded
data will not decode into Unicode (I got the example backwards:
think of a UTF-8 encoded source file with a Latin-1 string literal).
Perhaps we should simply rule out this case and have the
programmer stick to the source file encoding + some escaping
or a run-time recoding of the literal data into the preferred
encoding.

> > - what to do with non-literal data in the source file, e.g.
> >   variable names and comments:
> >
> >   * reencode them just as would be done for literals
> >   * only allow ASCII for certain elements like variable names
> >     etc.
>
> I think non-literal data must be in ASCII.
> But it could be too cheesy to have variable names in national
> alphabet ;-)

That's for Guido to decide...

> > - which format to use for the magic comment, e.g.
> >
> >   * Emacs style:
> >
> >     #!/usr/bin/python
> >     # -*- encoding = 'utf-8' -*-
> >
> >   * Via meta-option to the interpreter:
> >
> >     #!/usr/bin/python --encoding=utf-8
> >
> >   * Using a special comment format:
> >
> >     #!/usr/bin/python
> >     #!encoding = 'utf-8'
>
> No variant is ideal. The 2nd is worse/best than all
> (it depends on how to look at it!)
>
> Python has no macro directives. In this situation
> they could help greatly!

We've been discussing these on python-dev, but Guido is not
too keen on having them.

> That "#!encoding" is special case of macro directive.
>
> Maybe just put something like ''# <!DOCTYPE HTML PUBLIC''
> at the beginning...
>
> Or, even greater idea occurred to me: allow some XML
> with meta-information (not only encoding) somehow escaped.
>
> I think, GvR could come with some advice here...
>
> > Comments are welcome !

Thanks for your comments,
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/