"Jason Orendorff" <jason@jorendorff.com> writes:

> I gather that "coding:" is supposed to specify the
> encoding (what MIME calls "charset") of the file.
> But under PEP 263, it only refers to the Unicode string
> literals within the program. Everything else must still
> be treated as 8-bit text.

Not really. If you are willing to separate the language and its
implementation, then I'd phrase the intent that way:

- if an encoding is declared, all of the file must follow that
  encoding (all of them, always (*))
- in phase 1, the implementation will not verify that property,
  except for Unicode literals
- in phase 2, Python will implement Python completely in this
  respect.

> For example, I'm not sure what effect "coding: utf-16"
> would have. (?)

Invalid; source encodings must be an ASCII superset (not sure how the
implementation will react to that; if the file really is UTF-16,
you'll get a syntax error, if you say it is UTF-16 but it isn't,
Python will reject it in phase 2).

> For another example, if you have UTF-8 Unicode string
> literals in your program but you also have 8-bit
> Latin-1 plain str string literals in the same program,
> how should you mark it?

You should mark the file as UTF-8. In phase 2, Python will reject
it. At that point, you should convert your latin-1 string literal into
hex escapes - it is binary data then, not Latin-1.

> How will Emacs then treat it?

Don't know - just try. You cannot create such a file with Emacs.

> Is a Python program an 8-bit string or a Unicode string?

From the viewpoint of the language definition, it is a character
string. Quoting the C++ standard "how source files are mapped to the
source character set is implementation-defined". Python (the language
definition) actually does define it, by means of PEP 263 (**).
The source character set is Unicode, which does not necessarily mean
implementations have to represent source as Unicode strings
internally - they could also use the on-disk representation, as long
as the implementation behaves "as-if" it did perform the mapping to
Unicode.

> Right now, although perhaps someone who knows more about
> the parser than I can expand on this, it seems that
> Python programs are 8-bit strings.

That's correct, although the language definition explicitly says that
usage of bytes above 128 is undefined. So Python programs, from the
point of view of the language definition, are ASCII strings.

> Therefore I argue that it makes no sense to use "coding:" to label a
> Python file, because the file doesn't consist of Unicode text.

You need to distinguish between the file on disk, and the text
processed by the parser (something that the current parser doesn't do,
except for line endings). This PEP proposes to change how this is
currently done. If there was no change, it would not be a "Python
Enhancement Proposal".

Regards,
Martin

(*) If no encoding is declared, they must follow the system encoding.

(**) The list of accepted source encodings remains
implementation-defined; each Python release should spell out its list
of supported encodings.
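
To illustrate the "all of the file must follow that encoding" rule and
the hex-escape advice above, here is a minimal sketch in present-day
Python 3. It is not the PEP's implementation: it only approximates the
phase 2 check by decoding the whole file with the declared encoding.
The helper names declared_encoding and follows_declared_encoding are
made up for this example; tokenize.detect_encoding is the standard
library routine that reads the "coding:" declaration.

```python
import io
import tokenize


def declared_encoding(raw):
    """Return the encoding named by the file's coding declaration.

    tokenize.detect_encoding falls back to 'utf-8' if none is declared.
    """
    encoding, _first_lines = tokenize.detect_encoding(io.BytesIO(raw).readline)
    return encoding


def follows_declared_encoding(raw):
    """True if every byte of the file decodes under the declared encoding."""
    try:
        raw.decode(declared_encoding(raw))
    except UnicodeDecodeError:
        return False
    return True


# A UTF-8 Unicode literal plus a raw Latin-1 byte (0xE9) in a plain literal.
mixed = b'# -*- coding: utf-8 -*-\nu = u"\xc3\xa9"\ns = "\xe9"\n'

# The same file with the Latin-1 byte rewritten as a hex escape (pure ASCII).
fixed = b'# -*- coding: utf-8 -*-\nu = u"\xc3\xa9"\ns = "\\xe9"\n'

print(declared_encoding(mixed), follows_declared_encoding(mixed))
print(declared_encoding(fixed), follows_declared_encoding(fixed))
```

The first file mixes two encodings and fails the check; the second
passes because the hex escape keeps the source itself within the
declared encoding, leaving the byte value to the escape.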