"Martin v. Loewis" wrote: > > Guido van Rossum <guido@python.org> writes: > > > > This makes Latin-1 the right choice: > > > > > > * Unicode literals already use it today > > > > But they shouldn't, IMO. > > I agree. I recommend to deprecate this feature, and raise a > DeprecationWarning if a Unicode literal contains non-ASCII characters > but no encoding has been declared. > > > Sorry, I don't understand what you're trying to say here. Can you > > explain this with an example? Why can't we require any program > > encoded in more than pure ASCII to have an encoding magic comment? I > > guess I don't understand why you mean by "raw binary". > > With the proposed implementation, the encoding declaration is only > used for Unicode literals. In all other places where non-ASCII > characters can occur (comments, string literals), those characters are > treated as "bytes", i.e. it is not verified that these bytes are > meaningful under the declared encoding. > > Marc's original proposal was to apply the declared encoding to the > complete source code, but I objected claiming that it would make the > tokenizer changes more complex, and the resulting tokenizer likely > significantly slower (atleast if you use the codecs API to perform the > decoding). I don't think that the codecs will significantly slow down overall compilation -- the compiler is not fast to begin with. However, changing the bsae type in the tokenizer and compiler from char* to Py_UNICODE* will be a significant effort and that's why I added two phases to the implementation. The first phase will only touch Unicode literals as proposed by Martin. > In phase 2, the encoding will apply to all strings. So it will not be > possible to put arbitrary byte sequences in a string literal, atleast > if the encoding disallows certain byte sequences (like UTF-8, or > ASCII). Since this is currently possible, we have a backwards > compatibility problem. Right and I believe that a lot of people in European countries write strings literals with a Latin-1 encoding in mind. We cannot simply break all that code. The other problem is with comments found in Python source code. In phase 2 these will break as well. So how about this: In phase 1, the tokenizer checks the *complete file* for non-ASCII characters and outputs single warning per file if it doesn't find a coding declaration at the top. Unicode literals continue to use [raw-]unicode-escape as codec. In phase 2, we enforce ASCII as default encoding, i.e. the warning will turn into an error. The [raw-]unicode-escape codec will be extended to also support converting Unicode to Unicode, that is, only handle escape sequences in this case. -- Marc-Andre Lemburg CEO eGenix.com Software GmbH ______________________________________________________________________ Company & Consulting: http://www.egenix.com/ Python Software: http://www.egenix.com/files/python/