"Stephen J. Turnbull" <stephen@xemacs.org> writes:

>     mal> I have reworded the phase 1 implementation as follows:
>
>     mal>   1. Implement the magic comment detection, but only apply
>     mal>      the detected encoding to Unicode literals in the source file.
>
> a.  Does this really make sense for UTF-16?  It looks to me like a
>     great way to induce bugs of the form "write a unicode literal
>     containing 0x0A, then translate it to raw form by stripping the u
>     prefix."

I'm not sure I understand the question. UTF-16 is not supported as a
source encoding, so no, it does not make sense for it to be applied to
Unicode literals.

> b.  No editor is likely to implement correct display to distinguish
>     between u"" and just "".

The declared encoding applies to the entire file. In phase 1, Python
does not use that for anything but Unicode literals, though.

Even in phase 2, non-ASCII will be only allowed in comments and string
literals. Comments are ignored by the Python lexer (except for
encoding/tab declarations). For string literals, the meaning of the
literal does not change even if the encoding is considered: the string
literal continues to denote the same sequence of bytes.

The only differences in phase two will be those:
- if there is an encoding violation inside a comment or a string
  literal, Python will reject the source code (simply because decoding
  fails)
- if the declared encoding uses \ or " as the second bytes of a
  multi-byte encoding, Python will correctly parse the string. In
  phase 1, it may fail to correctly determine the end of the string.

> c.  This definitely breaks Emacs coding cookie semantics.  Emacs
>     applies the coding cookie to the whole buffer.

So does Python. It just side-steps part of the code conversions in
phase 1.

> d.  You probably have to deprecate ISO 2022 7-bit coding systems, too,
>     because people will try to get the representation of a string by
>     inputting a raw string in coded form.  This might contain a quote
>     character.

We don't deprecate them; we just don't support them in phase 1. Users
of these encodings are encouraged to contribute a phase 2
implementation.

> e.  This causes problems for UTF-8 transition, since people will want
>     to put arbitrary byte strings in a raw string.

No, they won't. Also, if the declared encoding is UTF-8, it is
incorrect to put arbitrary byte strings into a string literal - but
the implementation does not detect this violation.

Regards,
Martin
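
The two phase 2 differences listed above can be illustrated with a
short sketch in present-day Python. This is not the interpreter's
lexer. It assumes Shift_JIS as the declared encoding, where U+8868
encodes as the bytes 0x95 0x5C, and it uses two hypothetical helpers:
naive_string_end models a byte-oriented scan for the closing quote,
and is_rejected models "decoding fails".

```python
# Illustrative sketch only; the helper names are hypothetical and do not
# correspond to anything in the Python implementation.

def naive_string_end(src: bytes, start: int) -> int:
    """Scan raw bytes for the closing double quote, treating 0x5C as an
    escape character. Returns the index of the quote, or -1 if not found."""
    i = start + 1
    while i < len(src):
        if src[i] == 0x5C:   # looks like a backslash: skip the next byte
            i += 2
            continue
        if src[i] == 0x22:   # double quote: end of the literal
            return i
        i += 1
    return -1


def is_rejected(src: bytes, encoding: str) -> bool:
    """Return True if the bytes cannot be decoded in the declared encoding."""
    try:
        src.decode(encoding)
    except UnicodeDecodeError:
        return True
    return False


# A string literal holding U+8868; its second Shift_JIS byte equals ASCII "\".
literal = '"\u8868"'.encode("shift_jis")

# Byte-oriented scan: the trail byte is mistaken for an escape, so the
# closing quote is skipped and the end of the string is not found.
byte_end = naive_string_end(literal, 0)

# Decode first, then scan characters: the closing quote is found.
char_end = literal.decode("shift_jis").find('"', 1)

# An encoding violation inside a comment: 0xFF never occurs in valid UTF-8.
bad_comment_rejected = is_rejected(b"# \xff\n", "utf-8")
```

Decoding the whole file before tokenizing is what makes both points
fall out of the same step: invalid bytes are caught anywhere in the
file, and quote detection works on characters rather than bytes.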