On 29Apr2009 22:14, Stephen J. Turnbull <stephen at xemacs.org> wrote:
| Baptiste Carvello writes:
|  > By contrast, if the new utf-8b codec would *supercede* the old one,
|  > \udcxx would always mean raw bytes (at least on UCS-4 builds, where
|  > surrogates are unused). Thus ambiguity could be avoided.
|
| Unfortunately, that's false. It could have come from a literal string
| (similar to the text above ;-), a C extension, or a string slice (on
| 16-bit builds), and there may be other ways to do it. The only way to
| avoid ambiguity is to change the definition of a Python string to be
| *valid* Unicode (possibly with Python extensions such as PEP 383 for
| internal use only). But Guido has rejected that in the past;
| validation is the application's problem, not Python's.
|
| Nor is a UCS-4 build exempt. IIRC Guido specifically envisioned
| Python strings being used to build up code point sequences to be
| directly output, which means that a UCS-4 string might none-the-less
| contain surrogates being added to a string intended to be sent as
| UTF-16 output simply by truncating the 32-bit code units to 16 bits.

Wouldn't you then be bypassing the implicit encoding anyway, at least to
some extent, and thus not trip over the PEP?
--
Cameron Simpson <cs at zip.com.au> DoD#743
http://www.cskk.ezoshosting.com/cs/

Clemson is the Harvard of cardboard packaging.
        - overhead by WIRED at the Intelligent Printing conference Oct2006
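
For readers following the ambiguity argument quoted above, here is a minimal sketch of it. It assumes the behaviour PEP 383 eventually shipped with in Python 3.1, where the error handler is named `surrogateescape` (the thread calls the proposal utf-8b); the variable names are purely illustrative.

```python
# Assumption: Python 3.1+ where PEP 383 is available as the
# "surrogateescape" error handler.

# Bytes that are not valid UTF-8; the stray 0xff is escaped on decode.
raw = b"abc\xff"
decoded = raw.decode("utf-8", "surrogateescape")

# The same code points written directly as a string literal,
# with no undecodable bytes ever involved.
literal = "abc\udcff"

# The two strings are indistinguishable once they exist.
same_string = decoded == literal

# Encoding the literal with the same handler turns the lone
# surrogate back into a raw byte.
round_trip = literal.encode("utf-8", "surrogateescape")
```

Here `same_string` is true and `round_trip` equals `raw`, so a \udcxx code point in a string does not by itself say whether it came from escaped bytes or from a literal.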