A RetroSearch Logo

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Search Query:

Showing content from https://mail.python.org/pipermail/python-dev/2009-April/089204.html below:

Non-decodable Bytes in System Character Interfaces

[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces [Python-Dev] PEP 383: Non-decodable Bytes in System Character InterfacesStephen J. Turnbull stephen at xemacs.org
Wed Apr 29 15:14:18 CEST 2009
Baptiste Carvello writes:

 > By contrast, if the new utf-8b codec would *supercede* the old one,
 > \udcxx would always mean raw bytes (at least on UCS-4 builds, where
 > surrogates are unused). Thus ambiguity could be avoided.

Unfortunately, that's false.  It could have come from a literal string
(similar to the text above ;-), a C extension, or a string slice (on
16-bit builds), and there may be other ways to do it.  The only way to
avoid ambiguity is to change the definition of a Python string to be
*valid* Unicode (possibly with Python extensions such as PEP 383 for
internal use only).  But Guido has rejected that in the past;
validation is the application's problem, not Python's.

Nor is a UCS-4 build exempt.  IIRC Guido specifically envisioned
Python strings being used to build up code point sequences to be
directly output, which means that a UCS-4 string might none-the-less
contain surrogates being added to a string intended to be sent as
UTF-16 output simply by truncating the 32-bit code units to 16 bits.

More information about the Python-Dev mailing list

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4