Showing content from https://mail.python.org/pipermail/python-dev/2014-August/136050.html below:

[Python-Dev] surrogatepass - she's a witch, burn 'er!

Stephen J. Turnbull stephen at xemacs.org
Sat Aug 30 06:21:56 CEST 2014
Greg Ewing writes:
 > M.-A. Lemburg wrote:
 > > we needed
 > > a way to make sure that Python 3 also optionally supports working
 > > with lone surrogates in such UTF-8 streams (nowadays called CESU-8:
 > > http://en.wikipedia.org/wiki/CESU-8).

Besides what Greg says, CESU-8 is a UTF, and therefore encodes valid
Unicode.  Speaking imprecisely, CESU-8 is UTF-16 with variable-width
code units (i.e., each 16-bit code unit is represented using the UTF-8
variable-width representation).[1]
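
To make that description concrete, here is a minimal sketch of a CESU-8
encoder in Python.  The function name cesu8_encode is made up for this
illustration (Python ships no CESU-8 codec), and the sketch assumes the
input is valid Unicode text with no lone surrogates.

```python
def cesu8_encode(text: str) -> bytes:
    """Hypothetical helper: encode valid Unicode text as CESU-8."""
    out = bytearray()
    # Step 1: split the text into 16-bit UTF-16 code units (big-endian, no BOM).
    utf16 = text.encode("utf-16-be")
    for i in range(0, len(utf16), 2):
        unit = int.from_bytes(utf16[i:i + 2], "big")
        # Step 2: write each 16-bit unit using the UTF-8 bit layout.
        # surrogatepass is required here because the strict utf-8 codec
        # refuses to encode values in the surrogate range D800-DFFF.
        out += chr(unit).encode("utf-8", "surrogatepass")
    return bytes(out)


bmp = cesu8_encode("A\u00e9")        # BMP characters: same bytes as UTF-8
astral = cesu8_encode("\U00010400")  # non-BMP: two 3-byte sequences
```

For BMP-only text the result matches ordinary UTF-8.  A non-BMP character
such as U+10400 comes out as six bytes (one 3-byte sequence per surrogate
code unit) instead of the four bytes UTF-8 would use.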

I think you are thinking of Markus Kuhn's utf-8b (which I believe is
exactly what is implemented by the surrogateescape handler).
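
For reference, a small sketch of what the surrogateescape handler does
with undecodable bytes.  The sample bytes are arbitrary; the behaviour
shown (each bad byte 0xNN becomes the lone surrogate U+DCNN, and is
restored on encoding) is that of the standard Python 3 handler.

```python
raw = b"abc\xff\xfe"  # 0xff and 0xfe can never appear in valid UTF-8

# Decoding: each undecodable byte 0xNN is smuggled in as U+DCNN.
text = raw.decode("utf-8", "surrogateescape")

# Encoding with the same handler turns U+DCNN back into byte 0xNN.
restored = text.encode("utf-8", "surrogateescape")
```

Here text is "abc\udcff\udcfe" and restored equals raw, so arbitrary
bytes survive a decode/encode round trip.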

As for the goal of "working with lone surrogates in such UTF-8
streams", the surrogateescape handler already permits that.  It also
does so consistently across streams: lone surrogates from the UTF-8
stream cannot get mixed with garbage bytes decoded by surrogateescape
in another stream, a mixture that would produce an unencodable mess.
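
One way to read the consistency argument, sketched in Python.  The byte
strings are invented for the example: one stream carries the
UTF-8-style encoding of a lone surrogate (U+D800), the other carries a
garbage byte.

```python
encoded_surrogate = b"\xed\xa0\x80"  # UTF-8-style bytes for lone U+D800
garbage = b"\xff"                    # not valid UTF-8 at all

# surrogatepass decodes the three bytes to a real lone surrogate.
via_pass = encoded_surrogate.decode("utf-8", "surrogatepass")
# surrogateescape treats them as three undecodable bytes instead.
via_escape = encoded_surrogate.decode("utf-8", "surrogateescape")
junk = garbage.decode("utf-8", "surrogateescape")

# Consistent handlers: everything encodes back to the original bytes.
roundtrip = (via_escape + junk).encode("utf-8", "surrogateescape")

# Mixed handlers: U+D800 is outside the U+DC80..U+DCFF escape range,
# so surrogateescape cannot encode the combined string.
try:
    (via_pass + junk).encode("utf-8", "surrogateescape")
    mixed_encodable = True
except UnicodeEncodeError:
    mixed_encodable = False
```

With surrogateescape on both streams, roundtrip is exactly
encoded_surrogate + garbage.  Once a surrogatepass-decoded surrogate is
concatenated with escaped garbage, the surrogateescape encoder fails,
and encoding with surrogatepass instead would no longer restore the
original 0xff byte.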

I still don't see a justification for the surrogatepass handler.  What
applications are producing (not merely passing through) UTF-8-encoded
surrogates these days?


Footnotes: 
[1]  For the curious, it's imprecise because in Unicode code units are
fixed-width by definition.
