Martin v. Löwis wrote:
> MRAB wrote:
>> Martin v. Löwis wrote:
>> [snip]
>>> To convert non-decodable bytes, a new error handler "python-escape" is
>>> introduced, which decodes non-decodable bytes using into a private-use
>>> character U+F01xx, which is believed to not conflict with private-use
>>> characters that currently exist in Python codecs.
>>>
>>> The error handler interface is extended to allow the encode error
>>> handler to return byte strings immediately, in addition to returning
>>> Unicode strings which then get encoded again.
>>>
>>> If the locale's encoding is UTF-8, the file system encoding is set to
>>> a new encoding "utf-8b". The UTF-8b codec decodes non-decodable bytes
>>> (which must be >= 0x80) into half surrogate codes U+DC80..U+DCFF.
>>>
>> If the byte stream happens to include a sequence which decodes to
>> U+F01xx, shouldn't that raise an exception?
>
> I apparently have not expressed it clearly, so please help me improve
> the text. What I mean is this:
>
> - if the environment encoding (for lack of better name) is UTF-8,
>   Python stops using the utf-8 codec under this PEP, and switches
>   to the utf-8b codec.
> - otherwise (env encoding is not utf-8), undecodable bytes get decoded
>   with the error handler. In this case, U+F01xx will not occur
>   in the byte stream, since no other codec ever produces this PUA
>   character (this is not fully true - UTF-16 may also produce PUA
>   characters, but they can't appear as env encodings).
> So the case you are referring to should not happen.
>
I think what's confusing me is that you talk about mapping non-decodable
bytes to U+F01xx, but you also talk about decoding to half surrogate
codes U+DC80..U+DCFF.

If the bytes are mapped to single half surrogate codes instead of the
normal pairs (low+high), then I can see that decoding could never be
ambiguous and encoding could produce the original bytes.
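
To make the half-surrogate round trip concrete, here is a minimal sketch. It assumes Python 3.1 or later, where this idea is available as the "surrogateescape" error handler; the draft names "python-escape" and "utf-8b" used in this thread are not assumed to exist. The helper names decode_env and encode_env are hypothetical, chosen only for illustration.

```python
def decode_env(raw: bytes) -> str:
    """Decode bytes as UTF-8, mapping each undecodable byte 0xNN
    (always >= 0x80) to the lone surrogate U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


def encode_env(text: str) -> bytes:
    """Encode back to bytes, turning each lone surrogate U+DCNN
    into the original byte 0xNN."""
    return text.encode("utf-8", errors="surrogateescape")


# A Latin-1 encoded file name: 0xE9 is not valid UTF-8 in this position.
raw = b"caf\xe9.txt"
name = decode_env(raw)

# The bad byte shows up as a single half surrogate, not as a pair.
print(ascii(name))

# Encoding restores the original bytes exactly.
print(encode_env(name) == raw)
```

Since a strict UTF-8 decoder never produces a lone surrogate from well-formed input, a character in U+DC80..U+DCFF in the decoded string can only have come from an escaped byte, which is why the mapping stays unambiguous.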