On Wed, Aug 24, 2011 at 8:34 PM, Greg Ewing <greg.ewing at canterbury.ac.nz> wrote:
> What about things like the surrogateescape codec that
> deliberately use code units in non-standard ways? Will
> tricks like that still be possible if the code-unit
> level is hidden from the programmer?

I would think that it should still be possible to explicitly put
surrogates into a string, using the appropriate \uxxxx escape or
chr(i) or some such approach; the basic string operations IMO
shouldn't bother with checking for well-formed character sequences
(just as they shouldn't care about normal forms). But decoding bytes
from UTF-16 should not leave any surrogate pairs in, since
interpreting those is part of the decoding.

I'm not sure what should happen with UTF-8 when it (in flagrant
violation of the standard, I presume) contains two separately-encoded
surrogates forming a valid surrogate pair; probably whatever the UTF-8
codec does on a wide build today should be good enough. Similarly for
encoding to UTF-8 on a wide build if one managed to create a string
containing a surrogate pair.

Basically, I'm for a garbage-in-garbage-out approach (with separate
library functions to detect garbage if the app is worried about it).

-- 
--Guido van Rossum (python.org/~guido)
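
For illustration only (this is not part of the original message): a minimal sketch of the behaviour discussed above, assuming a current CPython 3 interpreter (3.3 or later). The helper name has_surrogates is hypothetical and merely stands in for the kind of "separate library function to detect garbage" the message mentions. What any particular build did in 2011 may differ from what this shows.

```python
def has_surrogates(text):
    """Hypothetical helper: True if text contains a code point in U+D800..U+DFFF."""
    return any(0xD800 <= ord(ch) <= 0xDFFF for ch in text)


# Surrogates can be put into a str explicitly, via an escape or chr(i).
# Concatenation and len() neither validate nor combine them.
s = "a" + chr(0xD800) + "\udc00" + "b"
print(len(s), has_surrogates(s))

# Decoding UTF-16 interprets the surrogate pair as part of decoding,
# so the result holds one non-BMP code point and no surrogates.
utf16_bytes = b"\x3d\xd8\x00\xde"  # U+1F600 encoded as UTF-16-LE
text = utf16_bytes.decode("utf-16-le")
print(len(text), hex(ord(text)), has_surrogates(text))

# The surrogateescape error handler deliberately maps an undecodable
# byte to a lone surrogate so that the bytes can be round-tripped.
escaped = b"\xff".decode("utf-8", "surrogateescape")
print(ascii(escaped), escaped.encode("utf-8", "surrogateescape"))
```

The first part is the garbage-in-garbage-out side: the string operations accept the surrogates without complaint, and it is left to a separate check such as has_surrogates to notice them.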