mal wrote:
> This really has nothing to do with being able to support
> surrogates or not (as Fredrik mentioned), it is the correct
> behaviour provided UTF-16 is used as encoding for UCS-4 values
> in Unicode literals which is what Python currently does.

Really? I could have sworn that most parts of Python use UCS-2, not UTF-16.

Built-ins like ord, unichr, len; slicing; string methods; regular expressions, etc. all clearly assume that a Py_UNICODE is a Unicode code point.

My point is that we shouldn't pretend we're supporting UTF-16 if we don't do that throughout.

As far as I can tell, cmp() is the *only* Unicode function that thinks the internal storage is UTF-16. Everything else assumes UCS-2.

And for Python 2.0, it's surely easier to fix cmp() than to fix everything else.

(this is the same problem as 8-bit/Unicode comparisons, where the current consensus is that it's better to raise an exception if it looks like the programmer doesn't know what he was doing, rather than pretend it's another encoding).

:::

To summarize, here are the "character encoding guidelines" for Python 2.0:

In Unicode context, 8-bit strings contain ASCII. Characters in the 0x80-0xFF range are invalid.

16-bit strings contain UCS-2. Characters in the 0xD800-0xDFFF range are invalid.

If you want to use any other encoding, use the codecs provided by the Unicode subsystem.

If you need to use Unicode characters that cannot be represented as UCS-2, you cannot use Python 2.0's Unicode subsystem.

Anything else is just a hack.

</F>
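
To make the cmp() disagreement concrete, here is a minimal sketch in present-day Python 3 (not Python 2.0, whose internals differ). It shows how ordering by raw 16-bit units and ordering by code point can give opposite answers once surrogates are involved, and how the "0xD800-0xDFFF is invalid" guideline could be checked. The helper names utf16_units and is_valid_ucs2 are hypothetical, chosen only for this illustration.

```python
import struct


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints (hypothetical helper)."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def is_valid_ucs2(units):
    """True if no 16-bit unit falls in the surrogate range 0xD800-0xDFFF."""
    return all(not (0xD800 <= unit <= 0xDFFF) for unit in units)


bmp = "\uffff"         # highest code point that fits in one 16-bit unit
astral = "\U00010000"  # first code point that needs a surrogate pair

units_bmp = utf16_units(bmp)        # one unit: 0xFFFF
units_astral = utf16_units(astral)  # two units: 0xD800, 0xDC00

# Comparing the raw 16-bit units puts the surrogate pair first,
# because 0xD800 < 0xFFFF.
code_unit_order = units_astral < units_bmp

# Comparing by code point puts U+10000 after U+FFFF.
code_point_order = ord(astral) < ord(bmp)

print(code_unit_order, code_point_order)
print(is_valid_ucs2(units_bmp), is_valid_ucs2(units_astral))
```

Under the guidelines above, the second string would simply be rejected as invalid UCS-2, so the question of which ordering is "right" would not come up.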