> As far as I can tell, cmp() is the *only* unicode function
> that thinks the internal storage is UTF-16.
>
> Everything else assumes UCS-2.
>
> And for Python 2.0, it's surely easier to fix cmp() than to
> fix everything else.

Agreed (I think).

> (this is the same problem as 8-bit/unicode comparisons, where
> the current consensus is that it's better to raise an exception
> if it looks like the programmer doesn't know what he was doing,
> rather than pretend it's another encoding).
>
> :::
>
> To summarize, here's the "character encoding guidelines" for
> Python 2.0:
>
> In Unicode context, 8-bit strings contain ASCII. Characters
> in the 0x80-0xFF range are invalid. 16-bit strings contain
> UCS-2. Characters in the 0xD800-0xDFFF range are invalid.
>
> If you want to use any other encoding, use the codecs
> provided by the Unicode subsystem. If you need to use Unicode
> characters that cannot be represented as UCS-2, you cannot
> use Python 2.0's Unicode subsystem.
>
> Anything else is just a hack.

I wouldn't go so far as raising an exception when a comparison involves 0xD800-0xDFFF; after all, we don't raise an exception when an ASCII string contains 0x80-0xFF either (except when converting to Unicode). The invalidity of 0xD800-0xDFFF means that these aren't valid Unicode code points; it doesn't mean that we should trap all attempts to use these values. That way, apps that need UTF-16 awareness can code it themselves. Why? Because I don't want to proliferate code that explicitly traps 0xD800-0xDFFF throughout the implementation.

--Guido van Rossum (home page: http://www.pythonlabs.com/~guido/)
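
A minimal sketch of the kind of UTF-16-aware comparison an application could code for itself, as suggested above: combine surrogate pairs into code points before comparing, instead of comparing raw UCS-2 code units. The helper names utf16_key and utf16_cmp are made up for this illustration and are not part of Python's Unicode subsystem.

    def utf16_key(s):
        """Return a tuple of code points, combining surrogate pairs."""
        points = []
        i = 0
        while i < len(s):
            c = ord(s[i])
            # High surrogate followed by a low surrogate: one code point.
            if 0xD800 <= c <= 0xDBFF and i + 1 < len(s):
                c2 = ord(s[i + 1])
                if 0xDC00 <= c2 <= 0xDFFF:
                    points.append(0x10000 + ((c - 0xD800) << 10) + (c2 - 0xDC00))
                    i = i + 2
                    continue
            # Ordinary code unit (or an unpaired surrogate, kept as-is).
            points.append(c)
            i = i + 1
        return tuple(points)

    def utf16_cmp(a, b):
        """Compare two strings in Unicode code-point order (-1, 0 or 1)."""
        ka, kb = utf16_key(a), utf16_key(b)
        return (ka > kb) - (ka < kb)

For example, utf16_cmp(u'\ud800\udc00', u'\uffff') returns 1, because the surrogate pair decodes to U+10000, which sorts above U+FFFF in code-point order, while a raw code-unit comparison would call the first string smaller.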