Fredrik Lundh wrote:
>
> mal wrote:
> > This really has nothing to do with being able to support
> > surrogates or not (as Fredrik mentioned), it is the correct
> > behaviour provided UTF-16 is used as encoding for UCS-4 values
> > in Unicode literals which is what Python currently does.
>
> Really? I could have sworn that most parts of Python use
> UCS-2, not UTF-16.

The design specifies that Py_UNICODE refers to UTF-16. To make
life easier, the implementation currently assumes UCS-2 in
many parts, but this should only be considered a temporary
situation. Since supporting UTF-16 poses some real challenges
(being a variable-length encoding), full support for surrogates
was postponed to some future implementation.

> Built-ins like ord, unichr, len; slicing;
> string methods; regular expressions, etc. all clearly assume
> that a Py_UNICODE is a unicode code point.
>
> My point is that we shouldn't pretend we're supporting
> UTF-16 if we don't do that throughout.

We should keep that design detail in mind though.

> As far as I can tell, cmp() is the *only* unicode function
> that thinks the internal storage is UTF-16.
>
> Everything else assumes UCS-2.

True.

> And for Python 2.0, it's surely easier to fix cmp() than to
> fix everything else.

Also true :-)

> (this is the same problem as 8-bit/unicode comparisons, where
> the current consensus is that it's better to raise an exception
> if it looks like the programmer doesn't know what he was doing,
> rather than pretend it's another encoding).

Perhaps you are right and we should #if 0 the comparison
sections related to UTF-16 for now. I'm not sure why Bill
needed the cmp() function to support surrogates... Bill?

Still, it will have to be reenabled sometime in the
future when full surrogate support is added to Python.

> :::
>
> To summarize, here's the "character encoding guidelines" for
> Python 2.0:
>
>     In Unicode context, 8-bit strings contain ASCII. Characters
>     in the 0x80-0xFF range are invalid. 16-bit strings contain
>     UCS-2. Characters in the 0xD800-0xDFFF range are invalid.

The latter is not true. In fact, thanks to Bill, the UTF-8 codec
already supports processing surrogates and will output correct
UTF-8 even for Unicode strings containing surrogates.

>     If you want to use any other encoding, use the codecs
>     provided by the Unicode subsystem. If you need to use Unicode
>     characters that cannot be represented as UCS-2, you cannot
>     use Python 2.0's Unicode subsystem.

See above.

--
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
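
For readers following the UCS-2 versus UTF-16 distinction discussed above, here is a minimal sketch of where the two interpretations diverge. It is written for modern Python 3, not for the Python 2.0 internals the thread is about, and the helper names `utf16_code_units` and `is_surrogate` are hypothetical, introduced only for illustration. It shows that a code point beyond the 16-bit range becomes a surrogate pair in UTF-16, that UTF-8 encodes the same code point as a single four-byte sequence, and that comparing raw 16-bit units can order strings differently from comparing code points.

```python
# Illustration only: modern Python 3, not Python 2.0's Py_UNICODE storage.

def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def is_surrogate(unit):
    """True if a 16-bit unit lies in the surrogate range 0xD800-0xDFFF."""
    return 0xD800 <= unit <= 0xDFFF


astral = "\U00010000"    # first code point that does not fit in 16 bits
bmp_high = "\uFFFF"      # last code point that does fit in 16 bits

# In UTF-16 the astral code point is stored as two surrogate code units.
units = utf16_code_units(astral)

# UTF-8 encodes the same code point directly as one four-byte sequence.
utf8_bytes = astral.encode("utf-8")

# Comparing raw 16-bit units puts U+FFFF after U+10000 ...
code_unit_order = utf16_code_units(bmp_high) > utf16_code_units(astral)
# ... while comparing code points puts U+FFFF before U+10000.
code_point_order = ord(bmp_high) < ord(astral)

print([hex(u) for u in units], utf8_bytes, code_unit_order, code_point_order)
```

A comparison that treats the storage as UTF-16 has to correct for this ordering difference, whereas code that assumes UCS-2 simply compares the 16-bit values as they are.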