> From: M.-A. Lemburg [mailto:mal@lemburg.com]
> Fredrik Lundh wrote:
> >
> > mal wrote:
> > > This really has nothing to do with being able to support
> > > surrogates or not (as Fredrik mentioned), it is the correct
> > > behaviour provided UTF-16 is used as encoding for UCS-4 values
> > > in Unicode literals which is what Python currently does.
> >
> > Really? I could have sworn that most parts of Python use
> > UCS-2, not UTF-16.

> The design specifies that Py_UNICODE refers to UTF-16. To make
> life easier, the implementation currently assumes UCS-2 in
> many parts, but this should only be considered a
> temporary situation. Since supporting UTF-16 poses some
> real challenges (being a variable length encoding), full
> support for surrogates was postponed to some future
> implementation.

Heh. Now you're being silly. Supporting UTF-16 isn't that difficult.
You always know whether a given 16-bit unit is a low surrogate or a
high surrogate.

The interesting question is whether you have your builtin Unicode
object expose each 16-bit character as is, or you support iterating
over Py_UCS4 characters, or you want to have a wrapping object that
does the right thing here. This might be the way to go.

> > Built-ins like ord, unichr, len; slicing;
> > string methods; regular expressions, etc. all clearly assume
> > that a Py_UNICODE is a unicode code point.
> >
> > My point is that we shouldn't pretend we're supporting
> > UTF-16 if we don't do that throughout.

> We should keep that design detail in mind though.

> > As far as I can tell, cmp() is the *only* unicode function
> > that thinks the internal storage is UTF-16.
> >
> > Everything else assumes UCS-2.

No, it's UTF-16, it just doesn't yet handle surrogates in all of the
appropriate places. :)

The unicodename stuff also needs to support future named surrogate
characters now.

> > And for Python 2.0, it's surely easier to fix cmp() than to
> > fix everything else.
> Also true :-)

Everything but the regular expressions would be fairly simple to add
UTF-16 support to. I'd imagine that adding support for \u10FFFF in
the regular expression syntax wouldn't be that hard either.

> > (this is the same problem as 8-bit/unicode comparisons, where
> > the current consensus is that it's better to raise an exception
> > if it looks like the programmer doesn't know what he was doing,
> > rather than pretend it's another encoding).

> Perhaps you are right and we should #if 0 the comparison
> sections related to UTF-16 for now. I'm not sure why Bill
> needed the cmp() function to support surrogates... Bill ?

I didn't need it to. I happened upon the code on the IBM website, so
I figured I'd point it out and see what people thought of sticking it
into the Python Unicode stuff. :)

(Wishing Python 2.0 would ship with Unicode collation support)

See the earlier comment about creating a wrapping class that handles
UTF-16 issues better.

> Still, it will have to be reenabled sometime in the
> future when full surrogate support is added to Python.

Bill
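
P.S. To make the "iterate over Py_UCS4 characters" idea concrete, here is a rough sketch in Python. It is not code from the Python source tree or from the IBM page; the name iter_code_points and the choice to pass unpaired surrogates through unchanged are assumptions made for illustration. It walks a sequence of 16-bit code units and yields one code point per character, combining a high surrogate and the low surrogate that follows it.

```python
# Surrogate ranges defined by UTF-16.
HIGH_START, HIGH_END = 0xD800, 0xDBFF
LOW_START, LOW_END = 0xDC00, 0xDFFF


def iter_code_points(units):
    """Yield code points from a sequence of 16-bit UTF-16 code units.

    Hypothetical helper: unpaired surrogates are yielded unchanged
    instead of being rejected.
    """
    i, n = 0, len(units)
    while i < n:
        unit = units[i]
        is_pair = (
            HIGH_START <= unit <= HIGH_END
            and i + 1 < n
            and LOW_START <= units[i + 1] <= LOW_END
        )
        if is_pair:
            low = units[i + 1]
            # Combine the two 10-bit halves and add the 0x10000 offset.
            yield 0x10000 + ((unit - HIGH_START) << 10) + (low - LOW_START)
            i += 2
        else:
            yield unit
            i += 1


bmp_last = [0xFFFF]            # U+FFFF as one code unit
first_supp = [0xD800, 0xDC00]  # U+10000 as a surrogate pair

# Code unit order puts the surrogate pair first ...
print(bmp_last < first_supp)
# ... while code point order puts U+FFFF first.
print(list(iter_code_points(bmp_last)) < list(iter_code_points(first_supp)))
```

The two comparisons at the end disagree, and that is the case the cmp() code is about: comparing raw 16-bit units and comparing code points give different orderings once surrogates are involved.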