Fredrik Lundh wrote:
>
> MAL wrote:
> > To simplify the picture: the implementation itself only sees
> > UCS-2 or UCS-4 depending on the compile time option and these
> > do not treat surrogates in any special way except reserve
> > code points for their usage. Accordingly, unichr() should not
> > create UTF-16 but UCS-2 for narrow builds and UCS-4 on wide
> > builds
>
> you didn't answer my question: is there any reason why
> unichr(0xXXXXXXXX) shouldn't return exactly the same
> thing as "\UXXXXXXXX" ?
>
> in 2.0 and 2.1, it doesn't. in 2.2, it does.
>
> > (unichr() is a constructor for code units, not code points).

Doesn't this answer your question ?

The point I wanted to make was that unichr() is a constructor for
a single code unit just like chr() is a constructor for a single
code unit. In that sense the storage format used by the
implementation defines the outcome: for UCS-2 builds, it can only
create UCS-2 values; for UCS-4 builds, UCS-4 values are possible
as well.

The question of u"\UXXXXXXXX" creating surrogates on UCS-2 builds
is different: \UXXXXXXXX is an encoding of a Unicode code point,
so the codec has to decide whether to map this to two code units
or to raise an exception on UCS-2 builds.

> really? according to the documentation, it creates unicode
> *characters*. so does \U, according to the documentation.
>
> imo, it makes more sense to let "characters" mean code points
> than code units, but that's me.

The term "character" is vastly overloaded. There are three
different interpretations: graphemes (this is what a user usually
sees as a character on her display), code points (this is what
Unicode encodes) and code units (this is what the implementation
uses as the atom for storing code points).
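As a rough illustration of the code point versus code unit
distinction, here is a sketch in present-day Python 3. Since
Python 3.3 there are no narrow or wide builds any more, so the
UCS-2 storage is modelled with plain integers. The helper name
utf16_code_units is made up for this example. It shows the
choice a \U codec faces on a UCS-2 build: one code point above
U+FFFF needs two code units.

```python
def utf16_code_units(code_point):
    """Return the 16-bit code units that represent one code point.

    Hypothetical helper: code points up to U+FFFF fit in one unit,
    anything above is split into a high and a low surrogate.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point out of Unicode range")
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return [high, low]


print([hex(u) for u in utf16_code_units(0x41)])     # ['0x41']
print([hex(u) for u in utf16_code_units(0x1D11E)])  # ['0xd834', '0xdd1e']
```

One code point can therefore mean one or two storage atoms,
which is why the two terms cannot be used interchangeably.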
Since Python exposes code units (u[0] gives you direct access to
the implementation-defined storage area) and makes no assumption
about surrogates, it would not be a good idea to suddenly
introduce a break in the meaning of the outcome of indexing into
a Unicode string (u[0]) and len(unichr()).

I know that the name unichr() does not help in this situation;
the correct name would be unicodeunit().

> the important thing here is to
> figure out if \U and unichr are the same thing, and fix the code
> and the documentation to do/say what we mean.

Right.

Note that apart from agreeing on a common meaning, we should also
think about the consequences of breaking len(unichr())==1, e.g.
when creating a Unicode string using unichr() you'd expect to
find the generated code unit at the position where you appended
it to the Unicode object.

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Consulting & Company:                           http://www.egenix.com/
Python Software:                        http://www.lemburg.com/python/
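The len(unichr())==1 expectation can be sketched as follows. This
is a Python 3 model, not the interpreter's code: narrow_unichr is
a hypothetical stand-in for unichr() on a UCS-2 build, and a list
of integers stands in for the Unicode object's storage. The error
text is also an assumption.

```python
def narrow_unichr(code_point):
    """Hypothetical UCS-2 unichr(): exactly one 16-bit code unit."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("narrow_unichr() arg not in range(0x10000)")
    return code_point


units = []  # stands in for the storage of a UCS-2 Unicode string
# 0xD800 is a lone surrogate: accepted, since surrogates get no
# special treatment at the storage level.
for cp in (0x48, 0x20AC, 0xD800):
    position = len(units)
    units.append(narrow_unichr(cp))
    # The code unit is found exactly where it was appended.
    assert units[position] == cp

assert len(units) == 3
```

If the constructor could return two code units for one argument,
the position check in the loop would no longer hold for the
elements that follow.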