Fredrik Lundh wrote:
> M.-A. Lemburg <mal@lemburg.com> wrote:
> > Just a small note on the subject of a character being atomic,
> > which seems to have been forgotten by the discussing parties:
> >
> > Unicode itself can be understood as a multi-word character
> > encoding, just like UTF-8. The reason is that Unicode entities
> > can be combined to produce single display characters (e.g.
> > u"e" + u"\u0301" will print "é" in a Unicode-aware renderer).
> > Slicing such a combined Unicode string will have the same
> > effect as slicing UTF-8 data.
>
> really? does it result in a decoder error? or does it just result
> in a rendering error, just as if you slice off any trailing character
> without looking...

In the example, if you cut off the u"\u0301", the "e" would appear
without the acute accent; cutting off the u"e" would probably result
in a rendering error or, worse, put the accent over the next character
to the left.

UTF-8 is better in this respect: it warns you about the error by
raising an exception when being converted to Unicode.

> > It seems that most Latin-1 proponents have single
> > display characters in mind. While the same is true for
> > many Unicode entities, there are quite a few cases of
> > combining characters in Unicode 3.0, and the Unicode
> > normalization algorithm uses these as the basis for its
> > work.
>
> do we support automatic normalization in 1.6?

No, but it is likely to appear in 1.7... not sure about the
"automatic" part, though.

FYI: Normalization is needed to make comparing Unicode strings
robust, e.g. u"é" should compare equal to u"e\u0301".

--
Marc-Andre Lemburg
______________________________________________________________________
Business:        http://www.lemburg.com/
Python Pages:    http://www.lemburg.com/python/
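[Editor's note: the normalization behavior discussed above can be illustrated with the `unicodedata` module of a modern Python 3, which postdates this mail; a minimal sketch, not the API that was being debated at the time:]

```python
import unicodedata

# "é" as one precomposed code point vs. base letter + combining mark
composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" + COMBINING ACUTE ACCENT

# The two strings render identically but compare unequal without
# normalization -- exactly the robustness problem described above.
assert composed != decomposed

# NFC composes, NFD decomposes; after normalization they compare equal.
assert unicodedata.normalize("NFC", decomposed) == composed
assert unicodedata.normalize("NFD", composed) == decomposed

# Slicing the decomposed form silently strips the combining accent,
# the "slicing such a combined Unicode string" hazard from the mail.
print(decomposed[:1])  # prints "e" -- the accent is lost
```

Note that the slice raises no error at all: unlike cutting a UTF-8 byte sequence mid-character, splitting a base letter from its combining mark only shows up (if ever) at rendering or comparison time.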