[MAL]
> > > Unicode itself can be understood as multi-word character
> > > encoding, just like UTF-8. The reason is that Unicode entities
> > > can be combined to produce single display characters (e.g.
> > > u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
> > > Slicing such a combined Unicode string will have the same
> > > effect as slicing UTF-8 data.

[/F]
> > really? does it result in a decoder error? or does it just result
> > in a rendering error, just as if you slice off any trailing character
> > without looking...

[MAL]
> In the example, if you cut off the u"\u0301", the "e" would
> appear without the acute accent, cutting off the u"e" would
> probably result in a rendering error or worse put the accent
> over the next character to the left.
>
> UTF-8 is better in this respect: it warns you about
> the error by raising an exception when being converted to
> Unicode.

I think /F's point was that the Unicode standard prescribes different
behavior here: for UTF-8, a missing or lone continuation byte is an
error; for Unicode, accents are separate characters that may be
inserted and deleted in a string but whose display is undefined under
certain conditions.

(I just noticed that this doesn't work in Tkinter but it does work in
wish.  Strange.)

> FYI: Normalization is needed to make comparing Unicode
> strings robust, e.g. u"é" should compare equal to u"e\u0301".

Aha, then we'll see u == v even though type(u) is type(v) and
len(u) != len(v).  /F's world will collapse. :-)

--Guido van Rossum (home page: http://www.python.org/~guido/)
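
[Editorial note: the following is a minimal sketch in modern Python 3,
not code from the thread itself, which predates Python 3 and used u""
literals. It illustrates the three behaviours discussed above: slicing
a combining-character sequence silently drops or strands the accent,
slicing UTF-8 bytes mid-character raises on decode, and normalization
makes the two spellings of "é" compare equal.]

    import unicodedata

    # "e" followed by U+0301 (COMBINING ACUTE ACCENT) renders as one
    # glyph "é" but is two code points; U+00E9 is the precomposed form.
    combined = "e" + "\u0301"
    precomposed = "\u00e9"

    print(len(combined), len(precomposed))   # 2 1
    print(combined == precomposed)           # False -- code points differ

    # Slicing the combined string never raises; it just loses the accent
    # or leaves a lone combining mark, as MAL describes.
    print(repr(combined[:1]))                # 'e'
    print(repr(combined[1:]))                # '\u0301'

    # Slicing UTF-8 *bytes* mid-character, by contrast, is an error
    # when decoded back to text.
    data = precomposed.encode("utf-8")       # b'\xc3\xa9', two bytes
    try:
        data[:1].decode("utf-8")
    except UnicodeDecodeError as exc:
        print("decode error:", exc)

    # Normalization (NFC composes, NFD decomposes) makes the two
    # spellings compare equal -- the robust-comparison point in the mail.
    print(unicodedata.normalize("NFC", combined) == precomposed)   # True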