Just van Rossum wrote:
>
> At 8:30 AM -0400 02-05-2000, Guido van Rossum wrote:
> >I think /F's point was that the Unicode standard prescribes different
> >behavior here: for UTF-8, a missing or lone continuation byte is an
> >error; for Unicode, accents are separate characters that may be
> >inserted and deleted in a string but whose display is undefined under
> >certain conditions.
> >
> >(I just noticed that this doesn't work in Tkinter but it does work in
> >wish. Strange.)
> >
> >> FYI: Normalization is needed to make comparing Unicode
> >> strings robust, e.g. u"È" should compare equal to u"e\u0301".
                            ^
                            |
Here's a good example of what encoding errors can do: the above
character was an "e" with acute accent (u"é"). Looks like some
mailer converted this to some other code page and yet another
back to Latin-1 again and this even though the message header
for Content-Type clearly states that the document uses ISO-8859-1.

> >
> >Aha, then we'll see u == v even though type(u) is type(v) and len(u)
> >!= len(v). /F's world will collapse. :-)
>
> Does the Unicode spec *really* specifies u should compare equal to v?

The behaviour is needed in order to implement sorting Unicode.
See the www.unicode.org site for more information and the
tech reports describing this.

Note that I haven't mentioned anything about "automatic"
normalization. This should be a method on Unicode strings
and could then be used in sorting compare callbacks.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
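
To make the comparison discussed above concrete, here is a minimal sketch in present-day Python. It assumes the standard library's unicodedata.normalize() function, which is a module-level function rather than the string method proposed in this message. The helper names nfc_equal and nfc_compare are hypothetical and only serve the illustration, and the choice of the NFC form is likewise an assumption.

```python
import unicodedata
from functools import cmp_to_key


def nfc_equal(u, v):
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", u) == unicodedata.normalize("NFC", v)


def nfc_compare(u, v):
    """cmp-style callback: returns -1, 0 or 1 based on the NFC forms."""
    a = unicodedata.normalize("NFC", u)
    b = unicodedata.normalize("NFC", v)
    return (a > b) - (a < b)


composed = "\u00e9"     # "e" with acute accent as a single code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT

# Without normalization the strings differ, and so do their lengths.
print(composed == decomposed)          # False
print(len(composed), len(decomposed))  # 1 2

# With explicit normalization they compare equal.
print(nfc_equal(composed, decomposed))  # True

# Explicit normalization used inside a sorting compare callback.
words = sorted(["e\u0301t", "\u00e9a"], key=cmp_to_key(nfc_compare))
```

Nothing here is normalized automatically: the plain == comparison still reports the two spellings as different, and only the explicit helper treats them as equal. Note that this orders by code point after normalization, which is not the same as full language-aware collation.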