Just van Rossum wrote:
>
> At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote:
> >Just a small note on the subject of a character being atomic
> >which seems to have been forgotten by the discussing parties:
> >
> >Unicode itself can be understood as multi-word character
> >encoding, just like UTF-8. The reason is that Unicode entities
> >can be combined to produce single display characters (e.g.
> >u"e"+u"\u0301" will print "é" in a Unicode aware renderer).
>
> Erm, are you sure Unicode prescribes this behavior, for this
> example? I know similar behaviors are specified for certain
> languages/scripts, but I didn't know it did that for latin.

The details are on the www.unicode.org web-site, buried
in some of the tech reports on normalization and
collation.

> >Slicing such a combined Unicode string will have the same
> >effect as slicing UTF-8 data.
>
> Not true. As Fredrik noted: no exception will be raised.

Huh? You will always get an exception when you convert
a broken UTF-8 sequence to Unicode. This is per design
of UTF-8 itself, which uses the top bit to identify
multi-byte character encodings.

Or can you give an example (perhaps you've found a bug
that needs fixing)?

> [ Speaking of exceptions,
>
> after I sent off my previous post I realized Guido's
> non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception
> argument can easily be turned around, backfiring at utf-8:
>
>     Defaulting to utf-8 when going from Unicode to 8-bit and
>     back only gives the *illusion* things "just work", since it
>     will *silently* "work", even if utf-8 is *not* the desired
>     8-bit encoding -- as shown by Fredrik's excellent "fun with
>     Unicode, part 1" example. Defaulting to Latin-1 will
>     warn the user *much* earlier, since it'll barf when
>     converting a Unicode string that contains any character
>     code > 255. So there.
> ]
>
> >It seems that most Latin-1 proponents seem to have single
> >display characters in mind. While the same is true for
> >many Unicode entities, there are quite a few cases of
> >combining characters in Unicode 3.0 and the Unicode
> >normalization algorithm uses these as basis for its
> >work.
>
> Still, two combining characters are still two input characters for
> the renderer! They may result in one *glyph*, but trust me,
> that's an entirely different can of worms.

No. Please see my other post on the subject...

> However, if you'd be talking about Unicode surrogates,
> you'd definitely have a point. How do Java/Perl/Tcl deal with
> surrogates?

Good question... anybody know the answers?

--
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
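
For anyone who wants to try the three behaviours argued about above, here is a minimal sketch. It assumes a current Python 3 interpreter, which postdates this thread, so the `u"..."` prefix is dropped and `unicodedata.normalize` is available. The names `combined`, `precomposed` and `slice_results` are made up for illustration and come from nobody's post.

```python
import unicodedata

# "e" followed by U+0301 COMBINING ACUTE ACCENT:
# two code points that a Unicode-aware renderer shows as one "é".
combined = "e" + "\u0301"

# NFC normalization composes the pair into the single code point U+00E9.
precomposed = unicodedata.normalize("NFC", combined)


def slice_results():
    """Return (sliced combined string, UTF-8 raised?, Latin-1 raised?)."""
    # Slicing the combined string silently drops the accent; no exception.
    base_only = combined[:1]

    # Slicing UTF-8 *bytes* in the middle of a multi-byte sequence leaves
    # a broken sequence, and decoding it raises an exception.
    utf8 = precomposed.encode("utf-8")  # two bytes for U+00E9
    try:
        utf8[:1].decode("utf-8")
        utf8_raised = False
    except UnicodeDecodeError:
        utf8_raised = True

    # Latin-1 cannot represent code points above 255, so encoding the
    # combining accent (U+0301) fails immediately.
    try:
        "\u0301".encode("latin-1")
        latin1_raised = False
    except UnicodeEncodeError:
        latin1_raised = True

    return base_only, utf8_raised, latin1_raised
```

Calling `slice_results()` shows both sides of the argument: slicing the combined Unicode string loses the accent without any error, while the truncated UTF-8 bytes and the Latin-1 encode attempt both fail loudly.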