At 10:36 AM +0200 02-05-2000, M.-A. Lemburg wrote:
>Just a small note on the subject of a character being atomic
>which seems to have been forgotten by the discussing parties:
>
>Unicode itself can be understood as multi-word character
>encoding, just like UTF-8. The reason is that Unicode entities
>can be combined to produce single display characters (e.g.
>u"e"+u"\u0301" will print "é" in a Unicode aware renderer).

Erm, are you sure Unicode prescribes this behavior, for this
example? I know similar behaviors are specified for certain
languages/scripts, but I didn't know it did that for Latin.

>Slicing such a combined Unicode string will have the same
>effect as slicing UTF-8 data.

Not true. As Fredrik noted: no exception will be raised.

[ Speaking of exceptions, after I sent off my previous post I
realized Guido's
non-utf8-strings-interpreted-as-utf8-will-often-raise-an-exception
argument can easily be turned around, backfiring at utf-8:

    Defaulting to utf-8 when going from Unicode to 8-bit and back
    only gives the *illusion* things "just work", since it will
    *silently* "work", even if utf-8 is *not* the desired 8-bit
    encoding -- as shown by Fredrik's excellent "fun with Unicode,
    part 1" example. Defaulting to Latin-1 will warn the user *much*
    earlier, since it'll barf when converting a Unicode string that
    contains any character code > 255. So there.
]

>It seems that most Latin-1 proponents seem to have single
>display characters in mind. While the same is true for
>many Unicode entities, there are quite a few cases of
>combining characters in Unicode 3.0 and the Unicode
>normalization algorithm uses these as basis for its
>work.

Still, two combining characters are still two input characters for
the renderer! They may result in one *glyph*, but trust me,
that's an entirely different can of worms.

However, if you'd be talking about Unicode surrogates, you'd
definitely have a point. How do Java/Perl/Tcl deal with
surrogates?

Just
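
A minimal sketch of the points argued above, for anyone who wants to try them. It assumes a present-day Python 3 interpreter (where str is Unicode), which is not what this thread was written against, and the variable names are made up for illustration. It only shows how the combining sequence, string slicing, UTF-8 slicing and Latin-1 encoding behave there; it says nothing about what the default encoding ought to be.

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT: two code points, usually one glyph.
decomposed = "e" + "\u0301"
# LATIN SMALL LETTER E WITH ACUTE: a single precomposed code point.
precomposed = "\u00e9"

# The renderer still receives two input characters for the combined form.
decomposed_len = len(decomposed)
precomposed_len = len(precomposed)

# NFC normalization composes the pair into the precomposed code point.
nfc_equal = unicodedata.normalize("NFC", decomposed) == precomposed

# Slicing the combined string raises nothing; it just drops the accent.
sliced_text = decomposed[:1]

# Slicing UTF-8 bytes in the middle of a character leaves undecodable data.
utf8_bytes = precomposed.encode("utf-8")
try:
    utf8_bytes[:1].decode("utf-8")
    utf8_slice_decodes = True
except UnicodeDecodeError:
    utf8_slice_decodes = False

# Encoding to Latin-1 fails for any character code above 255.
try:
    "\u20ac".encode("latin-1")  # EURO SIGN, code point 0x20AC
    latin1_accepts_euro = True
except UnicodeEncodeError:
    latin1_accepts_euro = False
```

The two try blocks are where the behaviors differ: the sliced text is still a valid string, while the sliced UTF-8 data and the Latin-1 conversion both raise.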