Showing content from https://mail.python.org/pipermail/python-dev/2000-May/003840.html below:

[I18n-sig] Re: [Python-Dev] Unicode debate

M.-A. Lemburg mal@lemburg.com
Tue, 02 May 2000 10:36:43 +0200
Just a small note on the subject of a character being atomic
which seems to have been forgotten by the discussing parties:

Unicode itself can be understood as a multi-word character
encoding, just like UTF-8. The reason is that Unicode entities
can be combined to produce single display characters (e.g.
u"e"+u"\u0301" will print "é" in a Unicode-aware renderer).
Slicing such a combined Unicode string between its entities
will have the same effect as slicing UTF-8 data in the middle
of a multi-byte sequence: the character is broken apart.
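
To make the parallel concrete, here is a minimal sketch. It assumes a
modern Python 3 interpreter, where str holds Unicode code points; the
helper name decodes_cleanly is made up for this illustration.

```python
# Assumption: Python 3, where str is a sequence of Unicode code points.
combined = "e\u0301"                 # "e" followed by COMBINING ACUTE ACCENT
utf8 = "\u00e9".encode("utf-8")      # precomposed "é" encoded as two UTF-8 bytes

# Slicing between the two code points silently drops the accent.
base_only = combined[:1]

# Slicing between the two bytes leaves an incomplete UTF-8 sequence.
first_byte = utf8[:1]

def decodes_cleanly(data):
    """Return True if the bytes form complete, valid UTF-8 (hypothetical helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

In both cases the slice index falls inside what the reader sees as one
character. The UTF-8 slice at least fails loudly on decoding, while the
code point slice yields a valid but different string.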

Most Latin-1 proponents seem to have single display
characters in mind. While the same is true for many
Unicode entities, there are quite a few cases of
combining characters in Unicode 3.0, and the Unicode
normalization algorithm uses these as the basis for its
work.
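
As a rough illustration of what normalization does with combining
characters, the sketch below uses the unicodedata module from the
standard library of current Python versions (an assumption; it is not
something this post relies on). The variable names are illustrative only.

```python
import unicodedata

decomposed = "e\u0301"   # base letter plus COMBINING ACUTE ACCENT

# NFC composes the pair into the single precomposed code point U+00E9.
composed = unicodedata.normalize("NFC", decomposed)

# NFD decomposes it again into base letter plus combining character.
roundtrip = unicodedata.normalize("NFD", composed)

# A non-zero canonical combining class marks a combining character.
accent_class = unicodedata.combining("\u0301")
```

The same display character therefore has two valid representations of
different lengths, which is why the number of entities in a string says
little about the number of characters a reader sees.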

So in the end, the "UTF-8 doesn't slice" argument holds for
Unicode itself too, just as it does for many Asian
multi-byte, variable-length character encodings,
image formats, audio formats, database formats, etc.

You can't really expect slicing to always "just work"
without some knowledge about the data you are slicing.
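
One way to apply such knowledge is sketched below, again assuming
Python 3 and the standard unicodedata module. The functions
split_clusters and slice_clusters are hypothetical names for this
example, and the approach is a simplification: it only keeps combining
characters attached to the preceding base character and does not
implement full Unicode text segmentation.

```python
import unicodedata

def split_clusters(text):
    """Group each base character with the combining characters that follow it."""
    clusters = []
    for ch in text:
        # A non-zero combining class means ch modifies the previous character.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def slice_clusters(text, start, stop):
    """Slice by display clusters instead of by individual code points."""
    return "".join(split_clusters(text)[start:stop])

word = "cafe\u0301s"                    # "cafés" with a decomposed "é"
naive = word[:4]                        # cuts between "e" and its accent
aware = slice_clusters(word, 0, 4)      # keeps the accent with its base letter
```

The naive slice returns "cafe" without the accent, while the
cluster-aware slice returns all five code points that make up the first
four display characters.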

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/




