Showing content from https://mail.python.org/pipermail/python-dev/2002-May/024645.html below:

[Python-Dev] Re: String module

Martin v. Loewis martin@v.loewis.de
30 May 2002 08:43:48 +0200
Guido van Rossum <guido@python.org> writes:

> > This reminds me that I often miss, in the standard `ctype.h' and related,
> > a function that would un-combine a character into its base character and
> > its diacritic, and the complementary re-combining function.
[...]
> I bet the Unicode standard has a standard way to do this.  

These are called Unicode normalization forms. Each "pre-combined"
character can also be represented as a base character followed by a
"combining diacritic". There are two symmetric normalization forms: NFC
favours pre-combined characters, NFD favours combining characters.
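
As a rough illustration of the two forms, here is a minimal sketch that
assumes a Python whose standard unicodedata module provides normalize().
The sample character and the variable names are chosen purely for
illustration.

```python
import unicodedata

composed = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE (pre-combined)

# NFD favours combining characters: base letter + combining diacritic.
decomposed = unicodedata.normalize("NFD", composed)
parts = [unicodedata.name(ch) for ch in decomposed]

# NFC favours pre-combined characters: the pair is merged again.
recomposed = unicodedata.normalize("NFC", decomposed)

print(parts)
print(recomposed == composed)
```

The round trip NFD then NFC gives back the original character here,
which is the "symmetric" behaviour described above.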

There is also a "compatibility decomposition" (K), where e.g. LATIN
SMALL LIGATURE FI decomposes to the letters f and i.
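
The difference between the canonical and the compatibility (K) forms can
be sketched as follows, again assuming unicodedata.normalize() is
available. The sample characters are only examples picked for this
sketch.

```python
import unicodedata

ligature = "\ufb01"  # LATIN SMALL LIGATURE FI

# The canonical decomposition leaves the ligature alone ...
canonical = unicodedata.normalize("NFD", ligature)

# ... while the compatibility decomposition replaces it with plain letters.
compat = unicodedata.normalize("NFKD", ligature)

# Under NFKC, ANGSTROM SIGN ends up as
# LATIN CAPITAL LETTER A WITH RING ABOVE.
angstrom = unicodedata.normalize("NFKC", "\u212b")

print(canonical == ligature, compat, unicodedata.name(angstrom))
```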

> Maybe we can implement that, and then project the same interface on
> 8-bit characters?

Not really. Needing to know the character set is one issue; the other
issue is that the stand-alone diacritic characters in ASCII are *not*
combining. We could certainly provide a mapping between the Unicode
combining diacritics and the stand-alone diacritics, say as a codec,
but that would be quite special-purpose.
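
To make the idea of such a mapping concrete, here is a small sketch. It
is not a codec and not an existing API: the table and the function names
uncombine() and recombine() are hypothetical, and the table covers only
three ASCII characters that have an obvious combining counterpart.

```python
import unicodedata

# Hypothetical table: Unicode combining diacritic -> stand-alone ASCII one.
COMBINING_TO_ASCII = {
    "\u0300": "`",  # COMBINING GRAVE ACCENT -> GRAVE ACCENT
    "\u0302": "^",  # COMBINING CIRCUMFLEX ACCENT -> CIRCUMFLEX ACCENT
    "\u0303": "~",  # COMBINING TILDE -> TILDE
}
ASCII_TO_COMBINING = {v: k for k, v in COMBINING_TO_ASCII.items()}


def uncombine(char):
    """Split a character into (base, stand-alone diacritics)."""
    decomposed = unicodedata.normalize("NFD", char)
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = "".join(
        COMBINING_TO_ASCII.get(c, c)  # unmapped marks pass through as-is
        for c in decomposed
        if unicodedata.combining(c)
    )
    return base, marks


def recombine(base, marks):
    """Inverse of uncombine(): build the pre-combined character via NFC."""
    combining = "".join(ASCII_TO_COMBINING.get(m, m) for m in marks)
    return unicodedata.normalize("NFC", base + combining)


print(uncombine("\u00f1"))   # LATIN SMALL LETTER N WITH TILDE
print(recombine("a", "^"))
```

The table makes the special-purpose nature visible: ASCII has no
stand-alone counterpart for most combining marks, so anything outside
the table is passed through unchanged.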

Providing a good normalization library is necessary, though, since
many other algorithms (both from W3C and IETF) require Unicode
normalization as part of the processing (usually to NFKC).

Regards,
Martin



