Guido van Rossum <guido@python.org> writes:

> > This reminds me that I often miss, in the standard `ctype.h' and related,
> > a function that would un-combine a character into its base character and
> > its diacritic, and the complementary re-combining function.

[...]

> I bet the Unicode standard has a standard way to do this.

This is called 'unicode normalization forms'. Each "pre-combined"
character can also be represented as a base character, and a
"combining diacritic". There are symmetric normalization forms: NFC
favours pre-combined characters, NFD favours combining characters.

There is also a "compatibility decomposition" (K), where e.g. ANGSTROM
SIGN decomposes to LATIN CAPITAL LETTER A WITH RING ABOVE.

> Maybe we can implement that, and then project the same interface on
> 8-bit characters?

Not really. Needing to know the character set is one issue; the other
issue is that the stand-alone diacritic characters in ASCII are *not*
combining.

We could certainly provide a mapping between the Unicode combining
diacritics and the stand-alone diacritics, say as a codec, but that
would be quite special-purpose.

Providing a good normalization library is necessary, though, since
many other algorithms (both from W3C and IETF) require Unicode
normalization as part of the processing (usually to NFKC).

Regards,
Martin
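
To make the un-combine / re-combine idea above concrete, here is a
minimal sketch. It assumes a Python whose unicodedata module provides
normalize() and combining(). The names uncombine, recombine,
SPACING_DIACRITICS and to_spacing are hypothetical, chosen only for
this illustration, and the three-entry table is deliberately tiny; it
is not a proposal for an actual codec.

```python
import unicodedata


def uncombine(ch):
    """Split a character into (base, combining marks) via NFD."""
    decomposed = unicodedata.normalize("NFD", ch)
    # combining() returns a non-zero class for combining marks.
    base = "".join(c for c in decomposed if not unicodedata.combining(c))
    marks = "".join(c for c in decomposed if unicodedata.combining(c))
    return base, marks


def recombine(base, marks):
    """Compose a base character and its marks again via NFC."""
    return unicodedata.normalize("NFC", base + marks)


# Hypothetical mapping from combining diacritics to the stand-alone
# (non-combining) ASCII characters. Illustrative only, not exhaustive.
SPACING_DIACRITICS = {
    "\u0300": "`",  # COMBINING GRAVE ACCENT -> GRAVE ACCENT
    "\u0302": "^",  # COMBINING CIRCUMFLEX ACCENT -> CIRCUMFLEX ACCENT
    "\u0303": "~",  # COMBINING TILDE -> TILDE
}


def to_spacing(marks):
    """Map combining marks to ASCII stand-ins; '?' if none is known."""
    return "".join(SPACING_DIACRITICS.get(m, "?") for m in marks)


base, marks = uncombine("\u00e9")      # LATIN SMALL LETTER E WITH ACUTE
roundtrip = recombine(base, marks)     # back to the pre-combined form

# The K forms additionally fold compatibility characters, for example
# LATIN SMALL LIGATURE FI becomes the two letters "f" and "i".
folded = unicodedata.normalize("NFKC", "\ufb01")
```

Note that the round trip only works because the input is a string of
Unicode characters; for 8-bit data one would first have to decode it
with the right character set, which is the first issue raised above.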