Skip Montanaro <skip@pobox.com> writes:

> What's the current behavior?  If my program receives an input in utf-8
> (let's say it comes from a form on a website), what form will it be in, or
> can't I tell?

In general, you cannot tell in advance - it will depend on the data
source. W3C advocates "early normalization" towards "NFC", meaning that
on the Internet, you should always see NFC data - unless you are the
primary data source, e.g. by reading from a terminal, or after decoding
some legacy encoding.

It turns out that most Python codecs will produce NFC already, so
normalization to NFC would be required only for user input, and - as
it turns out - when reading file names on OS X.

> Is it possible I will get spurious inequalities today if I compare
> two different unicode objects which were created from different
> sources and in different normal forms?

If they are in different normal forms, you *will* get inequalities
reliably. In the real world, inequalities will be spurious.

> What about a string and a unicode object?  Where can I read all
> about it (Python and unicode normalization)?

Python does no normalization, so there is nothing to read. For
Unicode, you may want to start with the Normalization FAQ

http://www.unicode.org/unicode/faq/normalization.html

Regards,
Martin
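
To make the inequality described above concrete, here is a minimal sketch. It assumes a Python 3 interpreter and the standard library's `unicodedata.normalize` function, neither of which is mentioned in the message itself. The helper name `nfc` is hypothetical and only there for illustration. The same character is spelled once precomposed and once as a base letter plus a combining accent, then compared before and after explicit normalization to NFC.

```python
import unicodedata

# The letter "e with acute accent" spelled two ways:
# precomposed (the NFC form) and as "e" followed by a combining acute accent (the NFD form).
composed = "\u00e9"
decomposed = "e\u0301"

# String comparison works on code points and normalizes nothing by itself,
# so the two spellings compare unequal.
print(composed == decomposed)


def nfc(text):
    """Return text normalized to NFC (hypothetical helper name)."""
    return unicodedata.normalize("NFC", text)


# After explicit normalization on both sides, the comparison succeeds.
print(nfc(composed) == nfc(decomposed))
```

Normalizing once at the input boundary (user input, file names) rather than at every comparison matches the "early normalization" approach mentioned above.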