Martin v. Loewis wrote:
> Guido van Rossum <guido@python.org> writes:
>
>>> It could be that Apple is decomposing the filenames before comparing
>>> them. Either way works.
>>
>> Hm, that sucks (either way) -- because you get unnormalized Unicode
>> out of directory listings, which is harder to turn into local
>> encodings.
>
> Notice that, most likely, Apple *does* normalize them - they just use
> Normal Form D (which favours decomposition, instead of using
> precomposed characters) - this is what Apple apparently calls
> "canonical".

Both the decomposition and the composition are called "canonical" --
simply because both operations lead to predefined results (those
defined by the Unicode database).

http://www.unicode.org/unicode/reports/tr15/ has all the details.

As always with Unicode, things are slightly more complicated than
what people are normally used to (but for good reasons). The
introduction of that tech report describes these things in detail.

Canonical equivalence basically means that the graphemes for the
Unicode code points, when rendered, look the same to the user -- even
though the code point combinations may be different. Normalization
takes care of mapping this visual equivalence to an algorithm.

Now, if the OS uses canonical equivalence to find file names, then
all possible combinations of code points resulting in the same
sequence of graphemes will give you a match -- for a good reason:
the user of a GUI file manager wouldn't be able to distinguish
between two canonically equivalent file names.

> That choice is not surprising - NFD is "more logical", as precomposed
> characters are available only arbitrarily (e.g. the WITH TILDE
> combinations exist for a, i, e, n, o, u, v, y, but not for, say, x).

... but in a well-defined manner, and that's what's important.

> The Unicode FAQ
> (http://www.unicode.org/unicode/faq/normalization.html) says
>
> Q: Which forms of normalization should I support?
>
> A: The choice of which to use depends on the particular program or
> system. The most commonly supported form is NFC, since it is more
> compatible with strings converted from legacy encodings. This is also
> the choice for the web, as per the recommendations in "Character Model
> for the World Wide Web" from the W3C. The other normalization forms
> are useful for other domains.
>
> So I guess Python should at least provide NFC - precisely because of
> the legacy encodings.

At least is good :-)

NFC is NFD + canonical composition. Decomposition isn't all that hard
(using unicodedata.decomposition(); a rough sketch follows at the end
of this message). For composition the situation is different: not all
of the needed information is available in the unicodedata database
(the exclusion list), and the database also doesn't provide the
reverse mapping from decomposed code points to composed ones. See the
Annexes of the tech report to get an impression of just how hard
composition is...

Still, this would be nice to have (written in C for speed, since it
would be a very common operation). Zope Corp. will certainly be
interested in this for Zope3 ;-)

--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting:                               http://www.egenix.com/
Python Software:                    http://www.egenix.com/files/python/
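
P.S.: To illustrate the easy half, here is a rough sketch of canonical
decomposition (NFD) built on unicodedata.decomposition(). This is a
toy version for illustration only -- not tuned for speed, and written
against a modern Python where chr() accepts any code point:

import unicodedata

def canonical_order(s):
    # Canonical ordering: within each run of combining marks
    # (non-zero combining class), stably sort by combining class.
    chars = list(s)
    i = 0
    while i < len(chars):
        if unicodedata.combining(chars[i]):
            j = i
            while j < len(chars) and unicodedata.combining(chars[j]):
                j += 1
            chars[i:j] = sorted(chars[i:j], key=unicodedata.combining)
            i = j
        else:
            i += 1
    return ''.join(chars)

def canonical_decompose(s):
    # Recursively apply canonical decompositions from the Unicode
    # database. unicodedata.decomposition() returns e.g. '0065 0301'
    # for a canonical mapping; compatibility mappings carry a tag
    # like '<compat>' and must be skipped for NFD.
    result = []
    for ch in s:
        decomp = unicodedata.decomposition(ch)
        if decomp and not decomp.startswith('<'):
            mapped = ''.join(chr(int(cp, 16)) for cp in decomp.split())
            result.append(canonical_decompose(mapped))
        else:
            result.append(ch)
    return canonical_order(''.join(result))

# Precomposed and decomposed spellings reduce to the same NFD string
# -- exactly the canonical equivalence the file system matches on:
print(canonical_decompose('\u00e9') == canonical_decompose('e\u0301'))
# -> True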
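
The reverse direction shows where the trouble starts. A naive reverse
table can be scraped from the database like this -- but note that the
sketch deliberately ignores the composition exclusion list and the
algorithmic Hangul compositions, which is precisely the information
unicodedata doesn't give you:

import sys
import unicodedata

def build_composition_table():
    # Invert the canonical decompositions: map a (starter, combining
    # mark) pair back to its precomposed character. Only two-element,
    # untagged (canonical) decompositions take part in pairwise
    # composition; singletons and compatibility mappings are skipped.
    table = {}
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        if decomp and not decomp.startswith('<'):
            parts = decomp.split()
            if len(parts) == 2:
                first, second = (chr(int(p, 16)) for p in parts)
                table[(first, second)] = ch
    return table

table = build_composition_table()
print(table[('e', '\u0301')] == '\u00e9')
# -> True

The catch: a table built this way will also happily compose pairs
that NFC must leave decomposed -- e.g. it maps DEVANAGARI LETTER KA
plus NUKTA to U+0958, which sits on the exclusion list. Without that
list you cannot get NFC right.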