"Martin v. Löwis" writes: > No, that's explicitly *not* what C6 says. Instead, it says that a > process that treats s1 and s2 differently shall not assume that others > will do the same, i.e. that it is ok to treat them the same even though > they have different code points. Treating them differently is also > conforming. Then what requirement does C6 impose, in your opinion? It sounds like you don't think it imposes any, in practice. Note that in the discussion of C6, the standard says, - Ideally, an implementation would *always* interpret two canonical-equivalent sequences *identically*. There are practical circumstances under which implementations may reasonably distinguish them. (Emphasis mine.) The examples given are things like "inspecting memory representation structure" (which properly speaking is really outside of Unicode conformance) and "ignoring collation behavior of combining sequences outside the repertoire of a specified language." That sounds like "Special cases aren't special enough to break the rules. Although practicality beats purity." to me. Treating things differently is an exceptional case, that requires sufficient justification. My understanding is that if those strings are exchanged with an another process, then whether or not treating them differently is allowed depends on whether the results will be output to another process, and what the definition of our process is. Sometimes it will be allowed, but mostly it won't. Take file names as an example. If our process is working with an external process (the OS's file system driver) whose definition includes the statement that "File names are sequences of Unicode characters", then C6 says our process must compare canonically equivalent sequences that it takes to be file names as the same, whether or not they are in the same normalized form, or normalized at all, because we can't assume the file system will treat them as different. 
If we do treat them as different, our users will get very upset (eg, if we don't signal a duplicate file name input by the user, and then the OS proceeds to overwrite an existing file). Dually, having made the statement that file names are Unicode, C6 says that the OS driver must return the same file given two canonically equivalent strings that happen to have different code points in them, because it may not assume that *we* will treat those strings as different names of different files. *Users* will certainly take the viewpoint that two strings that display the same on their monitor should identify the same file when they use them as file names. Now, I'm *not* saying that Python's strings *should* conform to the Unicode standard in this respect yet (or ever, for that matter; I'm with Guido on that). I'm simply saying that the current implementation of strings, as improved by PEP 393, can not be said to be conforming. I would like to see something much more conformant done as a separate library (the Python Components for Unicode, say), intended to support users who need character-based behavior, Unicode-ly correct collation, etc., more than efficiency. Applications that need both will have to make their own way at first, either by contributing improvements to the library or by using application-specific algorithms.
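
To make the file-name case concrete, here is a minimal sketch in Python of the difference between code-point comparison (what `str.__eq__` does) and comparison up to canonical equivalence. It uses only the standard `unicodedata.normalize`; the helpers `canonically_equal` and `find_duplicate` are hypothetical names invented for illustration, not part of Python's string type or of PEP 393, and comparing NFD forms is just one way an application might choose to do this.

```python
import unicodedata
from typing import List, Optional


def canonically_equal(s1: str, s2: str) -> bool:
    """Return True if s1 and s2 are canonically equivalent.

    Hypothetical helper: two strings are canonically equivalent exactly
    when their NFD (canonical decomposition) forms are identical.
    """
    return unicodedata.normalize("NFD", s1) == unicodedata.normalize("NFD", s2)


def find_duplicate(new_name: str, existing_names: List[str]) -> Optional[str]:
    """Return the existing name that new_name duplicates, if any.

    Hypothetical helper: this is the check an application would need
    before asking a file system to create new_name, assuming the file
    system may treat canonically equivalent names as the same file.
    """
    for name in existing_names:
        if canonically_equal(new_name, name):
            return name
    return None


# Two spellings that display identically but differ in code points.
composed = "caf\u00e9.txt"      # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301.txt"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

print(composed == decomposed)                   # False: code points differ
print(canonically_equal(composed, decomposed))  # True: same NFD form
print(find_duplicate(decomposed, [composed]) == composed)  # True
```

A plain `==` on the two names reports them as different, which is exactly the situation in which a duplicate would go unsignalled; the normalizing comparison catches it, at the cost of a normalization pass per comparison.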