Just van Rossum wrote:

> After quickly browsing through the unicode.org URLs I posted earlier, I
> reach the following (possibly wrong) conclusions:

here's another good paper that covers this, the universe, and everything:

    Character Model for the World Wide Web
    http://www.w3.org/TR/charmod

among many other things, it argues that normalization should be done at
the source, and that it should be sufficient to do binary matching to
tell if two strings are identical.

...

another very interesting thing from that paper is where they identify
four layers of character support:

    Layer 1: Physical representation. This is necessary for APIs that
    expose a physical representation of string data. /.../ To avoid
    problems with duplicates, it is assumed that the data is
    normalized /.../

    Layer 2: Indexing based on abstract codepoints. /.../ This is the
    highest layer of abstraction that ensures interoperability with
    very low implementation effort. To avoid problems with duplicates,
    it is assumed that the data is normalized /.../

    Layer 3: Combining sequences, user-relevant. /.../ While we think
    that an exact definition of this layer should be possible, such a
    definition does not currently exist.

    Layer 4: Depending on language and operation. This layer is least
    suited for interoperability, but is necessary for certain
    operations, e.g. sorting.

until now, this discussion has focused on the boundary between layer 1
and layer 2. it has always been obvious to me that as many python
strings as possible should live on the second layer ("very low
implementation effort" is exactly my style ;-), and that the rest
should be left to the application.

...while Guido and MAL have argued that we should stay on layer 1
(apparently because "we've already implemented it" is less effort than
"let's change a little bit").

no wonder they never understand what I'm talking about...

it's also interesting to see that MAL is using layer 3 and 4 issues as
an argument for keeping Python's string support at layer 1. in
contrast, the W3 paper considers normalization a non-issue even at
layer 1. go figure.

...

btw, how about adopting this paper as the "Character Model for Python"?

yes, I'm serious.

</F>

PS. here's my take on Just's normalization points:

> - there is a script and language independent canonical form (but automatic
> normalization is indeed a bad idea)

> - ideally, unicode comparisons should follow the rules from
> http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
> for 1.6, if at all...)

note that the W3 paper recommends early normalization and plain binary
comparison (assuming the same internal representation of the unicode
character codes, of course).

> - this would indeed mean that it's possible for u == v even though type(u)
> is type(v) and len(u) != len(v). However, I don't see how this would
> collapse /F's world, as the two strings are at most semantically
> equivalent. Their physical difference is real, and still follows the
> a-string-is-a-sequence-of-characters rule (!).

yes, but on layer 3 instead of layer 2.

> - there may be additional customized language-specific sorting rules. I
> currently don't see how to implement that without some global variable.

layer 4.

> - the sorting rules are very complicated, and should be implemented by
> calculating "sort keys". If I understood it correctly, these can take up to
> 4 bytes per character in its most compact form. Still, for it to be
> somewhat speed-efficient, they need to be cached...

layer 4.

> - u.find() may need an alternative API, which returns a (begin, end) tuple,
> since the match may not have the same length as the search string... (This
> is tricky, since you need the begin and end indices in the non-canonical
> form...)

layer 3.
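to make the comparison and find() points concrete, here's a rough
sketch (it assumes a normalization primitive along the lines of
unicodedata.normalize from later Python versions, which 1.6 doesn't
have; find_span is a hypothetical helper of my own, not a proposed API,
and the brute-force search is for illustration only):

    import unicodedata

    # two canonically equivalent spellings of the same user-perceived text:
    # "e with acute" as one precomposed code point vs. "e" + combining acute
    u = u"caf\u00e9"
    v = u"cafe\u0301"

    # at layer 2 the code point sequences simply differ
    assert u != v and len(u) != len(v)

    # early normalization (here: NFC) makes plain binary comparison
    # sufficient, which is what the W3 character model recommends
    assert unicodedata.normalize("NFC", u) == unicodedata.normalize("NFC", v)

    # hypothetical span-returning search: returns (begin, end) indices
    # into the *non-normalized* haystack, so the match may be longer or
    # shorter than the needle
    def find_span(haystack, needle, form="NFC"):
        target = unicodedata.normalize(form, needle)
        for begin in range(len(haystack)):
            for end in range(begin + 1, len(haystack) + 1):
                if unicodedata.normalize(form, haystack[begin:end]) == target:
                    return begin, end
        return None

    # a five-code-point match for a four-code-point needle
    print(find_span(v + u" noir", u))   # (0, 5)

note that this sketch will happily match a bare base character and
leave a following combining mark behind; snapping the span to
combining-sequence boundaries is exactly the layer 3 part that isn't
nailed down yet.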
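and for the sort-key point, a minimal layer 4 sketch that leans on the
locale module's collation support with a hand-rolled cache (a real
implementation would build tr10-style collation keys; locale.strxfrm is
just a stand-in here, and the cache strategy is mine):

    import locale

    # pick up the user's locale for language-specific collation rules
    locale.setlocale(locale.LC_COLLATE, "")

    _sortkey_cache = {}

    def sortkey(s):
        # strxfrm turns a string into a sort key; keys can be several
        # bytes per character, so cache them instead of recomputing
        key = _sortkey_cache.get(s)
        if key is None:
            key = _sortkey_cache[s] = locale.strxfrm(s)
        return key

    words = [u"cote", u"cot\u00e9", u"c\u00f4te", u"c\u00f4t\u00e9"]
    print(sorted(words, key=sortkey))

depending on the active locale, the accented forms end up next to their
unaccented bases instead of after "z", which is the whole point of
sorting by key rather than by code point.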