On Wed, 3 May 2000, Just van Rossum wrote:
> After quickly browsing through the unicode.org URLs I posted earlier, I
> reach the following (possibly wrong) conclusions:
>
> - there is a script and language independent canonical form (but automatic
>   normalization is indeed a bad idea)
> - ideally, unicode comparisons should follow the rules from
>   http://www.unicode.org/unicode/reports/tr10/ (But it seems hardly realistic
>   for 1.6, if at all...)

I just looked through this document.  Indeed, there's a lot
of work to be done if we want to compare strings this way.

I thought the most striking feature was that this comparison
method does *not* satisfy the common assumption

    a > b  implies  a + c > b + d        (+ is concatenation)

-- in fact, it is specifically designed to allow for cases where
differences in the *later* part of a string can have greater
influence than differences in an earlier part of a string.

It *does* still guarantee that

    a + b > a

and of course we can still rely on the most basic rules such as

    a > b  and  b > c  implies  a > c

There are sufficiently many significant transformations
described in the UTR 10 document that i'm pretty sure it
is possible for two things to collate equally but not be
equivalent.  (Even after Unicode normalization, there is still
the possibility of rearrangement in step 1.2.)

This would be another motivation for Python to carefully
separate the three types of equality:

    is         identity-equal
    ==         value-equal
    <=>        magnitude-equal

We currently don't distinguish between the last two;
the operator "<=>" is my proposal for how to spell
"magnitude-equal", and in terms of outward behaviour
you can consider (a <=> b) to be (a <= b and a >= b).
I suspect we will find ourselves needing it if we do
rich comparisons anyway.

(I don't know of any other useful kinds of equality,
but if you've run into this before, do pipe up...)


-- ?!ng
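
A rough sketch of the points above, in Python. The two-level key here
(toy_key, a hypothetical helper) is only a stand-in for UTR 10 collation,
not an implementation of it: it assumes base letters are compared first,
ignoring case, and accents are compared only as a tie-breaker.
magnitude_equal is likewise a made-up name for the proposed "<=>", spelled
as (a <= b and a >= b) on the keys.

```python
import unicodedata


def toy_key(s):
    """Hypothetical two-level sort key; NOT the UTR 10 algorithm.

    Level 1: base letters, case-folded with lower().
    Level 2: the combining marks (accents), used only to break ties.
    """
    decomposed = unicodedata.normalize("NFD", s)
    primary = "".join(
        ch for ch in decomposed if not unicodedata.combining(ch)
    ).lower()
    secondary = "".join(ch for ch in decomposed if unicodedata.combining(ch))
    return (primary, secondary)


def magnitude_equal(a, b):
    """Stand-in for the proposed (a <=> b): a <= b and a >= b in sort order."""
    return toy_key(a) <= toy_key(b) and toy_key(a) >= toy_key(b)


# A later difference can outweigh an earlier one:
# a > b holds, yet a + c > b + d does not.
a, b = "\u00e9", "e"          # e-acute vs. plain e
c, d = "a", "b"
later_part_wins = toy_key(a) > toy_key(b) and toy_key(a + c) < toy_key(b + d)

# Three kinds of equality, each weaker than the one before.
x = "resume"
y = "Resume"
identity_equal = x is y               # same object?
value_equal = x == y                  # same code points?
collates_equal = magnitude_equal(x, y)  # same position in the ordering?

print(later_part_wins, identity_equal, value_equal, collates_equal)
```

Under this toy key, "resume" and "Resume" are neither identity-equal nor
value-equal, but they are magnitude-equal, which is the case "==" alone
cannot express.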