Finn Bock wrote:
>
> CPython's unicode compare function contains some code to compare surrogate
> characters in code-point order (I think). This is probably a very neat
> feature but it differs from Java's way of comparing strings.
>
> Python 2.0b1 (#0, Jul 26 2000, 21:29:11) [MSC 32 bit (Intel)] on win32
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> Copyright 1995-2000 Corporation for National Research Initiatives (CNRI)
> >>> print u'\ue000' < u'\ud800'
> 1
> >>> print ord(u'\ue000') < ord(u'\ud800')
> 0
> >>>
>
> Java (and JPython) compares the 16-bit characters numerically, which results in:
>
> JPython 1.1+08 on java1.3.0 (JIT: null)
> Copyright (C) 1997-1999 Corporation for National Research Initiatives
> >>> print u'\ue000' < u'\ud800'
> 0
> >>> print ord(u'\ue000') < ord(u'\ud800')
> 0
> >>>
>
> I don't think I can come up with a solution that allows JPython to emulate
> CPython on this type of comparison.

The code originally worked the same way as what Java does here. Bill
Tutt then added ideas from some IBM Java lib which turn the UTF-16
comparison into a true UCS-4 comparison.

This really has nothing to do with being able to support surrogates
or not (as Fredrik mentioned). It is the correct behaviour provided
UTF-16 is used as the encoding for UCS-4 values in Unicode literals,
which is what Python currently does.

BTW, does Java support UCS-4? If not, then Java is wrong here ;-)

Comparing Unicode strings is not as trivial as one might think:
private code point areas introduce a great many possibilities of getting
things wrong, and the fact that many characters can be expressed
by combining other characters adds to the confusion.

E.g. for sorting, we'd need full normalization support for Unicode
and would have to come up with some smart strategy to handle
private code point areas.

All this is highly non-trivial and will probably not get implemented
for a while (other issues are more important right now, e.g.
getting the internals to use the default encoding instead of UTF-8).

For now I'd suggest leaving Bill's code activated because it does
the right thing for Python's Unicode implementation (which is
built upon UTF-16).

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
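
To illustrate the difference discussed above between comparing raw 16-bit code units (the Java/JPython result) and comparing in code-point order (the CPython result), here is a minimal sketch in modern Python 3. It is not the actual CPython code: the helper names (utf16_code_units, fix_up, code_unit_key, code_point_key) are made up for this example, and the remapping shown is only one common way to make UTF-16 code units sort in code-point order, assumed here to be similar in spirit to what Bill's patch does.

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of `text` as a list of integers."""
    # 'surrogatepass' lets lone surrogates (e.g. U+D800) through the codec.
    data = text.encode("utf-16-be", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def fix_up(unit):
    """Remap one code unit so that surrogates sort above U+E000..U+FFFF."""
    if unit >= 0xE000:
        return unit - 0x0800   # E000..FFFF -> D800..F7FF
    if unit >= 0xD800:
        return unit + 0x2000   # D800..DFFF -> F800..FFFF
    return unit


def code_unit_key(text):
    """Sort key for plain 16-bit numeric comparison (hypothetical helper)."""
    return utf16_code_units(text)


def code_point_key(text):
    """Sort key for code-point-order comparison (hypothetical helper)."""
    return [fix_up(unit) for unit in utf16_code_units(text)]


# Plain 16-bit comparison: 0xE000 is not less than 0xD800.
print(code_unit_key("\ue000") < code_unit_key("\ud800"))    # False
# Code-point order: the surrogate is moved above U+E000.
print(code_point_key("\ue000") < code_point_key("\ud800"))  # True
```

The same remapping also orders a BMP character such as U+E000 before a non-BMP character such as U+10000 (stored as the surrogate pair D800 DC00), which a plain 16-bit comparison gets backwards.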