Finn Bock wrote:
> 
> [M.-A. Lemburg]
> 
> >BTW, does Java support UCS-4 ? If not, then Java is wrong
> >here ;-)
> 
> Java claims to use unicode 2.1 [*]. I couldn't locate anything describing if
> this is UCS-2 or UTF-16. I think unicode 2.1 includes UCS-4. The actual
> level of support for UCS-4 is properly debatable.
> 
> - The builtin char is 16bit wide and can obviously not support UCS-4.
> - The Character class can report if a character is a surrogate:
> >>> from java.lang import Character
> >>> Character.getType("\ud800") == Character.SURROGATE
> 1

>>> unicodedata.category(u'\ud800')
'Cs'

... which means the same thing, only in Unicode3 standards
notation.

Makes me think: perhaps we should add the Java constants to
unicodedata base. Is there a list of those available
somewhere ?

> - As reported, direct string comparison ignore surrogates.

I would guess that this'll have to change as soon as JavaSoft
folks realize that they need to handle UTF-16 and not only
UCS-2.

> - The BreakIterator does not handle surrogates. It does handle
> combining characters and it seems a natural place to put support
> for surrogates.

What is a BreakIterator ? An iterator to scan line/word breaks ?

> - The Collator class offers different levels of normalization before
> comparing string but does not seem to support surrogates. This class
> seems a natural place for javasoft to put support for surrogates
> during string comparison.

We'll need something like this for 2.1 too: first some
standard APIs for normalization and then a few unicmp()
APIs to use for sorting.

We might even have to add collation sequences somewhere because
this is a locale issue as well... sometimes it's even worse
with different strategies used for different tasks within one
locale, e.g. in Germany we sometimes sort the Umlaut ä as "ae"
and at other times as an extra character.
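
To make the surrogate check and the two German sorting strategies
concrete, here is a minimal sketch in present-day Python. It is not
an API proposed in this thread: is_surrogate(), key_expand(),
key_separate() and both tables are hypothetical names made up for
illustration, the tables are nowhere near a full collation tailoring,
and unicodedata.normalize() is a later addition to the module.

```python
import unicodedata

def is_surrogate(ch):
    """True if ch is a surrogate code point (general category 'Cs')."""
    return unicodedata.category(ch) == "Cs"

# Hypothetical tables for illustration only, not a complete tailoring.
EXPANSIONS = {"ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss"}
UMLAUT_BASES = {"ä": "a", "ö": "o", "ü": "u"}

def key_expand(s):
    """Sort key that expands each umlaut to base letter + 'e'."""
    # Normalize first so 'a' + combining diaeresis compares like 'ä'.
    s = unicodedata.normalize("NFC", s).lower()
    return "".join(EXPANSIONS.get(c, c) for c in s)

def key_separate(s):
    """Sort key that treats each umlaut as its own letter after its base."""
    s = unicodedata.normalize("NFC", s).lower()
    # (base letter, is_umlaut): 'ä' sorts after every 'a', before 'b'.
    return [(UMLAUT_BASES.get(c, c), c in UMLAUT_BASES) for c in s]

words = ["Bas", "Bär", "Bar"]
by_expansion = sorted(words, key=key_expand)    # ['Bär', 'Bar', 'Bas']
by_separate = sorted(words, key=key_separate)   # ['Bar', 'Bas', 'Bär']
```

The same three words come out in a different order under each key,
which is why one fixed comparison function per locale would probably
not be enough.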

> These findings are gleaned from the sources of JDK1.3
> 
> [*]
> http://java.sun.com/docs/books/vmspec/2nd-edition/html/Concepts.doc.html#25310
> 

Thanks for the infos,
-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/