[M.-A. Lemburg]

>Finn Bock wrote:
>>
>> [M.-A. Lemburg]
>>
>> >BTW, does Java support UCS-4 ? If not, then Java is wrong
>> >here ;-)
>>
>> Java claims to use unicode 2.1 [*]. I couldn't locate anything describing if
>> this is UCS-2 or UTF-16. I think unicode 2.1 includes UCS-4. The actual
>> level of support for UCS-4 is probably debatable.
>>
>> - The builtin char is 16bit wide and can obviously not support UCS-4.
>> - The Character class can report if a character is a surrogate:
>>    >>> from java.lang import Character
>>    >>> Character.getType("\ud800") == Character.SURROGATE
>>    1
>
>>>> unicodedata.category(u'\ud800')
>'Cs'
>
>... which means the same thing only in Unicode3 standards
>notation.
>
>Makes me think: perhaps we should add the Java constants to
>unicodedata base. Is there a list of those available
>somewhere ?

    UNASSIGNED = 0
    UPPERCASE_LETTER
    LOWERCASE_LETTER
    TITLECASE_LETTER
    MODIFIER_LETTER
    OTHER_LETTER
    NON_SPACING_MARK
    ENCLOSING_MARK
    COMBINING_SPACING_MARK
    DECIMAL_DIGIT_NUMBER
    LETTER_NUMBER
    OTHER_NUMBER
    SPACE_SEPARATOR
    LINE_SEPARATOR
    PARAGRAPH_SEPARATOR
    CONTROL
    FORMAT
    PRIVATE_USE
    SURROGATE
    DASH_PUNCTUATION
    START_PUNCTUATION
    END_PUNCTUATION
    CONNECTOR_PUNCTUATION
    OTHER_PUNCTUATION
    MATH_SYMBOL
    CURRENCY_SYMBOL
    MODIFIER_SYMBOL
    OTHER_SYMBOL

>> - As reported, direct string comparison ignores surrogates.
>
>I would guess that this'll have to change as soon as JavaSoft
>folks realize that they need to handle UTF-16 and not only
>UCS-2.

Predicting the future can be difficult, but here is my take:
JavaSoft will never change the way String.compareTo works.
String.compareTo is documented as:
"""
    Compares two strings lexicographically. The comparison is based on
    the Unicode value of each character in the strings. ...
"""

Instead they will mark it as a very naive string comparison and suggest
that users use the Collator classes for anything but the simplest cases.

>> - The BreakIterator does not handle surrogates. It does handle
>>   combining characters and it seems a natural place to put support
>>   for surrogates.
>
>What is a BreakIterator ? An iterator to scan line/word breaks ?

Yes, as well as character breaks. It already contains the framework for
marking two chars next to each other as one.

>> - The Collator class offers different levels of normalization before
>>   comparing strings but does not seem to support surrogates. This class
>>   seems a natural place for javasoft to put support for surrogates
>>   during string comparison.
>
>We'll need something like this for 2.1 too: first some
>standard APIs for normalization and then a few unicmp()
>APIs to use for sorting.
>
>We might even have to add collation sequences somewhere because
>this is a locale issue as well... sometimes it's even worse
>with different strategies used for different tasks within one
>locale, e.g. in Germany we sometimes sort the Umlaut ä as "ae"
>and at other times as extra character.

Info: The Java Collator class is configured with

- a locale and
- a strength parameter
     IDENTICAL; all differences are significant.
     PRIMARY    (a vs b)
     SECONDARY  (a vs ä)
     TERTIARY   (a vs A)
- a decomposition (http://www.unicode.org/unicode/reports/tr15/)
     NO_DECOMPOSITION
     CANONICAL_DECOMPOSITION
     FULL_DECOMPOSITION

regards,
finn
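
To make the surrogate and strength points above concrete, here is a minimal sketch in present-day Python 3. It is not from the original thread. The helper names (utf16_units, compare_code_units, compare_code_points, sketch_key) are hypothetical, and sketch_key is only a crude stand-in for the idea of comparison strength. It does not reproduce the Java Collator or any real locale-aware collation. The first part shows why a comparison over 16-bit units can order strings differently from a comparison over code points. The second part shows how decomposition plus case and mark stripping gives coarser or finer notions of "equal".

```python
import unicodedata


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of integers."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def compare_code_units(a, b):
    """Compare by 16-bit code units, as a comparison over 16-bit chars would."""
    ua, ub = utf16_units(a), utf16_units(b)
    return (ua > ub) - (ua < ub)


def compare_code_points(a, b):
    """Compare by code point; Python 3 str comparison already works this way."""
    return (a > b) - (a < b)


def sketch_key(s, strength):
    """Hypothetical helper: a crude key loosely mimicking strength levels.

    This is an assumption-laden illustration, not real collation.
    """
    decomposed = unicodedata.normalize("NFD", s)  # canonical decomposition
    if strength == "tertiary":
        return decomposed                  # case and accents both matter
    folded = decomposed.casefold()
    if strength == "secondary":
        return folded                      # accents matter, case does not
    if strength == "primary":
        # Drop combining marks so only base letters remain.
        return "".join(c for c in folded
                       if not unicodedata.category(c).startswith("M"))
    raise ValueError("unknown strength: %r" % (strength,))


bmp = "\uff5e"        # a BMP character above the surrogate range
supp = "\U00010000"   # first character outside the BMP

print(unicodedata.category("\ud800"))        # Cs
print([hex(u) for u in utf16_units(supp)])   # ['0xd800', '0xdc00']
print(compare_code_points(bmp, supp))        # -1
print(compare_code_units(bmp, supp))         # 1

print(sketch_key("\u00e4", "primary") == sketch_key("a", "primary"))      # True
print(sketch_key("\u00e4", "secondary") == sketch_key("a", "secondary"))  # False
print(sketch_key("A", "secondary") == sketch_key("a", "secondary"))       # True
print(sketch_key("A", "tertiary") == sketch_key("a", "tertiary"))         # False
```

The two compare functions disagree because the leading surrogate 0xD800 is numerically smaller than 0xFF5E, even though U+10000 is the larger code point.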