Paul Moore wrote:
> On 03/07/2008, Guido van Rossum <guido at python.org> wrote:
>> I don't see an answer there to the question of whether the length()
>> method of a Java String object containing a single surrogate pair
>> returns 1 or 2; I suspect it returns 2.
>
> It appears you're right:
>
>> type testucs.java
> class testucs {
>     public static void main(String[] args) {
>         StringBuilder s = new StringBuilder("Hello, ");
>         s.appendCodePoint(0x2F81A);
>         System.out.println(s); // Display the string.
>         System.out.println(s.length());
>     }
> }
>
>> java testucs
> Hello, ?
> 9
>
>> java -version
> java version "1.6.0_05"
> Java(TM) SE Runtime Environment (build 1.6.0_05-b13)
> Java HotSpot(TM) Client VM (build 10.0-b19, mixed mode, sharing)
>
>> Python 3 supports things like
>> chr(0x12345) and ord("\U00012345"). (And so does Python 2, using
>> unichr and unicode literals.)
>
> And Java doesn't appear to - that appendCodePoint() method was
> wonderfully hard to find :-)
>
There's also the issue of indexing Unicode strings. If we are going to
insist that len(u) counts a surrogate pair as one character, then random
access to the characters of a string is going to be an extremely
inefficient operation: finding character i means scanning from the start
of the string to skip over any pairs.

Surely it's desirable under all circumstances that

    len(u) == sum(1 for c in u)

and that

    [c for c in u] == [u[i] for i in range(len(u))]

How would that play under Jeroen's proposed change?

regards
 Steve
-- 
Steve Holden        +1 571 484 6266   +1 800 494 3119
Holden Web LLC              http://www.holdenweb.com/
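
To make the trade-off concrete, here is a minimal Python 3 sketch. It assumes a string stored as UTF-16 code units whose len() and indexing count each surrogate pair as one character. The names `utf16_units` and `CodePointView` are hypothetical, invented for this illustration; this is not Jeroen's proposal or any real implementation. Both invariants from the message hold, but only because `__len__` and `__getitem__` scan from the start of the string, which is the inefficiency in question.

```python
import struct


def utf16_units(s):
    """Return the UTF-16 code units of s as a list of ints (hypothetical helper)."""
    data = s.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


class CodePointView:
    """Hypothetical wrapper: stores UTF-16 code units, but iteration,
    len() and indexing treat each surrogate pair as one character."""

    def __init__(self, s):
        self.units = utf16_units(s)

    def __iter__(self):
        units = self.units
        i = 0
        while i < len(units):
            hi = units[i]
            is_pair = (
                0xD800 <= hi <= 0xDBFF
                and i + 1 < len(units)
                and 0xDC00 <= units[i + 1] <= 0xDFFF
            )
            if is_pair:
                lo = units[i + 1]
                # Combine high and low surrogates into one code point.
                yield chr(0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00))
                i += 2
            else:
                yield chr(hi)
                i += 1

    def __len__(self):
        # O(n): every pair has to be found before it can be counted once.
        return sum(1 for _ in self)

    def __getitem__(self, index):
        # O(n): character i cannot be located without scanning from the start.
        for i, c in enumerate(self):
            if i == index:
                return c
        raise IndexError(index)


s = "Hello, " + chr(0x2F81A)
units = utf16_units(s)   # code-unit view, as Java's length() counts
u = CodePointView(s)     # code-point view

print(len(units), len(u))
print(len(u) == sum(1 for c in u))
print([c for c in u] == [u[i] for i in range(len(u))])
```

A plain list of code units satisfies the same two invariants with constant-time indexing, but then len() reports 9 for this string, matching the Java output quoted above.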