Hello,

2008/7/3 Guido van Rossum <guido at python.org>:
> I don't see an answer there to the question of whether the length()
> method of a Java String object containing a single surrogate pair
> returns 1 or 2; I suspect it returns 2. Python 3 supports things like
> chr(0x12345) and ord("\U00012345"). (And so does Python 2, using
> unichr and unicode literals.)

python2.6 support for supplementary characters is not ideal:
>>> unichr(0x2f81a)
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
>>> ord(u'\U0002F81A')
TypeError: ord() expected a character, but string of length 2 found.
\Uxxxxxxxx seems the only way to enter these characters.

3.0 is much better and passes the two tests above.

The unicodedata module gives good results in both versions:
>>> unicodedata.name(u'\U0002F81A')
'CJK COMPATIBILITY IDEOGRAPH-2F81A'
[34311 refs]
>>> unicodedata.category(u'\U0002F81A')
'Lo'

With python 3.0, I found only two places that refuse large code points
on narrow builds: the "%c" format, and Py_BuildValue('C').
They should be fixed.

> The one thing that may be missing from Python is things like
> interpretation of surrogates by functions like isalpha() and I'm okay
> with adding that (since those have to loop over the entire string
> anyway).

In this case, a new .isascii() method would be needed for some uses.

-- 
Amaury Forgeot d'Arc
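
For illustration only, here is a minimal sketch of the surrogate-pair arithmetic behind the "string of length 2" result above. It is not code from CPython; the helper names to_surrogate_pair and from_surrogate_pair are hypothetical. It assumes Python 3.3 or later, where there is no narrow build any more, so the UTF-16 code units are counted by encoding the string explicitly.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x2F81A)
print(hex(high), hex(low))

# Number of 16-bit code units a narrow build would have stored
# for this single character.
utf16_units = len("\U0002F81A".encode("utf-16-le")) // 2
print(utf16_units)
```

A surrogate-aware ord() or isalpha() on a narrow build would have to perform the from_surrogate_pair step whenever it meets a high surrogate followed by a low one.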