Martin v. Löwis wrote:
> Define correctly. Python, in ucs2 mode, will allow to address individual
> surrogate codes, e.g. in indexing. So you get
>
> >>> u"\U00012345"[0]

When Python encodes characters internally in UCS-2, I would expect
u"\U00012345" to produce a UnicodeError("character can not be encoded in
UCS-2").

> u'\ud808'
>
> This will never work "correctly", and never should, because an efficient
> implementation isn't possible. If you want "safe" indexing and slicing,
> you need ucs4.

I agree that UCS4 is needed. There is a balancing act here; UTF-16 is
widely used and takes less space, while UCS4 is easier to treat as an
array of characters. Maybe we can have both: unicode objects start with
an internal representation in UTF-16, but get promoted automatically to
UCS4 when you index or slice them. The difference will not be visible
to Python code. A compile-time switch will not be necessary.

What do you think?

Shane
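
For reference, here is a minimal sketch of the surrogate arithmetic behind the u'\ud808' result quoted above. It assumes a current Python 3 interpreter, where strings are indexed by code point, so the UTF-16 code units are obtained by encoding explicitly. The helper names to_surrogates and utf16_units are made up for this illustration and are not part of Python.

```python
import struct


def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of ints.

    Hypothetical helper for illustration only.
    """
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


high, low = to_surrogates(0x12345)
print(hex(high), hex(low))  # 0xd808 0xdf45

# Indexing the code units, as a ucs2 build does, yields the lone high
# surrogate; indexing by code point yields the whole character.
units = utf16_units("\U00012345")
print(hex(units[0]))        # 0xd808
print(len(units), len("\U00012345"))  # 2 1
```

The first code unit is the lone high surrogate that a ucs2 build returns for index 0, which is why indexing by code unit cannot return a whole character outside the BMP.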