Shane Hathaway wrote:
> Ok. Thanks for helping me understand where Python is WRT unicode. I
> can work around the issues (or maybe try to help solve them) now that I
> know the current state of affairs. If Python correctly handled UTF-16
> strings internally, we wouldn't need the UCS-4 configuration switch,
> would we?

Define "correctly". Python, in ucs2 mode, lets you address individual
surrogate codes, e.g. in indexing. So you get

    >>> u"\U00012345"[0]
    u'\ud808'

This will never work "correctly", and never should, because an efficient
implementation isn't possible. If you want "safe" indexing and slicing,
you need ucs4.

Regards,
Martin
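
To illustrate the point about indexing, here is a minimal sketch (not part of the original message) of what happens at the UTF-16 code-unit level. It assumes Python 3, where strings are indexed by code point, so the code units are recovered by encoding to UTF-16 explicitly. The helper names utf16_code_units, code_point_at and nth_code_point are hypothetical, chosen only for this example. Finding the n-th code point requires a scan from the start of the string, which is why code-point indexing over UTF-16 storage cannot be constant-time.

```python
import struct


def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return list(struct.unpack(">%dH" % (len(data) // 2), data))


def code_point_at(units, i):
    """Decode the code point starting at code-unit index i.

    Returns (code_point, number_of_code_units_consumed).
    """
    unit = units[i]
    is_high = 0xD800 <= unit <= 0xDBFF
    if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
        # A high surrogate followed by a low surrogate encodes one
        # code point above U+FFFF.
        high = unit - 0xD800
        low = units[i + 1] - 0xDC00
        return 0x10000 + (high << 10) + low, 2
    return unit, 1


def nth_code_point(units, n):
    """Return the n-th code point, walking the code units from the start."""
    i = 0
    for _ in range(n):
        i += code_point_at(units, i)[1]
    return code_point_at(units, i)[0]


units = utf16_code_units("a\U00012345b")

# Indexing by code unit exposes the lone high surrogate, as in ucs2 mode.
print(hex(units[1]))

# Indexing by code point needs the linear scan in nth_code_point.
print(hex(nth_code_point(units, 1)))
print(chr(nth_code_point(units, 2)))
```

Here units[1] is the high surrogate 0xd808, matching the u'\ud808' shown in the session above, while nth_code_point(units, 1) recovers the full code point 0x12345 at the cost of walking the string.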