Nicholas Bastin wrote:
> On May 6, 2005, at 3:42 PM, James Y Knight wrote:
>>It means all the string operations treat strings as if they were
>>UCS-2, but that in actuality, they are UTF-16. Same as the case in the
>>windows APIs and Java. That is, all string operations are essentially
>>broken, because they're operating on encoded bytes, not characters,
>>but claim to be operating on characters.
>
> Well, this is a completely separate issue/problem. The internal
> representation is UTF-16, and should be stated as such. If the
> built-in methods actually don't work with surrogate pairs, then that
> should be fixed.

Wait... are you saying a Py_UNICODE array contains either UTF-16 or
UTF-32 characters, but never UCS-2? That's a big surprise to me. I may
need to change my PyXPCOM patch to fit this new understanding. I tried
hard to not care how Python encodes unicode characters, but details like
this are important when combining two frameworks with different unicode
APIs.

Shane
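
To make the UCS-2 versus UTF-16 distinction in this thread concrete, here is a minimal sketch. It is written for a present-day Python 3 interpreter, not the 2005 build under discussion, and it does not touch Py_UNICODE or PyXPCOM. The helper names utf16_code_units and is_valid_ucs2 are hypothetical, chosen only for illustration. The sketch shows that a character outside the Basic Multilingual Plane takes two 16-bit code units (a surrogate pair) in UTF-16, which is exactly the case that code assuming UCS-2 would miscount.

```python
import struct


def utf16_code_units(text):
    """Return the 16-bit code units of text encoded as UTF-16 (little-endian, no BOM)."""
    raw = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(raw) // 2), raw))


def is_valid_ucs2(text):
    """True if every code point is in the BMP, where UCS-2 and UTF-16 agree."""
    return all(ord(ch) <= 0xFFFF for ch in text)


bmp = "\u00e9"         # U+00E9, inside the BMP
astral = "\U0001D11E"  # U+1D11E MUSICAL SYMBOL G CLEF, outside the BMP

for label, s in (("bmp", bmp), ("astral", astral)):
    units = utf16_code_units(s)
    # One code unit for the BMP character, two (a surrogate pair) for the other.
    print(label, [hex(u) for u in units], "UCS-2 representable:", is_valid_ucs2(s))
```

Code that bridges two frameworks with 16-bit string types would, under these assumptions, need to treat a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF) as one character rather than two.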