"Martin v. Löwis", 28.01.2011 22:49: > And indeed, when Cython is updated to 3.3, it shouldn't access the UTF-8 > representation for such a loop. Instead, it should access the str > representation Sure. >> Regarding Cython specifically, the above will still be *possible* under >> the proposal, given that the memory layout of the strings will still >> represent the Unicode code points. It will just be trickier to implement >> in Cython's type system as there is no longer a (user visible) C type >> representation for those code units. > > There is: Py_UCS4 remains available. Thanks for that pointer. I had always thought that all "*UCS4*" names were platform specific and had completely missed that type. It's a lot nicer than Py_UNICODE because it allows users to fold surrogate pairs back into the character value. It's completely missing from the docs, BTW. Google doesn't give me a single mention for all of docs.python.org, even though it existed at least since (and likely long before) Cython's oldest supported runtime Python 2.3. If I had known about that type earlier, I could have ended up making that the native Unicode character type in Cython instead of bothering with Py_UNICODE. But this can still be changed I think. Since type inference was available before native Py_UNICODE support, it's unlikely that users will have Py_UNICODE written in their code explicitly. So I can make the switch under the hood. Just to explain, a native CPython C type is much better than an arbitrary integer type, because it allows Cython to apply specific coercion rules from and to Python object types. As currently Py_UNICODE, Py_UCS4 would obviously coerce from and to a 1 character Unicode string, but it could additionally handle surrogate pair splitting and combining automatically on current 16-bit Unicode builds so that you'd get a Unicode string with two code points on coercion to Python. 

>> While I'm somewhat confident that I'll
>> find a way to fix this in Cython, my point is just that this adds a
>> certain level of complexity to C code using the new memory layout that
>> simply wasn't there before.
>
> Understood. However, I think it is easier than you think it is.

Let's see about the implications once there is an implementation.

Stefan