Florian Weimer, 28.01.2011 10:35:
> * Stefan Behnel:
>> "Martin v. Löwis", 24.01.2011 21:17:
>>> The Py_UNICODE type is still supported but deprecated. It is always
>>> defined as a typedef for wchar_t, so the wstr representation can double
>>> as Py_UNICODE representation.
>>
>> It's too bad this isn't initialised by default, though. Py_UNICODE is
>> the only representation that can be used efficiently from C code
>
> Is this really true? I don't think I've seen any C API which actually
> uses wchar_t, beyond that what is provided by libc. UTF-8 and even
> UTF-16 are much, much more common.

They are also much harder to use, unless you are really only interested in
7-bit ASCII data - which is the case for most C libraries, so I believe
that's what you meant here. However, this is the CPython runtime with
built-in Unicode support, not the C runtime where it comes as an add-on at
best, and where Unicode processing without being Unicode aware is common.

The nice thing about Py_UNICODE is that it basically gives you native
Unicode code points directly, without needing to decode UTF-8 byte runs and
the like. In Cython, it allows you to do things like this:

    def test_for_those_characters(unicode s):
        for c in s:
            # warning: randomly chosen Unicode escapes ahead
            if c in u"\u0356\u1012\u3359\u4567":
                return True
        else:
            return False

The loop runs in plain C, using the somewhat obvious implementation with a
loop over Py_UNICODE characters and a switch statement for the comparison.
This would look a *lot* uglier with UTF-8 encoded byte strings.

Regarding Cython specifically, the above will still be *possible* under the
proposal, given that the memory layout of the strings will still represent
the Unicode code points. It will just be trickier to implement in Cython's
type system as there is no longer a (user visible) C type representation
for those code units. It can be any of uchar, ushort16 or uint32, none of
which is necessarily a 'native' representation of a Unicode character in
CPython.

While I'm somewhat confident that I'll find a way to fix this in Cython, my
point is just that this adds a certain level of complexity to C code using
the new memory layout that simply wasn't there before.

Stefan
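
To make the comparison with UTF-8 above a bit more concrete, here is a rough
sketch in plain Python rather than Cython. The function names and the
TARGETS set are made up for illustration; the four code points are the same
randomly chosen escapes as in the Cython snippet. The first function works
on code points directly, the second has to decode each UTF-8 byte run by
hand before it can compare anything. It assumes well-formed UTF-8 and does
no validation, so it is not a substitute for a real decoder.

```python
# Hypothetical names; the code points are the randomly chosen escapes
# from the Cython example in the post.
TARGETS = frozenset("\u0356\u1012\u3359\u4567")


def contains_target_codepoints(s: str) -> bool:
    """Code point view: each element of `s` is already one code point."""
    for c in s:
        if c in TARGETS:
            return True
    return False


def contains_target_utf8(data: bytes) -> bool:
    """UTF-8 view: decode each byte run by hand before comparing.

    Assumes `data` is well-formed UTF-8; no validation is attempted.
    """
    i = 0
    n = len(data)
    while i < n:
        lead = data[i]
        if lead < 0x80:      # 1-byte sequence (ASCII)
            cp, size = lead, 1
        elif lead < 0xE0:    # 2-byte sequence
            cp, size = lead & 0x1F, 2
        elif lead < 0xF0:    # 3-byte sequence
            cp, size = lead & 0x0F, 3
        else:                # 4-byte sequence
            cp, size = lead & 0x07, 4
        # Fold in the continuation bytes, six payload bits each.
        for b in data[i + 1:i + size]:
            cp = (cp << 6) | (b & 0x3F)
        if chr(cp) in TARGETS:
            return True
        i += size
    return False
```

Both functions should agree on any valid input, e.g.
contains_target_utf8(text.encode("utf-8")) should equal
contains_target_codepoints(text). The extra branching in the second one is
the kind of overhead that a fixed-width code unit representation avoids.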