On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
>...
> Well almost... it depends on the current value of <default encoding>.

Default encodings are kind of nasty when they can be altered. The same
problem occurred with import hooks: only one can be present at a time.
This implies that modules, packages, subsystems, whatever, cannot set a
default encoding, because something else might depend on it having a
different value. In the end, nobody uses the default encoding because it
is unreliable, so you end up with extra implementation and semantics
that aren't used or needed.

Have you ever noticed how Python modules, packages, tools, etc., never
define an import hook? I'll bet nobody ever monkeys with the default
encoding either...

I say axe it and make UTF-8 the fixed default encoding. If you want
something else, then do that explicitly.

>...
> Another problem is that Unicode types differ between platforms
> (MS VCLIB uses 16-bit wchar_t, while GLIBC2 uses 32-bit
> wchar_t). Depending on the internal format of Unicode objects
> this could mean calling different conversion APIs.

Exactly the reason to avoid wchar_t.

> BTW, I'm still not too sure about the underlying internal format.
> The problem here is that Unicode started out as a 2-byte fixed length
> representation (UCS2) but then shifted towards a 4-byte fixed length
> representation known as UCS4. Since having 4 bytes per character
> is a hard sell to customers, UTF16 was created to stuff the UCS4
> code points (this is what character entities are called in Unicode)
> into 2 bytes... with a variable length encoding.

History is basically irrelevant. What is the situation today? What is in
use, and what are people planning for right now?

>...
> The downside of using UTF16: it is a variable length format,
> so iterations over it will be slower than for UCS4.

Bzzt. May as well go with UTF-8 as the internal format, much like Perl
is doing (as I recall).
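[The UTF-16 "stuffing" described in the quoted text can be made
concrete. A minimal sketch in modern Python -- not the 1999 codebase
under discussion -- of how a code point above U+FFFF is split into a
two-unit surrogate pair; the function name is made up for illustration:]

```python
import struct

def to_surrogate_pair(cp):
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    cp -= 0x10000                   # 20 significant bits remain
    high = 0xD800 + (cp >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (cp & 0x3FF)     # low 10 bits -> low surrogate
    return high, low

# U+1D11E (MUSICAL SYMBOL G CLEF) needs two 16-bit code units in UTF-16:
high, low = to_surrogate_pair(0x1D11E)
# Cross-check against the stdlib codec (little-endian, no BOM):
assert struct.pack('<HH', high, low) == '\U0001D11E'.encode('utf-16-le')
```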
Why go with a variable length format, when people seem to be doing fine
with UCS-2? Like I said in the other mail note: two large platforms out
there are UCS-2 based. They seem to be doing quite well with that
approach.

If people truly need UCS-4, then they can work with that on their own.
One of the major reasons for putting Unicode into Python is to
increase/simplify its ability to speak to the underlying platform. Hey!
Guess what? That generally means UCS-2.

If we didn't need to speak to the OS with these Unicode values, then
people could work with the values entirely in Python,
PyUnicodeType-be-damned.

Are we digging a hole for ourselves? Maybe. But there are two other big
platforms that have the same hole to dig out of *IF* it ever comes to
that. I posit that it won't be necessary; the people needing UCS-4 can
do so entirely in Python.

Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and
vice-versa. But: it only does it from String to String -- you can't use
Unicode objects anywhere in there.

> Simply sticking to UCS2 is probably out of the question,
> since Unicode 3.0 requires UCS4 and we are targeting
> Unicode 3.0.

Oh? Who says?

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
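[For illustration, the String-to-String UCS-4/UTF-8 transcoding proposed
above looks like this in modern Python -- bytes-to-bytes, never touching
a Unicode object at the edges. This is an anachronistic sketch, since
today's codec machinery postdates this discussion:]

```python
# 'A' followed by U+10334 (GOTHIC LETTER AHSA), as UCS-4 big-endian bytes:
ucs4 = b'\x00\x00\x00A\x00\x01\x03\x34'
# Transcode to UTF-8 bytes (1 byte for 'A', 4 bytes for U+10334 --
# the variable-length trade-off debated above):
utf8 = ucs4.decode('utf-32-be').encode('utf-8')
assert utf8 == b'A\xf0\x90\x8c\xb4'
# Round-trip back to fixed-width UCS-4:
assert utf8.decode('utf-8').encode('utf-32-be') == ucs4
```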