Greg Stein wrote:
> > On Wed, 10 Nov 1999, M.-A. Lemburg wrote:
> >...
> > Well almost... it depends on the current value of <default encoding>.
>
> Default encodings are kind of nasty when they can be altered. The same
> problem occurred with import hooks. Only one can be present at a time.
> This implies that modules, packages, subsystems, whatever, cannot set a
> default encoding because something else might depend on it having a
> different value. In the end, nobody uses the default encoding because it
> is unreliable, so you end up with extra implementation/semantics that
> aren't used/needed.

I know, but this is a little different: strings are used all the time,
while import hooks are rarely used directly by the user. E.g. people in
Europe will probably prefer Latin-1 as the default encoding, while people
in Asia will use one of the common CJK encodings.

The <default encoding> decides which encoding to use for many typical
tasks: printing, str(u), "s" argument parsing, etc.

Note that setting the <default encoding> is not intended to be done prior
to single operations. It is meant to be settable at thread creation time.

> [...]
> > BTW, I'm still not too sure about the underlying internal format.
> > The problem here is that Unicode started out as a 2-byte fixed length
> > representation (UCS2) but then shifted towards a 4-byte fixed length
> > representation known as UCS4. Since having 4 bytes per character
> > is a hard sell to customers, UTF16 was created to stuff the UCS4
> > code points (this is what character entities are called in Unicode)
> > into 2 bytes... with a variable length encoding.
>
> History is basically irrelevant. What is the situation today? What is in
> use, and what are people planning for right now?
>
> >...
> > The downside of using UTF16: it is a variable length format,
> > so iterations over it will be slower than for UCS4.
>
> Bzzt. May as well go with UTF-8 as the internal format, much like Perl is
> doing (as I recall).
> Why go with a variable length format, when people seem to be doing fine
> with UCS-2?

The reason for UTF-16 is simply that it is identical to UCS-2 over large
ranges, which makes optimizations (e.g. the UCS2 flag I mentioned in an
earlier post) feasible and effective.

UTF-8 slows things down for CJK encodings, since the APIs will very often
have to scan the string to find the correct logical position in the data.

Here's a quote from the Unicode FAQ (http://www.unicode.org/unicode/faq/ ):

"""
Q: How about using UCS-4 interfaces in my APIs?

Given an internal UTF-16 storage, you can, of course, still index into
text using UCS-4 indices. However, while converting from a UCS-4 index to
a UTF-16 index or vice versa is fairly straightforward, it does involve a
scan through the 16-bit units up to the index point. In a test run, for
example, accessing UTF-16 storage as UCS-4 characters results in a 10X
degradation. Of course, the precise differences will depend on the
compiler, and there are some interesting optimizations that can be
performed, but it will always be slower on average. This kind of
performance hit is unacceptable in many environments.

Most Unicode APIs are using UTF-16. The low-level character indexing is at
the common storage level, with higher-level mechanisms for graphemes or
words specifying their boundaries in terms of the storage units. This
provides efficiency at the low levels, and the required functionality at
the high levels.

Convenience APIs can be produced that take parameters in UCS-4 methods for
common utilities: e.g. converting UCS-4 indices back and forth, accessing
character properties, etc. Outside of indexing, differences between UCS-4
and UTF-16 are not as important.

For most other APIs outside of indexing, character values cannot really be
considered outside of their context -- not when you are writing
internationalized code.
For such operations as display, input, collation, editing, and even upper
and lowercasing, characters need to be considered in the context of a
string. That means that in any event you end up looking at more than one
character. In our experience, the incremental cost of doing surrogates is
pretty small.
"""

> Like I said in the other mail note: two large platforms out there are
> UCS-2 based. They seem to be doing quite well with that approach.
>
> If people truly need UCS-4, then they can work with that on their own. One
> of the major reasons for putting Unicode into Python is to
> increase/simplify its ability to speak to the underlying platform. Hey!
> Guess what? That generally means UCS2.

All those formats are upward compatible (within certain ranges), and the
Python Unicode API will provide converters between its internal format and
the few common Unicode implementations, e.g. for MS compilers (16-bit UCS2
AFAIK) and GLIBC (32-bit UCS4).

> If we didn't need to speak to the OS with these Unicode values, then
> people can work with the values entirely in Python,
> PyUnicodeType-be-damned.
>
> Are we digging a hole for ourselves? Maybe. But there are two other big
> platforms that have the same hole to dig out of *IF* it ever comes to
> that. I posit that it won't be necessary; that the people needing UCS-4
> can do so entirely in Python.
>
> Maybe we can allow the encoder to do UCS-4 to UTF-8 encoding and
> vice-versa. But: it only does it from String to String -- you can't use
> Unicode objects anywhere in there.

See above.

> > Simply sticking to UCS2 is probably out of the question,
> > since Unicode 3.0 requires UCS4 and we are targetting
> > Unicode 3.0.
>
> Oh? Who says?

From the FAQ:

"""
Q: What is UTF-16?

Unicode was originally designed as a pure 16-bit encoding, aimed at
representing all modern scripts. (Ancient scripts were to be represented
with private-use characters.)
Over time, and especially after the addition of over 14,500 composite
characters for compatibility with legacy sets, it became clear that 16
bits were not sufficient for the user community. Out of this arose UTF-16.
"""

Note that there currently are no defined surrogate pairs for UTF-16,
meaning that in practice the difference between UCS-2 and UTF-16 is
probably negligible; e.g. we could define the internal format to be UTF-16
and raise an exception whenever the border between UTF-16 and UCS-2 is
crossed -- sort of as a political compromise ;-).

But... I think HP has the last word on this one.

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 51 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
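P.S.: To make the indexing cost and the UCS-2/UTF-16 border concrete,
here's a rough sketch in (modern) Python -- this is just an illustration
of the surrogate-pair mechanics and the O(n) scan the FAQ mentions, not
the proposed PyUnicode implementation; the function names are mine:

```python
def to_utf16_units(code_points):
    """Encode a sequence of UCS-4 code points as a list of 16-bit units."""
    units = []
    for cp in code_points:
        if cp < 0x10000:
            if 0xD800 <= cp <= 0xDFFF:
                raise ValueError("lone surrogate: %#x" % cp)
            units.append(cp)  # fits in one unit: identical to UCS-2
        elif cp <= 0x10FFFF:
            # the UCS-2/UTF-16 border: split into a surrogate pair
            cp -= 0x10000
            units.append(0xD800 | (cp >> 10))    # high surrogate
            units.append(0xDC00 | (cp & 0x3FF))  # low surrogate
        else:
            raise ValueError("code point out of range: %#x" % cp)
    return units

def unit_index(units, cp_index):
    """Map a code-point index to a 16-bit unit index.

    Note the linear scan -- this is the cost the FAQ measured as a
    10X degradation when accessing UTF-16 storage via UCS-4 indices.
    """
    i = 0
    for _ in range(cp_index):
        i += 2 if 0xD800 <= units[i] <= 0xDBFF else 1
    return i
```

For example, to_utf16_units([0x41, 0x10000, 0x42]) gives
[0x41, 0xD800, 0xDC00, 0x42], and unit_index(..., 2) returns 3, not 2.
As long as no characters above 0xFFFF are assigned (the current
situation), every string stays in the one-unit branch and unit_index
degenerates to the identity -- which is why UCS-2 and UTF-16 are
practically indistinguishable today.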