Fredrik Lundh wrote:
>
> Mike wrote:
> > Surely using a different type on different platforms means that we throw
> > away the concept of a platform independent Unicode string?
> > I.e. on Solaris, wchar_t is 32 bits, on Windows it is 16 bits.
>
> so? the interchange format doesn't have to be
> the same as the internal format, does it?

The interchange format (marshal + pickle) is defined as UTF-8,
so there's no problem with endianness or missing bits w/r to
shipping Unicode data from one platform to another.

> > Does this mean that to transfer a file between a Windows box and Solaris, an
> > implicit conversion has to be done to go from 16 bits to 32 bits (and vice
> > versa)? What about byte ordering issues?
>
> no problem at all: unicode has special byte order
> marks for this purpose (and utf-8 doesn't care, of
> course).

Access to this mark will go into sys: sys.bom.

> > Or do you mean whatever 16 bit data type is available on the platform, with
> > a standard (platform independent) byte ordering maintained?
>
> well, my preference is a 16-bit data type in the
> platform's native byte order (exactly how it's done in the
> unicode module -- for the moment, it can use the
> platform's wchar_t, but only if it happens to be a
> 16-bit unsigned type). gives you good performance,
> compact storage, and cleanest possible code.

The 0.4 proposal fixes this to 16-bit unsigned short
using UTF-16 encoding with checks for surrogates. This covers
all defined standard Unicode character points, is fast, etc. pp...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 49 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
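
To make the points above concrete, here is a minimal sketch of the three
ideas discussed: UTF-8 as a byte-order-free interchange form, 16-bit code
units in native byte order with surrogate pairs, and a byte order mark.
It is an illustration only, not the proposal's implementation. It assumes a
present-day Python 3 interpreter and uses codecs.BOM as a stand-in for the
proposed sys.bom; the names text, wire, units and to_surrogates are
hypothetical and chosen just for this example.

```python
import codecs
import sys

# ASCII, a Latin-1 range character, and one character outside the BMP.
text = "A\u00e9\U00010400"

# Interchange form: UTF-8 is a plain byte sequence, so byte order never matters.
wire = text.encode("utf-8")
assert wire.decode("utf-8") == text

# Internal-style form: 16-bit code units in the platform's native byte order.
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"
units = text.encode(native)
code_units = len(units) // 2  # the non-BMP character needs two code units


def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) surrogate pair."""
    cp -= 0x10000
    return 0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)


# Byte order mark: codecs.BOM is the UTF-16 BOM in native byte order
# (assumed here to play the role of the proposed sys.bom).
marked = codecs.BOM + units
assert marked.decode("utf-16") == text  # the decoder reads the BOM to pick the order
```

Prefixing the native-order data with the mark lets a reader on a machine
with the opposite byte order decode it correctly, while the UTF-8 form
needs no mark at all.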