[Henry S. Thompson]
> OK, I've never contributed to this discussion, but I have a long
> history of shipping widely used Python/Tkinter/XML tools (see my
> homepage).  I care _very_ much that heretofore I have been unable to
> support full XML because of the lack of Unicode support in Python.
> I've already started playing with 1.6a2 for this reason.

Thanks for chiming in!

> I notice one apparent mis-communication between the various
> contributors:
>
> Treating narrow-strings as consisting of UNICODE code points <= 255 is
> not necessarily the same thing as making Latin-1 the default encoding.
> I don't think on Paul and Fredrik's account encodings are relevant to
> narrow-strings at all.

I agree that's what they are trying to tell me.

> I'd rather go right away to the coherent position of byte-arrays,
> narrow-strings and wide-strings.  Encodings are only relevant to
> conversion between byte-arrays and strings.  Decoding a byte-array
> with a UTF-8 encoding into a narrow string might cause
> overflow/truncation, just as decoding a byte-array with a UTF-8
> encoding into a wide-string might.  The fact that decoding a
> byte-array with a Latin-1 encoding into a narrow-string is a memcopy
> is just a side-effect of the courtesy of the UNICODE designers wrt the
> code points between 128 and 255.
>
> This is effectively the way our C-based XML toolset (which we embed in
> Python) works today -- we build an 8-bit version which uses char*
> strings, and a 16-bit version which uses unsigned short* strings, and
> convert from/to byte-streams in any supported encoding at the margins.
>
> I'd like to keep byte-arrays at the margins in Python as well, for all
> the reasons advanced by Paul and Fredrik.
>
> I think treating existing strings as a sort of pun between
> narrow-strings and byte-arrays is a recipe for ongoing confusion.

Very good analysis.  Unfortunately this is where we're stuck, until we
have a chance to redesign this kind of thing from scratch.
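[A short sketch, in modern Python (whose bytes/str split postdates this thread), illustrating the Latin-1 point above: the first 256 Unicode code points were deliberately assigned the same numbers as the Latin-1 byte values, so decoding Latin-1 is effectively a byte-for-byte copy.]

```python
# Every possible byte value, 0x00 through 0xFF.
data = bytes(range(256))

# Latin-1 maps byte value n directly to Unicode code point n,
# so decoding never fails and changes no values.
text = data.decode("latin-1")
assert all(ord(ch) == b for ch, b in zip(text, data))

# The round trip is lossless for every byte value -- the "memcopy"
# property; no other single-byte encoding covers all 256 values this way.
assert text.encode("latin-1") == data
```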
Python 1.5.2 programs use strings for byte arrays probably as much as
they use them for character strings.  This is because way back in 1990,
when I was designing Python, I wanted to have the smallest set of basic
types, but I also wanted to be able to manipulate byte arrays somewhat.
Influenced by K&R C, I chose to make strings and string I/O 8-bit clean,
so that you could read a binary "string" from a file, manipulate it, and
write it back to a file, regardless of whether it was character or
binary data.

This model has never been challenged until now.  I agree that the Java
model (byte arrays and strings) or perhaps your proposed model (byte
arrays, narrow and wide strings) looks better.  But, although Python has
had rudimentary support for byte arrays for a while (the array module,
introduced in 1993), the majority of Python code manipulating binary
data still uses string objects.  My ASCII proposal is a compromise that
tries to be fair to both uses for strings.

Introducing byte arrays as a more fundamental type has been on the wish
list for a long time -- I see no way to introduce this into Python 1.6
without totally botching the release schedule (June 1st is very close
already!).  I'd like to be able to move on; there are other important
things still to be added to 1.6 (Vladimir's malloc patches, Neil's GC,
Fredrik's completed sre...).

For 1.7 (which should happen later this year) I promise I'll reopen the
discussion on byte arrays.

--Guido van Rossum (home page: http://www.python.org/~guido/)
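[A minimal sketch of the array module mentioned above, written against modern Python, where the method is spelled tobytes() rather than the older tostring(); the byte values are purely illustrative.]

```python
from array import array

# The 'B' typecode stores unsigned single bytes -- this is the
# "rudimentary support for byte arrays" referred to in the message.
buf = array('B', [0x47, 0x49, 0x46])
assert buf.tobytes() == b'GIF'

# Unlike strings, an array is mutable in place.
buf.append(0x38)
assert buf.tobytes() == b'GIF8'
```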