[Paul Prescod]
> I think that maybe an important point is getting lost here. I could be
> wrong, but it seems that all of this emphasis on encodings is misplaced.

In practical applications that manipulate text, encodings creep up all the time. I remember a talk or message by Andy Robinson about the messiness of producing printed reports in Japanese for a large investment firm. Most of the issues that took his time had to do with encodings, if I recall correctly. (Andy, do you remember what I'm talking about? Do you have a URL?)

> > The truth of the matter is: the encoding of string objects is in the
> > mind of the programmer. When I read a GIF file into a string object,
> > the encoding is "binary goop".
>
> IMHO, it's a mistake of history that you would even think it makes sense
> to read a GIF file into a "string" object and we should be trying to
> erase that mistake, as quickly as possible (which is admittedly not very
> quickly) not building more and more infrastructure around it. How can we
> make the transition to a "binary goops are not strings" world easiest?

I'm afraid that's a bigger issue than we can solve for Python 1.6. We're committed to by and large backwards compatibility while supporting Unicode -- the backwards compatibility with tons of extension modules (many 3rd party) requires that we deal with 8-bit strings in basically the same way as we did before.

> > The moral of all this? 8-bit strings are not going away.
>
> If that is a statement of your long term vision, then I think that it is
> very unfortunate. Treating string literals as if they were isomorphic
> with byte arrays was probably the right thing in 1991 but it won't be in
> 2005.

I think you're a tad too optimistic about the evolution speed of software (Windows 2000 *still* has to support DOS programs), but I see your point.
As I stated in another message, in Python 3000 we'll have to consider a more Java-esque solution: *character* strings are Unicode, and for bytes we have (mutable!) byte arrays.

Certainly 8-bit bytes as the smallest storage unit aren't going away.

> It doesn't meet the definition of string used in the Unicode spec., nor
> in XML, nor in Java, nor at the W3C nor in most other up and coming
> specifications.

OK, so that's a good indication of where you're coming from. Maybe you should spend a little more time in the trenches and a little less in standards bodies. Standards are good, but sometimes disconnected from reality (remember ISO networking? :-).

> From the W3C site:
>
> ""While ISO-2022-JP is not sufficient for every ISO10646 document, it is
> the case that ISO10646 is a sufficient document character set for any
> entity encoded with ISO-2022-JP.""

And this is exactly why encodings will remain important: entities encoded in ISO-2022-JP have no compelling reason to be recoded permanently into ISO10646, and there are lots of forces that make it convenient to keep them encoded in ISO-2022-JP (like existing tools).

> http://www.w3.org/MarkUp/html-spec/charset-harmful.html

I know that document well.

--Guido van Rossum (home page: http://www.python.org/~guido/)
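
A minimal sketch of the distinction discussed in this thread, assuming a present-day Python 3 interpreter (whose `str`, `bytes`, and `bytearray` types postdate the Python 1.6 under discussion). It shows that text decoded from ISO-2022-JP round-trips through Unicode, that the reverse is not guaranteed, and that byte data lives in a separate, mutable type. The sample strings and variable names are illustrative only.

```python
# Sketch only: Python 3 semantics, not the Python 1.6 behaviour discussed above.

# A character string (Unicode); the escapes spell the word "nihongo" in kanji.
text = "\u65e5\u672c\u8a9e"

# The same characters produce different byte sequences under different encodings.
as_jis = text.encode("iso2022_jp")
as_utf8 = text.encode("utf-8")
encodings_differ = as_jis != as_utf8

# Anything encoded in ISO-2022-JP can be decoded to Unicode and encoded back.
roundtrip_ok = as_jis.decode("iso2022_jp").encode("iso2022_jp") == as_jis

# The reverse does not hold: the euro sign has no ISO-2022-JP representation.
try:
    "\u20ac".encode("iso2022_jp")
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

# "Binary goop" such as a GIF header is kept in a mutable byte array, not a string.
goop = bytearray(b"GIF89a")
goop[3:6] = b"87a"  # edit the bytes in place
```

Keeping data in its original encoding stays practical under this model: the bytes are decoded only at the point where character-level operations are needed.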