> > Are you sure you understand what we are arguing about?
>
> Here's what I thought we were arguing about:
>
> If you put a bunch of "funny characters" into a Python string literal,
> and then compare that string literal against a Unicode object, should
> those funny characters be treated as logical units of text (characters)
> or as bytes? And if bytes, should some transformation be automatically
> performed to have those bytes be reinterpreted as characters according
> to some particular encoding scheme (probably UTF-8).
>
> I claim that we should *as far as possible* treat strings as character
> lists and not add any new functionality that depends on them being byte
> lists. Ideally, we could add a byte array type and start deprecating the
> use of strings in that manner. Yes, it will take a long time to fix this
> bug but that's what happens when good software lives a long time and the
> world changes around it.
>
> > Earlier, you quoted some reference documentation that defines 8-bit
> > strings as containing characters. That's taken out of context -- this
> > was written in a time when there was (for most people anyway) no
> > difference between characters and bytes, and I really meant bytes.
>
> Actually, I think that that was Fredrik.

Yes, I came across the post again later. Sorry.

> Anyhow, you wrote the documentation that way because it was the most
> intuitive way of thinking about strings. It remains the most intuitive
> way. I think that that was the point Fredrik was trying to make.

I just wish he had made the point more eloquently. The eff-bot seems to
be in a crunchy mood lately...

> We can't make "byte-list" strings go away soon but we can start moving
> people towards the "character-list" model. In concrete terms I would
> suggest that old fashioned strings be automatically coerced to Unicode
> by interpreting each byte as a Unicode character. Trying to go the
> other way could cause the moral equivalent of an OverflowError but
> that's not a problem.
>
> >>> a=1000000000000000000000000000000000000L
> >>> int(a)
> Traceback (innermost last):
>   File "<stdin>", line 1, in ?
> OverflowError: long int too long to convert
>
> And just as with ints and longs, we would expect to eventually unify
> strings and unicode strings (but not byte arrays).

OK, you've made your claim -- like Fredrik, you want to interpret 8-bit
strings as Latin-1 when converting (not just comparing!) them to
Unicode.

I don't think I've heard a good *argument* for this rule though. "A
character is a character is a character" sounds like an axiom to me --
something you can't prove or disprove rationally.

I have a bunch of good reasons (I think) for liking UTF-8: it allows you
to convert between Unicode and 8-bit strings without losses, Tcl uses it
(so displaying Unicode in Tkinter *just* *works*...), and it is not
Western-language-centric.

Another reason: while you may claim that your (and /F's, and Just's)
preferred solution doesn't enter into the encodings issue, I claim it
does: Latin-1 is just as much an encoding as any other one. I claim that
as long as we're using an encoding we might as well use the most
accepted 8-bit encoding of Unicode as the default encoding.
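A side-by-side sketch of the two rules under debate, written in modern
Python 3 syntax (an assumption: its bytes/str split roughly matches the
byte-array/character-string separation proposed above, and the sample
values are illustrative):

    # Hypothetical sample: the UTF-8 encoding of the four-character
    # string u'café' occupies five bytes ('é' takes two).
    data = b'caf\xc3\xa9'

    # Latin-1 rule: each byte becomes exactly one character.
    # It never fails, but multi-byte sequences come out as mojibake.
    print(data.decode('latin-1'))    # 'cafÃ©' (5 characters)

    # UTF-8 rule: multi-byte sequences are decoded into one character.
    print(data.decode('utf-8'))      # 'café' (4 characters)

    # UTF-8 round-trips any Unicode text without loss; Latin-1 cannot
    # encode code points above U+00FF at all.
    text = u'caf\xe9 \u4e2d\u6587'
    assert text.encode('utf-8').decode('utf-8') == text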
I also think that the issue is blown out of proportion: this ONLY
happens when you use Unicode objects, and it ONLY matters when some
other part of the program uses 8-bit string objects containing
non-ASCII characters. Given the long tradition of using different
encodings in 8-bit strings, at that point it is anybody's guess what
encoding is used, and UTF-8 is a better guess than Latin-1.

--Guido van Rossum (home page: http://www.python.org/~guido/)
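Both closing points, that a wrong Latin-1 guess goes unnoticed while a
UTF-8 guess is checkable, and Paul's "moral equivalent of an
OverflowError", can be sketched the same way (again assuming Python 3
syntax; the stray bytes are illustrative):

    # Latin-1 accepts every possible byte sequence, so a wrong guess
    # can never be detected: the decode silently "succeeds".
    stray = b'\xff\xfe'
    print(stray.decode('latin-1'))   # 'ÿþ'

    # UTF-8 rejects ill-formed input, much as int(a) rejects an
    # overlarge long in the traceback quoted above. A decode that
    # does succeed is therefore strong evidence the guess was right.
    try:
        stray.decode('utf-8')
    except UnicodeDecodeError as exc:
        print(exc)                   # ... invalid start byte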