On Tue, 2 May 2000, Guido van Rossum wrote:
> > P. P. S. If always having to specify encodings is really too much,
> > i'd probably be willing to consider a default-encoding state on the
> > Unicode class, but it would have to be a stack of values, not a
> > single value.
> 
> Please elaborate?

On general principle, it seems bad to just have a "set" method that
encourages people to set static state in a way that irretrievably
loses the current state.  For something like this, you want a "push"
method and a "pop" method with which to bracket a series of
operations, so that you can easily write code which politely leaves
other code unaffected.

For example:

    >>> x = unicode("d\351but")           # assume Guido-ASCII wins
    UnicodeError: ASCII encoding error: value out of range
    >>> x = unicode("d\351but", "latin-1")
    >>> x
    u'd\351but'
    >>> print x.encode("latin-1")         # on my xterm with Latin-1 fonts
    début
    >>> x.encode("utf-8")
    'd\303\251but'

Now:

    >>> u"".pushenc("latin-1")            # need a better interface to this?
    >>> x = unicode("d\351but")           # okay now
    >>> x
    u'd\351but'

    >>> u"".pushenc("utf-8")
    >>> x = unicode("d\351but")
    UnicodeError: UTF-8 decoding error: invalid data
    >>> x = unicode("d\303\251but")
    >>> print x.encode("latin-1")
    début
    >>> str(x)
    'd\303\251but'

    >>> u"".popenc()                      # back to the Latin-1 encoding
    >>> str(x)
    'd\351but'

    . . .

    >>> u"".popenc()                      # back to the ASCII encoding

Similarly, imagine:

    >>> x = u"<Japanese text...>"
    >>> file = open("foo.jis", "w")
    >>> file.pushenc("iso-2022-jp")
    >>> file.uniwrite(x)

    . . .

    >>> file.popenc()

    >>> import sys
    >>> sys.stdout.write(x)               # bad! x contains chars > 127
    UnicodeError: ASCII decoding error: value out of range
    >>> sys.stdout.pushenc("iso-2022-jp")
    >>> sys.stdout.write(x)               # on a kterm with kanji fonts
    <Japanese text...>

    . . .

    >>> sys.stdout.popenc()

The above examples incorporate the Guido-ASCII proposal, which makes
a fair amount of sense to me now.  How do they look to y'all?

This illustrates the remaining wart:

    >>> sys.stdout.pushenc("iso-2022-jp")
    >>> print x                           # still bad!  str is still doing ASCII
    UnicodeError: ASCII decoding error: value out of range
    >>> u"".pushenc("iso-2022-jp")
    >>> print x                           # on a kterm with kanji fonts
    <Japanese text...>

Writing to files asks the file object to convert from Unicode to
bytes, then write the bytes.  Printing converts the Unicode to bytes
first with str(), then hands the bytes to the file object to write.

This wart is really a larger printing issue.  If we want to solve it,
files have to know what to do with objects, i.e.

    print x

doesn't mean

    sys.stdout.write(str(x) + "\n")

instead it means

    sys.stdout.printout(x)

Hmm.  I think this might deserve a separate subject line.


-- ?!ng
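
P. S. To pin down the stack semantics i have in mind, here is a rough
sketch in plain Python.  Nothing here exists; the name EncodingStack
is invented for illustration, and a real version would hang off the
Unicode type itself rather than being a free-standing class:

    class EncodingStack:
        "A stack of default encodings; pop restores the enclosing default."
        def __init__(self, default="ascii"):
            self.stack = [default]        # the Guido-ASCII default sits at the bottom
        def push(self, name):             # what u"".pushenc(name) would do
            self.stack.append(name)
        def pop(self):                    # what u"".popenc() would do
            if len(self.stack) > 1:       # the bottom default can never be popped off
                del self.stack[-1]
        def current(self):
            return self.stack[-1]

The whole point of the stack is that pop() always restores whatever
default the enclosing code had pushed, so a library can bracket its
own conversions without clobbering anybody else's state.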
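
P. P. S. And the printout idea, equally hypothetical: PrintingFile,
uniwrite and printout are made-up names, and really this behaviour
would belong to the file type itself, not a wrapper class.  The sketch
just shows the file doing the Unicode-to-bytes conversion with its own
pushed encoding, instead of print going through str() and ASCII first:

    import sys, types

    class PrintingFile:
        "A file wrapper that converts Unicode to bytes itself on output."
        def __init__(self, file, default="ascii"):
            self.file = file
            self.encs = EncodingStack(default)    # per-file stack, as in the P. S.
        def pushenc(self, name):
            self.encs.push(name)
        def popenc(self):
            self.encs.pop()
        def uniwrite(self, u):
            # the *file* picks the encoding, instead of str() doing ASCII
            self.file.write(u.encode(self.encs.current()))
        def printout(self, x):
            if type(x) is types.UnicodeType:
                self.uniwrite(x)
            else:
                self.file.write(str(x))
            self.file.write("\n")

    out = PrintingFile(sys.stdout)
    out.pushenc("iso-2022-jp")
    out.printout(u"<Japanese text...>")   # the file does the conversion, not str()
    out.popenc()                          # politely restore the old default

If "print x" on such a file meant sys.stdout.printout(x), the wart
above would go away.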