Wow, this almost looks like a real flamefest. ("Flame" being defined as the presence of metacomments.)

(In the following, s is an 8-bit string, u is a Unicode string, and e is an encoding name.)

The original design of the encode() methods of string and Unicode objects (in 2.0 and 2.1) is asymmetric, and clearly geared towards Unicode codecs only: to decode an 8-bit string you *have* to use unicode(s, e), while to encode a Unicode string into a specific 8-bit encoding you *have* to use u.encode(e).

8-bit strings also have an encode() method: s.encode(e) is the same as unicode(s).encode(e). (This is useful since code that expects Unicode strings should also work when it is passed ASCII-encoded 8-bit strings.)

I'd say there's no need for s.decode(e), since this can already be done with unicode(s, e) -- and to me that API looks better, since it clearly states that the result is Unicode.

We *could* have designed the encoding API similarly: str(u, e) is available, symmetric with unicode(s, e), and a logical extension of str(u), which uses the default encoding. But I accept the argument that u.encode(e) is better because it emphasizes the encoding action, and because it means no API changes to str().

I guess what I'm saying here is that 'str' does not give enough of a clue that an encoding action is going on, while 'unicode' *does* give a clue that a decoding action is being done: as soon as you read "Unicode" you think "Mmm, encodings..." -- but "str" is pretty neutral, so u.encode(e) is needed to give a clue.

Marc-Andre proposes (and has partially checked in) changes that stretch the meaning of the encode() method, and add a decode() method, to be basically interfaces to anything you can do with the codecs module. The return type of encode() and decode() is now determined by the codec (formerly, encode() always returned an 8-bit string). Some new codecs have been added that do things like gzip and base64.

Initially, I liked this, and even contributed a codec.
But questions keep coming up. What is the problem being solved?

True, the codecs module has a clumsy interface if you just want to invoke a codec on some data. But that can easily be remedied by adding convenience functions encode() and decode() to codecs.py -- which would have the added advantage that it would work for other datatypes that support the buffer interface, e.g. codecs.encode(myPILobject, "base64").

True, the "codec" pattern can be used for other encodings than Unicode. But it seems to me that the entire codecs architecture is rather strongly geared towards en/decoding Unicode, and it's not clear how well other codecs fit in this pattern (e.g. I noticed that all the non-Unicode codecs ignore the error handling parameter or assert that it is set to 'strict').

Is it really right that x.encode("gzip") and x.encode("utf-8") look similar, while the former requires an 8-bit string and the latter only makes sense if x is a Unicode string?

Another (minor) issue is that Unicode encoding names are an IANA namespace. Is it wise to add our own names to this?

I'm not forcing a decision here, but I do ask that we consider these issues before forging ahead with what might be a mistake. A PEP would be most helpful to focus the discussion.

--Guido van Rossum (home page: http://www.python.org/~guido/)
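The convenience functions proposed above did eventually land in this shape: modern Python's codecs.encode() and codecs.decode() invoke a named codec directly, including bytes-to-bytes transforms such as base64 that are not reachable through the str/bytes methods. A minimal sketch:

```python
import codecs

# Invoke a non-Unicode codec through the module-level convenience
# functions rather than a .encode() method on the object itself.
data = b"hello world"
encoded = codecs.encode(data, "base64")   # bytes in, bytes out
decoded = codecs.decode(encoded, "base64")
assert decoded == data
```

Note that this also answers the type question raised above: because the functions are free-standing, there is no suggestion that "base64" and "utf-8" operate on the same kind of input.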