> Con: URI encoding does not encode characters.

OK, for all the people who say URI encoding does not encode characters: yes it does. This is not an encoding for binary data, it's an encoding for character data, but it's unspecified how the strings map to octets before being percent-encoded. From RFC 3986, section 1.2.1 <http://tools.ietf.org/html/rfc3986#section-1.2.1>:

> Percent-encoded octets (Section 2.1) may be used within a URI to represent
> characters outside the range of the US-ASCII coded character set if this
> representation is allowed by the scheme or by the protocol element in which
> the URI is referenced. Such a definition should specify the character
> encoding used to map those characters to octets prior to being
> percent-encoded for the URI.

So the string->string proposal is actually correct behaviour. I'm all in favour of a bytes->string version as well, just not with the names "quote" and "unquote".

I'll prepare a new patch shortly which has bytes->string and string->bytes versions of the functions as well. (quote will accept either type, while unquote will output a str, there will be a new function unquote_to_bytes which outputs a bytes - is everyone happy with that?)

Guido says:

> Actually, we'd need to look at the various other APIs in Py3k before we can
> decide whether these should be considered taking or returning bytes or text.
> It looks like all other APIs in the Py3k version of urllib treat URLs as
> text.

Yes, as I said in the bug tracker, I've groveled over the entire stdlib to see how my patch affects the behaviour of dependent code. Aside from a few minor bits which assumed octets (and did their own encoding/decoding) (which I fixed), all the code assumes strings and is very happy to go on assuming this, as long as the URIs are encoded with UTF-8, which they almost certainly are.

Guido says:

> I think the only change is to remove the encoding arguments and ...

You really want me to remove the encoding= named argument? And hard-code UTF-8 into these functions? It seems like we may as well have the optional encoding argument, as it does no harm and could be of significant benefit.

I'll post a patch with the unquote_to_bytes function, but leave the encoding arguments in until this point is clarified.

Matt
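
To make the str/bytes split and the encoding= argument concrete, here is a minimal sketch. It assumes the functions as they exist in current Python 3's urllib.parse, which may differ in detail from the patch being discussed in this thread; the sample string is made up purely for illustration.

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# Hypothetical sample text containing non-ASCII characters.
text = "naïve café"

# str -> str: characters are mapped to octets (UTF-8 unless told otherwise),
# and those octets are then percent-encoded.
utf8_quoted = quote(text)                        # 'na%C3%AFve%20caf%C3%A9'
latin1_quoted = quote(text, encoding="latin-1")  # 'na%EFve%20caf%E9'

# str -> str: unquote decodes the octets back to characters.
# The same encoding must be used on both sides to round-trip.
utf8_roundtrip = unquote(utf8_quoted)
latin1_roundtrip = unquote(latin1_quoted, encoding="latin-1")

# Decoding Latin-1 octets as UTF-8 does not raise by default; the
# undecodable octets become U+FFFD replacement characters instead.
mismatched = unquote(latin1_quoted)

# str -> bytes: no character decoding at all, just the raw octets.
raw = unquote_to_bytes(latin1_quoted)            # b'na\xefve caf\xe9'

# quote also accepts bytes directly, in which case no encoding is involved.
from_bytes = quote(b"\xff\x00")                  # '%FF%00'
```

The Latin-1 case is the sort of situation the optional encoding argument is meant for: the percent-encoded form alone does not say which character encoding produced the octets, so the caller has to supply it.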