Mike Brown wrote:

> 1. urlopen() cannot reliably process unicode unless there are no
> percent-encoded octets above %7F and no characters above \u007f
> (I think that's the gist of it, at least).

And that feature is by design. URLs are conceptually byte strings, not
character strings, so passing Unicode strings is mostly a meaningless
operation. Mostly - because if the Unicode string is pure ASCII, it
probably matches most implementations and user expectations to convert
it to pure ASCII first, and then treat it as a URL.

The IETF is working on resolving the issue by introducing IRIs. It
appears that draft-duerst-iri-09.txt is what will become the relevant
RFC. Once the RFC is published, urllib and urllib2 should be updated
to support IRIs; contributions are welcome.

> I don't think this is necessarily a bug, as a proper URI will never contain
> non-ASCII characters. However since urlopen()'s API is unfortunately such that
> it accepts OS-specific filesystem paths, which nowadays may be unicode, it may
> be time to tighten up the API and say that the url argument *must* be a URI,
> and that if unicode is given, it will be converted to str and thus must not
> contain non-ASCII characters.

No. I'd rather specify that if it is a Unicode string, it must be an
IRI, and is converted to a URI according to the IRI spec.

> 2. urlopen() (the URI scheme-specific openers it uses, actually) does not
> percent-decode the host portion of a URL before doing a DNS lookup.
>
> This wasn't really a problem until IDNs came along; no one was using non-ASCII
> in their hostnames. But now we have to deal with URLs where the host component
> is a string of percent-encoded UTF-8 octets.

Hmm. I think there is no backing in any standard for doing that.
Applications that put URL-escaped UTF-8 bytes into host names deserve
to lose. There are two valid ways of putting non-ASCII characters into
the hostname part of a URL: use Unicode strings, or use IDNA.
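
To make the Unicode-to-URI conversion concrete, here is a minimal sketch,
assuming Python 3. The helper name iri_to_uri and the example host are
made up for illustration, and this is not what urllib does today. It
maps the host through the built-in "idna" codec and percent-encodes
the UTF-8 bytes of any other non-ASCII characters. It ignores userinfo
and several corner cases of the IRI draft.

```python
from urllib.parse import quote, urlsplit, urlunsplit

# Characters left untouched when percent-encoding; "%" is included so
# that escapes already present in the input are not encoded twice.
_SAFE = "/:@!$&'()*+,;=%~-._"


def iri_to_uri(iri):
    """Hypothetical helper: map a Unicode IRI to an ASCII-only URI."""
    parts = urlsplit(iri)

    # Host: apply IDNA via the built-in "idna" codec.
    # Note: userinfo (user:password@) is dropped in this sketch.
    host = parts.hostname.encode("idna").decode("ascii") if parts.hostname else ""
    netloc = host if parts.port is None else "%s:%d" % (host, parts.port)

    # Everything else: percent-encode the UTF-8 bytes of non-ASCII characters.
    path = quote(parts.path, safe=_SAFE)
    query = quote(parts.query, safe=_SAFE + "?")
    fragment = quote(parts.fragment, safe=_SAFE + "?")

    return urlunsplit((parts.scheme, netloc, path, query, fragment))


print(iri_to_uri("http://bücher.example/straße?q=ä"))
# http://xn--bcher-kva.example/stra%C3%9Fe?q=%C3%A4
```

The host never goes through percent-encoding here; only IDNA is
applied to it, which matches the second of the two ways named above.
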
It may be that IRIs add another way (I haven't checked this aspect
specifically), but unless there is some RFC supporting such a
protocol, any response by urllib is fine, exceptions preferred.

> Even though IDNs are the main application for percent-encoded octets in the
> host component, it is necessary in simpler cases as well, like
>
> 'http://www.w%33.org'
>
> which would need to be interpreted as
>
> 'http://www.w3.org'

We would have to check: this might be valid usage, but I somewhat
doubt it.

> urllib's urlopeners were *not* updated accordingly. This should be changed.

The change was deliberately deferred until the IRI RFC is published.

> 3. On Windows, urlopen() only recognizes '|' as a Windows drivespec character,
> whereas ':' is just as, if not more, common in 'file' URIs.

I have long ago given up trying to understand this issue. I'm happy to
change this back and forth about once or twice a year, until somebody
comes up with a clear and definitive story, backed up by standards and
product documentation, so that we might get a stable implementation
some day. Feel free to write patches.

Regards,
Martin