Mike Brown wrote:
> No. The intent is actually that a URI is (not conceptually, just *is*) a
> string of characters

You are right: URIs are meant to be written on paper. However, RFC 2396
also acknowledges that the issue of non-ASCII characters is unresolved.
It suggests (in 2.1) that the URI scheme should specify the
interpretation of byte values.

> This was actually clear in RFC 2396 sections 1.5 and 2, but has been explained
> somewhat better in the rephrased section 2 of rfc2396bis, which is in Last
> Call.

This suggests that new URI schemes should mandate UTF-8 in the
components, but is silent on the issue of existing schemes.

> The question is, does the url argument to urlopen() purport to be or is it
> assumed to be a URL? The function is quite lenient about what it accepts as a
> URL -- it accepts pretty much anything you give it, be it unicode or str, with
> or without a scheme component, relative to some unknown base, and loaded with
> illegal characters, and it tries to deal with it as best it can -- yet it
> still rejects or inconsistently handles some valid URIs, and this is what I
> want to see changed.

If something passed to it is clearly a valid URL, and there is a clear
definition of how a computer should process it, and urllib doesn't, then
this is certainly a bug and should be fixed. Can you give an example of
such a URL?

> Perhaps I should rephrase part of the issue this way: If the argument to
> urlopen() is assumed to be a URI, then %FF in the argument should not be
> interpreted any differently when the argument is a str vs when it is unicode.

Certainly. Indeed, urllib makes no difference, AFAICT.
"http://localhost/%FF" and u"http://localhost/%FF" are processed in the
same way.

> RFC 2396 left it ambiguous as to what characters are represented by %80-%FF,
> so an implementation thereof may make such interpretations as it pleases.
> The current implementation doesn't do this in a consistent manner.

No. RFC 2396 defers the specification to the specific scheme.

>>Applications that put URL-escaped UTF-8 bytes into host names deserve to
>>lose.
>
> Come February or whenever rfc2396bis and the IRI draft become RFCs, that
> will no longer be a position you can maintain.

I see. I think I could accept a patch in this direction for Python 2.4
even if RFC2396bis isn't published, assuming the patch arrives before
2.4b1.

> Let me be clear though - I am not suggesting getting rid of support for '|'.
> I am merely saying that there is no reason ':' should, on Windows, fail to
> be treated the same as '|' for the purpose of representing the ':' in a
> drivespec.

I know that I personally won't touch this code, except for applying
patches. So if you have a clear vision of what needs to be changed and
how, submit a patch.

As for using regular expressions in the standard library: It seems you
believe this is discouraged. I don't know why you think so - I've never
heard of such a constraint before (in general - in specific cases,
submitters may have been told that alternatives are more efficient).

Regards,
Martin