Ian Bicking writes:

> Just for perspective, I don't know if I've ever wanted to deal with a URL
> like that.

Ditto for perspective: I do, many times a day, for Japanese media sites and Wikipedia.

> I know how it is supposed to work, and I know what a browser does
> with that, but so many tools will clean that URL up *or* won't be
> able to deal with it at all that it's not something I'll be passing
> around.

I'm not suggesting that it is something you want to be "passing around"; it's a presentation form, and I prefer that the internal form use Unicode.

> While it's nice to be correct about encodings, sometimes it is
> impractical. And it is far nicer to avoid the situation entirely.

But you cannot avoid it entirely. Processing bytes means you are assuming ASCII compatibility. Granted, this is a pretty good assumption, especially if you got the bytes off the wire, but it's not universally so.

Maybe it's a YAGNI, but one reason I prefer the decode-process-encode paradigm is that the choice of codec is a specification of the assumptions you're making about the encoding. So the Know-Nothing codec described above assumes just enough ASCII compatibility to parse the scheme. You could also have codecs which assume just enough ASCII compatibility to parse a hierarchical scheme, etc.

> That is, decoding content you don't care about isn't just
> inefficient, it's complicated and can introduce errors.

That depends on the codec(s) used.

> Similarly I'd expect (from experience) that a programmer using
> Python to want to take the same approach, sticking with unencoded
> data in nearly all situations.

Indeed, a programmer using Python 2 would want to do so, because all her literal strings are bytes by default (i.e., if she doesn't mark them with a u prefix), and interactive input is, too. This is no longer so obvious in Python 3, which takes the attitude that things that are expected to be human-readable should be processed as str.
The obvious example in URI space is the file:/// URL, which you'll typically build up from a user string or a file browser, which will call the os.path stuff which returns str. Text editors and viewers will also use str for their buffers, and if they provide a way to fish out URIs for their users, they'll probably return str. I won't pretend to judge the relative importance of such use cases. But use cases for urllib which naturally favor str until you put the URI on the wire do exist, as does the debugging presentation aspect.
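To make the file:/// case concrete, here is a minimal sketch, assuming an absolute POSIX-style path and UTF-8 percent-encoding. The helper name file_uri_from_path and the sample path are made up for illustration; quote and unquote are the real urllib.parse functions. The URI stays str from the path through to display, and only becomes bytes at the point where it would go on the wire.

```python
import posixpath
from urllib.parse import quote, unquote


def file_uri_from_path(path: str) -> str:
    """Build a file: URI as str from an absolute POSIX-style path.

    Hypothetical helper for illustration only.
    """
    # quote() percent-encodes non-ASCII characters as UTF-8 by default
    # and leaves "/" alone, so the path structure survives.
    return "file://" + quote(path)


# Stand-in for what a file browser or the os.path functions hand back: str.
# posixpath is used directly so the result does not depend on the platform.
path = posixpath.join("/tmp", "日本語.txt")

uri = file_uri_from_path(path)  # internal form: str
wire = uri.encode("ascii")      # bytes only when it goes on the wire
shown = unquote(uri)            # readable presentation form for debugging
```

Here shown is the form you would want in a log or a debugger, while wire is the form nobody wants to read but every tool can pass around.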