James Y Knight writes:

 > The surrogateescape method is a nice workaround for this, but I can't
 > help thinking that it might've been better to just treat stuff as
 > possibly-invalid-but-probably-utf8 byte-strings from input, through
 > processing, to output.

This is the world we already have, modulo s/utf8/ascii + random GR
charset/.  It doesn't work, and it can't, in Japan or China or Korea,
and probably not in Russia or Kazakhstan, for some time yet.

That's not to say that byte-oriented processing doesn't have its
place.  And in many cases it's reasonable (but not secure or
bulletproof!) to assume ASCII compatibility of the byte stream,
passing through syntactically unimportant bytes verbatim.  Syntactic
analysis of such streams will surely have a lot in common with that
for text streams, so the same tools should be available.  (That's the
point of Guido's endorsement of polymorphism, AIUI.)

But it's just not reasonable to assume that will work in a context
where text streams from various sources are mixed with byte streams.
In that case, the byte streams need to be converted to text before
mixing.  (You can't do it the other way around because there is no
guarantee that the text is compatible with the current encoding of the
byte stream, nor that all the byte streams have the same encoding.)

We do need str-based implementations of modules like urllib.
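
To make the mixing argument concrete, here is a minimal sketch in Python 3. All of the names and sample strings are hypothetical, chosen only for illustration. It shows that str and bytes cannot be combined directly, that decoding each byte stream with its own encoding makes mixing possible, that encoding text into a stream's encoding can fail, and that surrogateescape lets undecodable bytes round-trip unchanged.

```python
# Hypothetical inputs: one str and two byte streams in different encodings.
text = "日本語"
utf8_stream = "naïve".encode("utf-8")
koi8_stream = "Привет".encode("koi8_r")

# Mixing str with bytes directly raises TypeError in Python 3.
try:
    text + utf8_stream
    direct_mix_works = True
except TypeError:
    direct_mix_works = False

# Convert each byte stream to text using its own encoding, then mix.
combined = " ".join(
    [text, utf8_stream.decode("utf-8"), koi8_stream.decode("koi8_r")]
)

# Going the other way is not guaranteed to work: this text has no
# representation in the KOI8-R stream's encoding.
try:
    text.encode("koi8_r")
    text_fits_stream_encoding = True
except UnicodeEncodeError:
    text_fits_stream_encoding = False

# surrogateescape maps undecodable bytes to lone surrogates, so the
# original bytes can be recovered exactly on output.
unknown = b"caf\xe9 \xff"  # not valid UTF-8
as_text = unknown.decode("utf-8", errors="surrogateescape")
round_tripped = as_text.encode("utf-8", errors="surrogateescape")
```

Note that the surrogateescape result is only safe to write back out with the same error handler; a strict encode of as_text would raise UnicodeEncodeError.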