On 22-Apr-08, at 12:30 AM, Martin v. Löwis wrote:

>> IMO, encoding estimation is something that many web programs will have
>> to deal with

> Can you please explain why that is? Web programs should not normally
> have the need to detect the encoding; instead, it should be specified
> always - unless you are talking about browsers specifically, which
> need to support web pages that specify the encoding incorrectly.

Two cases come immediately to mind: web forms and email. When a web
browser POSTs data, there is no standard way for it to communicate which
encoding it is using. There are some hints that make guessing easier (the
accept-charset attribute, the encoding used to send the page to the
browser), but no guarantees.

Email is a smaller problem, because it usually has a helpful
content-type header, but again that's no guarantee.

At the moment, the only data I have to support this claim is my
experience with DrProject in non-English locales. If I'm the only one
who has had these sorts of problems, I'll go back to "Unicode for
Dummies".

>> so it might as well be built in; I would prefer the option
>> to run `text=input.encode('guess')` (or something similar) than relying
>> on an external dependency or worse yet using a hand-rolled algorithm.

> Ok, let me try differently then. Please feel free to post a patch to
> bugs.python.org, and let other people rip it apart.

> For example, I don't think it should be a codec, as I can't imagine it
> working on streams.

As things frequently are, this seems to be a much larger problem than I
originally believed. I'll go back and take another look at it, and come
back if new revelations appear.
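For concreteness, here is a minimal sketch of the kind of hand-rolled fallback
decoding I mean for POSTed form data: try the charset the page hinted at, then
a couple of common encodings. The helper name, the candidate list, and the hint
parameter are my own illustration, not the proposed 'guess' codec:

    # Illustrative only: a hand-rolled fallback decoder for form bytes.
    # The candidate order (hint, then UTF-8, then Latin-1) is an assumption
    # about what a typical web app might try, not a recommendation.
    def guess_decode(raw, hint=None):
        """Return (text, encoding_used) for POSTed bytes of unknown encoding."""
        candidates = ([hint] if hint else []) + ["utf-8", "latin-1"]
        for name in candidates:
            try:
                return raw.decode(name), name
            except (UnicodeDecodeError, LookupError):
                continue
        # Latin-1 maps every byte, so this is only reached if the list is empty.
        return raw.decode("latin-1", "replace"), "latin-1"

    # Example: a browser that silently sent Latin-1 despite a UTF-8 page.
    text, used = guess_decode(b"na\xefve", hint="utf-8")
    print(text, used)   # -> naïve latin-1

Every non-trivial web app I've seen ends up with something like this; the
question in this thread is whether the guessing step belongs in the stdlib.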