>>>>> "Martin" == Martin v Loewis <martin@v.loewis.de> writes: Martin> Reliable detection of encodings is a good thing, though, I would think that UTF-8 can be quite reliably detected without the "BOM". I suppose you could construct short ambiguous sequences easily for ISO-8859-[678] (which are meaningful in the corresponding natural language), but it seems that even a couple dozen characters would make the odds astronomical that "in the wild" syntactic UTF-8 is intended to be UTF-8 Unicode (assuming you're expecting a text file, such as Python source). Is that wrong? Have you any examples? I'd be interested to see them; we (XEmacs) have some ideas about "statistical" autodetection of encodings, and they'd be useful test cases. Martin> as the Web has demonstrated. But the Web in general provides (mandatory) protocols for identifying content-type, yet I regularly see HTML files with incorrect http-equiv meta elements, and XHTML with no encoding declaration containing Shift JIS. Microsoft software for Japanese apparently ignores Content-Type headers and the like in favor of autodetection (probably because the same MS software regularly relies on users to set things like charset parameters in MIME Content-Type). I can't tell my boss that his mail is ill-formed (well, not to any effect). So I'd really love to be able to watch his face when Python 2.3 tells him his program is not legally encoded. But I guess that's not convincing enough reason for Guido to mandate UTF-8.<wink> -- Institute of Policy and Planning Sciences http://turnbull.sk.tsukuba.ac.jp University of Tsukuba Tennodai 1-1-1 Tsukuba 305-8573 JAPAN Don't ask how you can "do" free software business; ask what your business can "do for" free software.