Victor Stinner wrote:
> On Friday 08 January 2010 05:21:04, Guido van Rossum wrote:
> (...)
>> (And yes, I know this happens. Doesn't mean we need to auto-guess by
>> default; there are lots of issues e.g. what should happen after
>> seeking to offset 0?)
>
> I wrote a new version of my patch (version 3):
>
>  * don't change the default behaviour: use open(filename, encoding="BOM") to
>    check the BOM if there is any
>  * fix for seek(0): always ignore the BOM
>  * add a unit test: check that the right encoding is detected, but also that
>    the BOM is ignored (especially after a seek(0))
>
> BOM encoding doesn't work for writing into a file, so open(filename, "w",
> encoding="BOM") raises a ValueError.
>

I think it's similar to universal newline mode. You can tell it that
you're reading UTF-something-encoded text (common forms only). The
preference is UTF-8, but it could be UTF-8-sig (from Windows), or
possibly UTF-16/32, which really need a BOM because each code unit spans
multiple bytes, so the bytes could be big-endian or little-endian. The
BOM (or signature) tells you what the encoding is, defaulting to UTF-8
if there's none. If it subsequently raises a UnicodeDecodeError, then so
be it!

Maybe there should also be a way of determining what encoding it decided
it was, so that you can then write a new file in that same encoding.
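
The behaviour discussed in this message (let the BOM name the encoding, fall back to UTF-8 when there is none, and be able to ask afterwards which encoding was chosen) could be sketched roughly as follows. This is only an illustration built on the standard codecs module, not the patch itself; the names detect_bom_encoding and open_bom are hypothetical, and the sketch is read-only.

```python
import codecs

# Hypothetical lookup table. UTF-32 must be tested before UTF-16 because
# the UTF-32-LE BOM (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def detect_bom_encoding(filename, default="utf-8"):
    """Return a codec name based on the file's BOM, or `default` if none."""
    with open(filename, "rb") as f:
        head = f.read(4)  # the longest BOM checked here is 4 bytes
    for bom, encoding in _BOMS:
        if head.startswith(bom):
            return encoding
    return default


def open_bom(filename, default="utf-8"):
    """Open `filename` for reading text with the BOM-detected encoding.

    The "utf-8-sig", "utf-16" and "utf-32" codecs consume the BOM
    themselves, so it does not show up in the decoded text.
    """
    return open(filename, "r", encoding=detect_bom_encoding(filename, default))
```

With this sketch, the encoding that was picked is available as the `encoding` attribute of the returned file object, which covers the "what did it decide" question above. One caveat: writing a new file with "utf-16" or "utf-32" emits a BOM in the machine's native byte order, so the byte order may differ from the original file's.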