Lennart Regebro wrote:
> On Mon, Jan 11, 2010 at 11:37, Walter Dörwald <walter at livinglogic.de> wrote:
>> UTF-8 might be a good choice
>
> No, fallback if there is no BOM should be the local settings, just as
> fallback is today if you don't specify a codec.
> I mean, what if you want to look for a BOM but fall back to something
> else? How far will we go with encoding special information in the
> codecs names? codec='BOM else UTF-16 else locale'? :-)
>
> BOM is not a locale, and should not be a locale. Having a locale
> called BOM is wrong per se. It should either be default to look for a
> BOM when codec=None, or a special parameter. If none of these are
> desired, then we need a special function that takes a filename or file
> handle, and looks for a BOM and returns the codec found or None. But
> I find that much less natural and obvious than checking the BOM when
> codec=None.
>

Or pass a function that accepts a byte stream or the first few bytes and
returns the encoding and any unused bytes (because the byte stream might
not be seekable)?

    def guess_encoding(byte_stream):
        data = byte_stream.read(2)
        if data == b"\xFE\xFF":
            return "UTF-16BE", b""
        return "UTF-8", data

    text_file = open(filename, encoding=guess_encoding)
    ...
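A slightly fuller sketch of the callback idea, for illustration only. As far as I know, open() in released Python versions accepts only an encoding name, not a callable, so this version applies the guess by hand. The names _BOMS, open_guessing and the default parameter are hypothetical, and the whole stream is read into memory to keep the example short.

```python
import codecs
import io

# Hypothetical lookup table. Longer BOMs come first so that the UTF-32 LE
# mark (FF FE 00 00) is not mistaken for the UTF-16 LE mark (FF FE).
# Note: UTF-16 LE text that starts with a NUL character is ambiguous here.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def guess_encoding(byte_stream, default="utf-8"):
    """Return (encoding, unused_bytes) for a possibly non-seekable stream.

    Assumes read(4) returns up to four bytes in a single call, which holds
    for ordinary files and io.BytesIO but not for every kind of stream.
    """
    data = byte_stream.read(4)
    for bom, name in _BOMS:
        if data.startswith(bom):
            # Hand back whatever was read past the BOM so nothing is lost.
            return name, data[len(bom):]
    # No BOM found: all bytes read so far belong to the text itself.
    return default, data


def open_guessing(byte_stream, guess=guess_encoding):
    """Hypothetical stand-in for open(..., encoding=<callable>)."""
    encoding, unused = guess(byte_stream)
    # Prepend the unused bytes instead of seeking back.
    payload = unused + byte_stream.read()
    return io.StringIO(payload.decode(encoding))
```

Usage would look something like:

```python
with open(filename, "rb") as raw:   # filename as in the example above
    text_file = open_guessing(raw)
```

Falling back to the locale instead of UTF-8, as Lennart prefers, would just mean passing a different default.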