Victor Stinner wrote:
> On Friday 08 January 2010 10:10:23, Martin v. Löwis wrote:
>>> The builtin open() function is unable to open a UTF-16/32 file starting
>>> with a BOM if the encoding is not specified (it raises a Unicode error).
>>> For a UTF-8 file starting with a BOM, read()/readline() also returns the
>>> BOM, whereas the BOM should be "ignored".
>> It depends. If you use the utf-8-sig encoding, it *will* ignore the
>> UTF-8 signature.
>
> Sure, but that means you only ever see UTF-8+BOM files. If you get both
> UTF-8 and UTF-8+BOM files, you have to detect the encoding (not an easy
> job) or remove the BOM after the first read (much harder if you use a
> module like ConfigParser or csv).
>
>>> Since my proposition changes the result of TextIOWrapper.read()/readline()
>>> for files starting with a BOM, we might introduce an option to open() to
>>> enable the new behaviour. But is it really needed to keep backward
>>> compatibility?
>> Absolutely. And there is no need to add a new option; instead, use the
>> existing options: define an encoding that auto-detects the encoding from
>> the family of BOMs. Maybe you call it encoding="sniff".
>
> Good idea, I chose open(filename, encoding="BOM").

On the surface this looks like there is an encoding named "BOM", but looking at your patch I found that the check is still done in TextIOWrapper. IMHO the best approach would be to implement a *real* codec named "BOM" (or "sniff"). This doesn't require *any* changes to the IO library. It could even be developed as a standalone project and published in the Cheeseshop.

To see how something like this can be done, take a look at the UTF-16 codec, which switches to big-endian or little-endian mode depending on the first read/decode call.

Servus,
   Walter
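To illustrate the kind of detection such a codec would perform, here is a minimal sketch of BOM sniffing using only the standard codecs module. The function name and the (encoding, bytes-to-skip) return convention are my own invention, not part of Victor's patch or any real codec; a production "sniff" codec would instead wrap this logic in an incremental decoder, as the UTF-16 codec does.

```python
import codecs

# Candidate BOMs, most specific first: the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE), so UTF-32 must be checked first.
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]


def sniff_encoding(data, default='utf-8'):
    """Guess an encoding from a leading BOM in *data* (bytes).

    Returns (encoding, skip) where *skip* is the number of BOM bytes
    the caller should strip before decoding. For UTF-8 we return the
    'utf-8-sig' codec, which strips the signature itself, so skip is 0.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            skip = 0 if encoding == 'utf-8-sig' else len(bom)
            return encoding, skip
    # No BOM found: fall back to the caller-supplied default.
    return default, 0
```

A caller would read a small prefix of the file, sniff it, then decode the rest with the detected encoding, e.g. `encoding, skip = sniff_encoding(raw); text = raw[skip:].decode(encoding)`.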