On Jan 8, 2010, at 4:14 PM, Tres Seaver wrote:
>> I understood this proposal as a general processing guideline, not
>> something the io library should do (but, say, a text editor).
>>
>> FWIW, I'm personally in favor of using the UTF-8 signature. If people
>> consider them crazy talk, that may be because UTF-8 can't possibly have
>> a byte order - hence I call it a signature, not the BOM. As a signature,
>> I don't consider it crazy at all. There is a long tradition of having
>> magic bytes in files (executable files, Postscript, PDF, ... - see
>> /etc/magic). Having a magic byte sequence for plain text to denote the
>> encoding is useful and helps reducing moji-bake. This is the reason it's
>> used on Windows: notepad would normally assume that text is in the ANSI
>> code page, and for compatibility, it can't stop doing that. So the UTF-8
>> signature gives them an exit strategy.
>
> Agreed. Having that marker at the start of the file makes interop with
> other tools *much* easier.

Putting the BOM at the beginning of UTF-8 text files is not a good idea:
it makes interop much *worse* on a unix system, not better. Without the
BOM, most commands do the right thing with UTF-8 text. E.g. to
concatenate two files:

  $ cat file-1 file-2 > file-3

With a BOM at the beginning of each file, it won't work right: the BOM
from file-2 ends up in the middle of file-3. Of course, you could modify
"cat" (and every other stream processing command) to know how to consume
and emit BOMs, and omit the extra one that would show up in the middle
of the stream...but even that can't work; what about:

  $ (cat file-1; cat file-2) > file-3

Should the shell now know that when you run multiple commands, it should
eat the BOM emitted from the second command?

Basically, using a BOM in a UTF-8 file is just not a good idea: it
completely ruins interop with every standard unix tool.

This is not to say that Python shouldn't have a way to read a file with
a UTF-8 BOM: it just shouldn't encourage you to *write* such files.

James
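To make the `cat` problem concrete, here is a minimal Python sketch. The file contents are made up for illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec are the only standard-library names it relies on. It joins two byte strings that each start with the UTF-8 signature, the way `cat file-1 file-2` would, and then decodes the result:

```python
import codecs

BOM = codecs.BOM_UTF8  # the three bytes EF BB BF

# Hypothetical contents of file-1 and file-2, each written with a signature.
file_1 = BOM + "first\n".encode("utf-8")
file_2 = BOM + "second\n".encode("utf-8")

# Byte-for-byte concatenation, which is all that `cat` does.
file_3 = file_1 + file_2

# "utf-8-sig" strips a signature only at the very start of the data,
# so the signature that came from file-2 survives as U+FEFF mid-text.
text = file_3.decode("utf-8-sig")
stray_count = text.count("\ufeff")

# The same concatenation without signatures decodes cleanly.
clean = ("first\n".encode("utf-8") + "second\n".encode("utf-8")).decode("utf-8")

print(repr(text))
print(stray_count)
print(repr(clean))
```

Reading with `utf-8-sig` is presumably the kind of "way to read a file with a UTF-8 BOM" meant above: it tolerates a leading signature on input, but it cannot repair one that a stream tool has already placed in the middle of the data.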