On 6/28/2011 9:43 AM, Victor Stinner wrote:
> In Python 2, open() opens the file in binary mode (e.g. file.readline()
> returns a byte string). codecs.open() opens the file in binary mode by
> default; you have to specify an encoding name to open it in text mode.
>
> In Python 3, open() opens the file in text mode by default. (It only
> opens the file in binary mode if the file mode contains "b".) The
> problem is that open() uses the locale encoding if the encoding is not
> specified, which is the case *by default*. The locale encoding can be:
>
> - UTF-8 on Mac OS X, most Linux distributions
> - ISO-8859-1 on some FreeBSD systems
> - ANSI code page on Windows, e.g. cp1252 (close to ISO-8859-1) in
>   Western Europe, cp932 in Japan, ...
> - ASCII if the locale is manually set to an empty string or to "C", or
>   if the environment is empty, or by default on some systems
> - something different depending on the system and user configuration...
>
> If you develop under Mac OS X or Linux, you may have surprises when you
> run your program on Windows on the first non-ASCII character. You may
> not detect the problem if you only write text in English... until
> someone writes the first letter with a diacritic.
>
> As discussed before on this list, I propose to set the default encoding
> of open() to UTF-8 in Python 3.3, and add a warning in Python 3.2 if
> open() is called without an explicit encoding and if the locale encoding
> is not UTF-8. Using the warning, you will quickly notice the potential
> problem (using Python 3.2.2 and -Werror) on Windows or by using a
> different locale encoding (e.g. using LANG="C").
>
> I expect a lot of warnings from the Python standard library, and as many
> in third party modules and applications. So do you think that it is too
> late to change that in Python 3.3? One argument for changing it directly
> in Python 3.3 is that most users will not notice the change because
> their locale encoding is already UTF-8.
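
To make the portability problem in the quoted text concrete, here is a minimal sketch. It assumes a file written on a UTF-8 system and then read where the default is cp1252 or ASCII. The encodings are passed explicitly so the result does not depend on the machine running it, and the sample string "café" is only an illustration.

```python
import locale
import os
import tempfile

# Encoding that open() falls back to when no encoding argument is given.
print("locale encoding:", locale.getpreferredencoding(False))

text = "caf\u00e9"  # "café": one non-ASCII character is enough

fd, path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
try:
    # Write as a UTF-8 locale would do implicitly.
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)

    # Read back as cp1252, standing in for a Western European Windows default.
    with open(path, encoding="cp1252") as f:
        misread = f.read()

    # Read back as ASCII, standing in for LANG="C": decoding fails outright.
    try:
        with open(path, encoding="ascii") as f:
            f.read()
        ascii_failed = False
    except UnicodeDecodeError:
        ascii_failed = True

    # Read back with the encoding that was used to write.
    with open(path, encoding="utf-8") as f:
        roundtrip = f.read()
finally:
    os.remove(path)

print(repr(misread), ascii_failed, repr(roundtrip))
```

The cp1252 read succeeds silently but yields two wrong characters in place of the "é", which is the kind of surprise that only shows up with the first non-ASCII character.
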
>
> An alternative is to:
> - Python 3.2: use the locale encoding but emit a warning if the locale
>   encoding is not UTF-8
> - Python 3.3: use UTF-8 and emit a warning if the locale encoding is
>   not UTF-8... or maybe always emit a warning?
> - Python 3.3: use UTF-8 (but don't emit warnings anymore)
>
> I don't think that Windows developers even know that they are writing
> files into the ANSI code page. MSDN documentation of
> WideCharToMultiByte() warns developers that the ANSI code page is not
> portable, even across Windows computers:
>
> "The ANSI code pages can be different on different computers, or can be
> changed for a single computer, leading to data corruption. For the most
> consistent results, applications should use Unicode, such as UTF-8 or
> UTF-16, instead of a specific code page, unless legacy standards or data
> formats prevent the use of Unicode. If using Unicode is not possible,
> applications should tag the data stream with the appropriate encoding
> name when protocols allow it. HTML and XML files allow tagging, but text
> files do not."
>
> It will always be possible to use the ANSI code page using
> encoding="mbcs" (only works on Windows), or an explicit code page number
> (e.g. encoding="cp1252").
>
> --
>
> The two other (rejected?) options to improve open() are:
>
> - raise an error if the encoding argument is not set: will break most
>   programs
> - emit a warning if the encoding argument is not set
>
> --
>
> Should I convert this email into a PEP, or is it not required?

I think a PEP is needed.

--
Terry Jan Reedy
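
For anyone who wants to try the proposed Python 3.2 behaviour before anything changes in the core, here is a rough sketch of the check as a wrapper. The names open_checked and should_warn are hypothetical, the choice of UnicodeWarning is an assumption (the proposal does not name a warning category), and it assumes the locale encoding name is one that codecs.lookup() knows.

```python
import codecs
import locale
import warnings


def should_warn(mode, encoding, locale_encoding):
    """Return True for a text-mode open with no encoding under a non-UTF-8 locale."""
    if "b" in mode or encoding is not None:
        return False
    # Normalize spellings such as "UTF-8", "utf8" and "UTF8" to one codec name.
    return codecs.lookup(locale_encoding).name != "utf-8"


def open_checked(file, mode="r", buffering=-1, encoding=None, **kwargs):
    """Hypothetical wrapper around open() that applies the proposed warning."""
    default = locale.getpreferredencoding(False)
    if should_warn(mode, encoding, default):
        warnings.warn(
            "open() called without an explicit encoding; "
            "the locale encoding %r will be used" % default,
            UnicodeWarning,
            stacklevel=2,
        )
    # Behaviour is otherwise unchanged: the locale encoding is still used.
    return open(file, mode, buffering, encoding, **kwargs)
```

Running a program that uses such a wrapper with -W error turns the warning into an exception, which matches the -Werror workflow suggested in the quoted text. Dropping the locale comparison from should_warn() gives the "always warn if the encoding argument is not set" variant instead.
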