+1 from me!

Mark

On 25/10/2011 9:57 AM, Victor Stinner wrote:
> Hi,
>
> I propose to raise Unicode errors if a filename cannot be decoded on Windows,
> instead of creating bogus filenames with question marks. Because this change
> is incompatible with Python 3.2, even if such filenames are unusable and I
> consider the problem as a (Python?) bug, I would like your opinion on such
> change before working on a patch.
>
> --
>
> Windows works internally on Unicode strings since Windows 95 (or something
> like that), but also provides an "ANSI" API using the ANSI code page and byte
> strings for backward compatibility. It was already proposed to drop completely
> the bytes API in our nt (os) module, but it may break the Python backward
> compatibility (and it is difficult to list Python programs using the bytes API
> to access the file system).
>
> The ANSI API uses MultiByteToWideChar (decode) and WideCharToMultiByte
> (encode) functions in the default mode (flags=0): MultiByteToWideChar()
> replaces undecodable bytes by '?' and WideCharToMultiByte() ignores
> unencodable characters (!!!). This behaviour produces invalid filenames (see
> for example the issue #13247) and *the user is unable to detect codec errors*.
>
> In Python 3.2, I changed the MBCS codec to make it strict: it raises a
> UnicodeEncodeError if a character cannot be encoded to the ANSI code page
> (e.g. encode Ł to cp1252) and a UnicodeDecodeError if a character cannot be
> decoded from the ANSI code page (e.g. b'\xff' from cp932).
>
> I propose to reuse our MBCS codec in strict mode (error handler="strict"), to
> notice directly encode/decode errors, with the Windows native (wide character)
> API. It should simplify the source code: replace 2 versions of a function by 1
> version + optional code to decode arguments and/or encode the result.
>
> --
>
> Read also the previous thread:
>
> [Python-Dev] Byte filenames in the posix module on Windows
> Wed Jun 8 00:23:20 CEST 2011
> http://mail.python.org/pipermail/python-dev/2011-June/111831.html
>
> --
>
> FYI I patched again Python MBCS codec: it now handles correctly ignore and
> replace mode (to encode and decode), but now also supports any error handler.
>
> --
>
> We might use the PEP 383 to store undecodable bytes as surrogates
> (U+DC80-U+DCFF). But the situation is the opposite of the situation on UNIX:
> on Windows, the problem is more on encoding (text->bytes) than on decoding
> (bytes->text). On UNIX, problems occur when the system is misconfigured (e.g.
> wrong locale encoding). On Windows, problems occur when your application uses
> the old (ANSI) API, whereas your filesystem is fully Unicode compliant and you
> created Unicode filenames with a program using the new (Windows) API.
>
> Only a few programs are fully Unicode compliant. A lot of programs fail if a
> filename cannot be encoded to the ANSI code page (just 2 examples: Mercurial
> and Visual Studio).
>
> Victor
> _______________________________________________
> Python-Dev mailing list
> Python-Dev at python.org
> http://mail.python.org/mailman/listinfo/python-dev
> Unsubscribe: http://mail.python.org/mailman/options/python-dev/skippy.hammond%40gmail.com
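
For readers who want to see the difference between the strict and the lossy
error handlers discussed in the quoted message, here is a minimal sketch. It
is only an illustration under stated assumptions: it uses Python's portable
cp1252 codec as a stand-in for the Windows ANSI code page (the real "mbcs"
codec exists only on Windows), and the helper names try_encode and try_decode
are hypothetical, not part of any proposed patch.

```python
# Assumption: cp1252 stands in for the Windows ANSI code page so the
# sketch runs on any platform; the "mbcs" codec itself is Windows-only.
ANSI = "cp1252"


def try_encode(text, errors="strict"):
    """Return the encoded bytes, or the exception name if encoding fails."""
    try:
        return text.encode(ANSI, errors)
    except UnicodeEncodeError as exc:
        return type(exc).__name__


def try_decode(data, errors="strict"):
    """Return the decoded text, or the exception name if decoding fails."""
    try:
        return data.decode(ANSI, errors)
    except UnicodeDecodeError as exc:
        return type(exc).__name__


# The name starts with U+0141 (Ł), which cp1252 cannot represent.
name = "\u0141ukasz.txt"

strict_result = try_encode(name)              # the error is visible
replace_result = try_encode(name, "replace")  # Ł silently becomes '?'
ignore_result = try_encode(name, "ignore")    # Ł silently disappears

# Byte 0x81 is unassigned in Python's cp1252 table.
raw = b"caf\x81.txt"
strict_decoded = try_decode(raw)

# PEP 383: keep the undecodable byte as a lone surrogate and restore it later.
escaped = try_decode(raw, "surrogateescape")
round_trip = escaped.encode(ANSI, "surrogateescape")

print(strict_result, replace_result, ignore_result)
print(strict_decoded, ascii(escaped), round_trip)
```

With "replace" or "ignore" the caller gets a filename that no longer matches
the one on disk and has no way to notice; with "strict" the mismatch surfaces
as an exception, which is the behaviour the proposal asks for.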