On Tue, Feb 9, 2016 at 3:22 AM, Victor Stinner <victor.stinner at gmail.com> wrote:
> 2016-02-09 1:37 GMT+01:00 eryk sun <eryksun at gmail.com>:
>> For example, in codepage 932 (Japanese), it's an error if a lead byte
>> (i.e. 0x81-0x9F, 0xE0-0xFC) is followed by a trailing byte with a
>> value less than 0x40 (note that ASCII 0-9 is 0x30-0x39, so this is not
>> uncommon). In this case the ANSI API substitutes the default character
>> for Japanese, '・' (U+30FB, Katakana middle dot).
>>
>>     >>> locale.getpreferredencoding()
>>     'cp932'
>>     >>> open(b'\xe05', 'w').close()
>>     >>> os.listdir('.')
>>     ['・']
>>     >>> os.listdir(b'.')
>>     [b'\x81E']
>>
>> All invalid sequences get mapped to '・', which roundtrips as
>> b'\x81\x45', so you can't reliably create and open files with
>> arbitrary bytes paths in this locale.
>
> Oh, and I forgot to ask: what is your filesystem? Is it the same
> behaviour for NTFS, FAT32, network shared directories, etc.?

That was tested using NTFS, but the same would apply to FAT32, exFAT,
and UDF since they all use Unicode [1]. CreateFile[A|W] wraps the
NtCreateFile system call. The NT executive is Unicode, so the system
call receives the filename using a Unicode-only OBJECT_ATTRIBUTES [2]
record. I can't say what an arbitrary non-Microsoft filesystem will do
with the U+30FB character when it processes the IRP_MJ_CREATE. I was
only concerned with ANSI<=>Unicode conversion that's implemented in
the ntdll.dll runtime library.

[1]: https://msdn.microsoft.com/en-us/library/ee681827
[2]: https://msdn.microsoft.com/en-us/library/ff557749
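
For anyone who wants to check the byte-level claim without a Japanese-locale Windows box, here is a rough sketch using only Python's own cp932 codec. It does not call the Windows ANSI API. The helper names (is_valid_cp932, simulated_ansi_roundtrip) are made up for illustration, and the substitution step is an assumption that simply mimics the behaviour reported in the quoted session by treating the whole name as a single invalid sequence.

```python
# Default character reported for the Japanese ANSI code page in the post.
DEFAULT_CHAR = "\u30fb"  # Katakana middle dot


def is_valid_cp932(name: bytes) -> bool:
    """Return True if Python's cp932 codec can strictly decode the bytes."""
    try:
        name.decode("cp932")
    except UnicodeDecodeError:
        return False
    return True


def simulated_ansi_roundtrip(name: bytes) -> bytes:
    """Hypothetical stand-in for an ANSI -> Unicode -> ANSI round trip.

    Simplification (an assumption, not the real Windows algorithm): a name
    that fails strict decoding is replaced as a whole by DEFAULT_CHAR.
    """
    if is_valid_cp932(name):
        return name.decode("cp932").encode("cp932")
    return DEFAULT_CHAR.encode("cp932")


if __name__ == "__main__":
    bad = b"\xe05"  # lead byte 0xE0 followed by 0x35, which is below 0x40
    print(is_valid_cp932(bad))
    print(simulated_ansi_roundtrip(bad))
```

The point is only that the bytes that come back differ from the bytes that went in, which is why a bytes path cannot be relied on to round-trip in this locale.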