Guido van Rossum <guido@python.org> writes:

> Why is getting Unicode worse than getting MBCS?  #3 looks right to me...

If people do

    import os
    out = open("names.txt", "w")
    for f in os.listdir("."):
        print >>out, f

then this will print all filenames in mbcs. Under your proposed
change, it will raise a UnicodeError.

> I still don't fully understand MBCS.  I know there's a variable
> assignment of codes to the upper half of the 8-bit space, based on a
> user setting.  But is that always a simple mapping to 128 non-ASCII
> characters, or are there multi-byte codes that expand the total
> character set to more than 256?

Yes, the "mbcs" might be truly multibyte. Microsoft calls it the "ANSI
code page", CP_ACP, which varies with the localization. They currently
use:

code page   region                encoding style
1250        Central Europe        8-bit
1251        Cyrillic              8-bit
1252        Western Europe        8-bit
1253        Greek                 8-bit
1254        Turkish               8-bit
1255        Hebrew                8-bit
1256        Arabic                8-bit
1257        Baltic                8-bit
1258        Vietnamese            8-bit
874         Thai                  multi-byte
932         Japan                 Shift-JIS, multi-byte
936         Simplified Chinese    GB2312, multi-byte
949         Korea                 multi-byte
950         Traditional Chinese   BIG5, multi-byte

The multi-byte codes fall in two categories: those that use bytes <128
for multi-byte codes (e.g. 950) and those that don't (e.g. 932); the
latter ones restrict themselves to bytes >=128 for multi-byte
characters (I believe this is what the Shift in Shift-JIS tries to
indicate).

> > For readlink, if you trust FileSystemDefaultEncoding, you could return
> > a Unicode object if you find non-ASCII in the link contents.
>
> What is FileSystemDefaultEncoding and when can you trust it?

It's a global variable (really called Py_FileSystemDefaultEncoding),
introduced by Mark Hammond, and should be set to the encoding that the
operating system uses to encode file names, on the file system API. On
Windows, this is reliably CP_ACP/"mbcs". On Unix, it is the locale's
encoding by convention, which is set only if setlocale(LC_CTYPE,"")
was called. Some Unix users may not follow the convention, or may have
file names which cannot be represented in their locale's encoding.

> Wide + Unicode (if non-ASCII) sounds good to me.  The fewer places an
> app has to deal with MBCS the better, it seems to me.

Ok, I'll update the PEP. You may have been under the impression that
MBCS is only relevant in the Far East, so let me stress this point: It
applies to all Windows versions, e.g. a user of a French installation
who has a file named

C:\Docs\Boulot\SéminaireLORIA-jan2002\DemoCORBA

(bug #509117) currently gets a byte string when listing
C:\Docs\Boulot, but will get a Unicode string under the modified PEP
277.

Regards,
Martin
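
A minimal sketch of the 8-bit versus multi-byte distinction and of the
file system encoding discussed above. It assumes Python 3 rather than
the Python 2 of the message, uses the standard cp1252 and cp932 codecs
as stand-ins for two ANSI code pages, and takes
sys.getfilesystemencoding() as the closest Python-level counterpart of
Py_FileSystemDefaultEncoding. The sample names are hypothetical.

```python
import os
import sys

# Hypothetical sample names, chosen only for illustration.
french = "S\u00e9minaire"    # "Séminaire"
japanese = "\u65e5\u672c"    # "日本"

# cp1252 is an 8-bit code page: every character becomes exactly one byte.
cp1252_bytes = french.encode("cp1252")
print(len(french), len(cp1252_bytes))

# cp932 is multi-byte: each of these two characters becomes two bytes.
cp932_bytes = japanese.encode("cp932")
print(len(japanese), len(cp932_bytes))

# A name outside the target code page cannot be encoded at all, which is
# the kind of failure the listdir() example above would run into.
try:
    japanese.encode("cp1252")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)

# Assumption: this is the Python-level view of the file system encoding.
fs_encoding = sys.getfilesystemencoding()
print(fs_encoding)

# Round trip of a plain ASCII name through the file system encoding.
round_trip = os.fsdecode(os.fsencode("names.txt"))
print(round_trip)
```

The value printed for the file system encoding depends on the platform
and locale, so it is not something a program should hard-code.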