> > Why is getting Unicode worse than getting MBCS?  #3 looks right to me...
>
> If people do
>
> out = open("names.txt","w")
> for f in os.listdir("."):
>     print >>out, f
>
> then this will print all filenames in mbcs. Under your proposed
> change, it will raise a UnicodeError.

OK, you've convinced me.  I guess the best compromise then is 8-bit
in, MBCS out, and Unicode in, Unicode out.

> > I still don't fully understand MBCS.  I know there's a variable
> > assignment of codes to the upper half of the 8-bit space, based on a
> > user setting.  But is that always a simple mapping to 128 non-ASCII
> > characters, or are there multi-byte codes that expand the total
> > character set to more than 256?
>
> Yes, the "mbcs" might be truly multibyte. Microsoft calls it the "ANSI
> code page", CP_ACP, which varies with the localization. They currently
> use:
>
> code page   region                 encoding style
> 1250        Central Europe         8-bit
> 1251        Cyrillic               8-bit
> 1252        Western Europe         8-bit
> 1253        Greek                  8-bit
> 1254        Turkish                8-bit
> 1255        Hebrew                 8-bit
> 1256        Arabic                 8-bit
> 1257        Baltic                 8-bit
> 1258        Vietnamese             8-bit
>
>  874        Thai                   multi-byte
>  932        Japan                  Shift-JIS, multi-byte
>  936        Simplified Chinese     GB2312, multi-byte
>  949        Korea                  multi-byte
>  950        Traditional Chinese    BIG5, multi-byte
>
> The multi-byte codes fall in two categories: those that use bytes <128
> for multi-byte codes (e.g. 950) and those that don't (e.g. 932); the
> latter ones restrict themselves to bytes >=128 for multi-byte
> characters (I believe this is what the Shift in Shift-JIS tries to
> indicate).

Aha!  So MBCS is not an encoding: it's an indirection for a variety of
encodings.  (Is there a way to find out what the encoding is?)

> > > For readlink, if you trust FileSystemDefaultEncoding, you could return
> > > a Unicode object if you find non-ASCII in the link contents.
> >
> > What is FileSystemDefaultEncoding and when can you trust it?
>
> It's a global variable (really called Py_FileSystemDefaultEncoding),
> introduced by Mark Hammond, and should be set to the encoding that the
> operating system uses to encode file names, on the file system API.
>
> On Windows, this is reliably CP_ACP/"mbcs".

Do you mean that the condition on

    #if defined(HAVE_LANGINFO_H) && defined(CODESET)

is reliably false on Windows?  Otherwise _locale.setlocale() could set
it.

> On Unix, it is the locale's encoding by convention, which is set
> only if setlocale(LC_CTYPE,"") was called. Some Unix users may not
> follow the convention, or may have file names which cannot be
> represented in their locale's encoding.

So as long as they use 8-bit it's not our problem, right.  Another
reason to avoid producing Unicode without a clue that the app expects
Unicode (alas).  (BTW I find a Unicode argument to os.listdir() a
sufficient clue.  IOW os.listdir(u".") should return Unicode.)

> > Wide + Unicode (if non-ASCII) sounds good to me.  The fewer places an
> > app has to deal with MBCS the better, it seems to me.
>
> Ok, I'll update the PEP.

To what?  (It would be bad if I convinced you at the same time you
convinced me of the opposite. :-)

> You may have been under the impression that MBCS is only relevant in
> Far East, so let me stress this point: It applies to all windows
> versions, e.g. a user of a French installation who has a file named
> C:\Docs\Boulot\SéminaireLORIA-jan2002\DemoCORBA (bug #509117)
> currently gets a byte string when listing C:\Docs\Boulot, but will
> get a Unicode string under the modified PEP 277.

No, I was aware of that part.  I guess they should get MBCS on
os.listdir('C:\\Docs\\Boulot') but Unicode on
os.listdir(u'C:\\Docs\\Boulot').

--Guido van Rossum (home page: http://www.python.org/~guido/)
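
A minimal sketch of the "8-bit in, MBCS out, and Unicode in, Unicode out"
rule discussed in this message, for anyone who wants to try it. It
assumes a current Python 3, where bytes stands in for the 8-bit string
and str for the Unicode string; the helper name list_both_ways and the
file name demo.txt are hypothetical and not part of the original
discussion.

```python
import os
import sys
import tempfile


def list_both_ways(directory):
    """List a directory twice: once with a str path, once with a bytes path.

    Hypothetical helper, for illustration only.
    """
    text_names = os.listdir(directory)               # str in -> str out
    byte_names = os.listdir(os.fsencode(directory))  # bytes in -> bytes out
    return text_names, byte_names


with tempfile.TemporaryDirectory() as tmp:
    # An ASCII-only name keeps the demo independent of the file system encoding.
    open(os.path.join(tmp, "demo.txt"), "w").close()
    text_names, byte_names = list_both_ways(tmp)

print(text_names)   # ['demo.txt']
print(byte_names)   # [b'demo.txt']

# The encoding Python uses to convert between the two forms on this system.
print(sys.getfilesystemencoding())
```

The type of the argument is the only "clue" the call needs: the caller
who passes bytes never sees a decoded name, and the caller who passes
text never has to deal with the platform encoding directly.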