On Tue, Jan 25, 2011 at 10:22:41AM +0100, Xavier Morel wrote: > On 2011-01-25, at 04:26 , Toshio Kuratomi wrote: > > > > * If you can pick a set of encodings that are valid (utf-8 for Linux and > > MacOS > > HFS+ uses UTF-16 in NFD (actually in an Apple-specific variant of NFD). Right here you've already broken Python modules on OSX. > Others have been saying that Mac OSX's HFS+ uses UTF-8. But the question is not whether UTF-16 or UTF-8 is used by HFS+. It's whether you can sensibly decide on an encoding from the type of system that is being run on. This could be querying the filesystem or a check on sys.platform or some other method. I don't know what detection the current code does. On Linux there's no defined encoding that will work; file names are just bytes to the Linux kernel so based on people's argument that the convention is and should be that filenames are utf-8 and anything else is a misconfigured system -- python should mandate that its module filenames on Linux are utf-8 rather than using the user's locale settings. > > And as far as I know, Linux software/FS generally use NFC (I've already seen this issue cause trouble) > Linux FS's are bytes with a small blacklist (so you can't use the NULL byte in a filename, for instance). Linux software would be free to use any normal form that they want. If one software used NFC and another used NFD, the FS would record two separate files with two separate filenames. Other programs might or might not display this correctly. Example: <zsh>$ touch cafe <zsh>$ python Python 2.7 (r27:82500, Sep 16 2010, 18:02:00) >>> import os >>> import unicodedata >>> a=u'café' >>> b=unicodedata.normalize('NFC', a) >>> c=unicodedata.normalize('NFD', a) >>> open(b.encode('utf8'), 'w').close() >>> open(c.encode('utf8'), 'w').close() >>> os.listdir(u'.') >>> [u'people-etc-changes.txt', u'cafe\u0301', u'cafe', u'people-etc-changes.sha256sum', u'caf\xe9'] >>> os.listdir('.') >>> ['people-etc-changes.txt', 'cafe\xcc\x81', 'cafe', 'people-etc-changes.sha256sum', 'caf\xc3\xa9'] >>> ^D <zsh>$ ls -al . drwxrwxr-x. 2 badger badger 4096 Jan 25 07:46 . drwxr-xr-x. 17 badger badger 4096 Jan 24 18:27 .. -rw-rw-r--. 1 badger badger 0 Jan 25 07:45 cafe -rw-rw-r--. 1 badger badger 0 Jan 25 07:46 cafe -rw-rw-r--. 1 badger badger 0 Jan 25 07:46 café <zsh>$ ls -al cafe -rw-rw-r--. 1 badger badger 0 Jan 25 07:45 cafe <zsh>$ ls -al cafe? -rw-rw-r--. 1 badger badger 0 Jan 25 07:46 cafe Now in this case, the decomposed form of the filename is being displayed incorrectly and the shell treats the decomposed character as two characters instead of one. However, when you view these files in dolphin (the KDE file manager) you properly see café repeated twice. -Toshio -------------- next part -------------- A non-text attachment was scrubbed... Name: not available Type: application/pgp-signature Size: 198 bytes Desc: not available URL: <http://mail.python.org/pipermail/python-dev/attachments/20110125/9963a847/attachment.pgp>
RetroSearch is an open source project built by @garambo | Open a GitHub Issue
Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo
HTML:
3.2
| Encoding:
UTF-8
| Version:
0.7.4