On 16/09/2010, Guido van Rossum <guido at python.org> wrote:
> On Thu, Sep 16, 2010 at 11:16 AM, Toshio Kuratomi <a.badger at gmail.com> wrote:
>> You were talking about encodings that were supersets of 7-bit ASCII.
>> I think Martin was demonstrating a byte string that was a superset of
>> 7-bit ASCII being fed to a stdlib function which went wrong.
>
> Whoops, sorry. I don't have access to Windows so I can't reproduce
> this though. I also don't understand it. What is the Unicode codepoint
> for that 十 character? What is sys.getfilesystemencoding()? What is the
> value of "C:\\十".encode(sys.getfilesystemencoding())?

My fault, should have been clearer. I was trying to demonstrate that
there's a difference between the unix-friendly encodings like UTF-8 and
the EUC codecs which only use high-bit characters for non-ascii text,
and the ISO-2022 codecs and Shift JIS.

In the example I gave, 十 encodes in CP932 as '\x8f\\', and the function
gets confused by the second byte. Obviously the right answer there is
just to use unicode, rather than write a function that works with weird
multibyte codecs.

Martin
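
For anyone who wants to see the effect without a Windows box, here is a minimal sketch in Python 3 syntax. It is not the stdlib function from the original example: naive_split_bytes is a hypothetical stand-in for any byte-oriented code that treats 0x5C as a path separator without knowing the codec. The codec names cp932, utf-8 and euc_jp are the standard Python ones.

```python
# Illustrative sketch only. naive_split_bytes is a hypothetical helper,
# not the stdlib function discussed in the thread.
path = "C:\\十"  # C, colon, backslash, then U+5341


def naive_split_bytes(raw: bytes) -> list:
    """Split an encoded path on the ASCII backslash byte, ignoring the codec."""
    return raw.split(b"\\")


cp932_bytes = path.encode("cp932")
utf8_bytes = path.encode("utf-8")
euc_bytes = path.encode("euc_jp")

# In CP932 the trail byte of 十 is 0x5C, the same value as ASCII backslash,
# so the byte-level split cuts the character in half.
print(cp932_bytes)
print(naive_split_bytes(cp932_bytes))

# UTF-8 and EUC-JP only use bytes >= 0x80 for non-ASCII characters,
# so the same byte-level split leaves the character intact.
print(naive_split_bytes(utf8_bytes))
print(naive_split_bytes(euc_bytes))

# Splitting the text itself is unaffected by the choice of codec.
print(path.split("\\"))
```

The CP932 case yields a stray lead byte and an empty trailing component, while splitting the unicode string gives the two expected parts.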