Isaac Morland, 26.08.2011 04:28:
> On Thu, 25 Aug 2011, Guido van Rossum wrote:
>> I'm not sure what should happen with UTF-8 when it (in flagrant
>> violation of the standard, I presume) contains two separately-encoded
>> surrogates forming a valid surrogate pair; probably whatever the UTF-8
>> codec does on a wide build today should be good enough. Similarly for
>> encoding to UTF-8 on a wide build if one managed to create a string
>> containing a surrogate pair. Basically, I'm for a
>> garbage-in-garbage-out approach (with separate library functions to
>> detect garbage if the app is worried about it).
>
> If it's called UTF-8, there is no decision to be taken as to decoder
> behaviour - any byte sequence not permitted by the Unicode standard must
> result in an error (although, of course, *how* the error is to be reported
> could legitimately be the subject of endless discussion). There are
> security implications to violating the standard so this isn't just
> legalistic purity.
>
> Hmmm, doesn't look good:
>
> Python 2.6.1 (r261:67515, Jun 24 2010, 21:47:49)
> [GCC 4.2.1 (Apple Inc. build 5646)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> '\xed\xb0\x80'.decode ('utf-8')
> u'\udc00'
> >>>
>
> Incorrect! Although this is a narrow build - I can't say what the wide
> build would do.

Works the same for me in a wide Py2.7 build, but gives me this in Py3:

Python 3.1.2 (r312:79147, Sep 27 2010, 09:57:50)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> b'\xed\xb0\x80'.decode ('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-2: illegal encoding

Same for current Py3.3 and the PEP393 build (although both have a better
exception message now: "UnicodeDecodeError: 'utf8' codec can't decode
bytes in position 0-1: invalid continuation byte").

Stefan
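
For anyone who wants to reproduce the Py3 behaviour, here is a minimal sketch, assuming a Python 3 interpreter (3.1 or later, where the `surrogatepass` error handler exists). The helper name `try_decode` is made up for illustration and is not part of any library; the exact wording of the exception message differs between versions, so the sketch only inspects the exception type and start offset.

```python
# The three bytes below are what a naive encoder would produce for the
# lone low surrogate U+DC00. They are not valid UTF-8.
data = b'\xed\xb0\x80'


def try_decode(raw, errors='strict'):
    """Hypothetical helper: return the decoded string, or the exception raised."""
    try:
        return raw.decode('utf-8', errors)
    except UnicodeDecodeError as exc:
        return exc


# Strict decoding (the default) rejects the sequence.
strict_result = try_decode(data)
print(type(strict_result).__name__, 'at offset', strict_result.start)

# Garbage-in-garbage-out is still available, but only when asked for
# explicitly via the 'surrogatepass' error handler.
lenient_result = try_decode(data, 'surrogatepass')
print(ascii(lenient_result))

# The same handler lets the lone surrogate round-trip back to bytes.
round_trip = lenient_result.encode('utf-8', 'surrogatepass')
print(round_trip == data)
```

So the strict codec is the default, and the lenient behaviour has to be requested explicitly rather than being what the plain 'utf-8' codec does.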