Zooko O'Whielacronx wrote:
> Following-up to my own post to correct a major error:
>
> On Thu, Apr 30, 2009 at 11:44 PM, Zooko O'Whielacronx <zookog at gmail.com> wrote:
>> Folks:
>>
>> My use case (Tahoe-LAFS [1]) requires that I am *able* to read arbitrary
>> binary names from the filesystem and store them so that I can regenerate
>> the same byte string later, but it also requires that I *know* whether
>> what I got was a valid string in the expected encoding (which might be
>> utf-8) or whether it was not and I need to fall back to storing the
>> bytes.
>
> Okay, I am wrong about this. Having a flag to remember whether I had to
> fall back to the utf-8b trick is one method to implement my requirement,
> but my actual requirement is this:
>
> Requirement: either the unicode string or the bytes are faithfully
> transmitted from one system to another.
>
> That is: if you read a filename from the filesystem, and transmit that
> filename to another system and use it, then there are two cases:
>
> Requirement 1: the byte string was valid in the encoding of source
> system, in which case the unicode name is faithfully transmitted
> (i.e. the bytes that finally land on the target system are the result of
> sourcebytes.decode(source_sys_encoding).encode(target_sys_encoding).
>
> Requirement 2: the byte string was not valid in the encoding of source
> system, in which case the bytes are faithfully transmitted (i.e. the
> bytes that finally land on the target system are the same as the bytes
> that originated in the source system).
>
> Now I finally understand how fiendishly clever MvL's PEP 383
> generalization of Markus Kuhn's utf-8b trick is! The only thing
> necessary to achieve both of those requirements above is that the
> 'python-escape' error handler is used on the target system .encode() as
> well as on the source system .decode()!
>
> Well, I'm going to have to let this sink in and maybe write some code to
> see if I really understand it.
>
> But if this is right, then I can do away with some of the mechanism that
> I've built up, and instead:
>
> Backport PEP 383 to Python 2.
>
> And, document the PEP 383 trick in some generic, widely respected format
> such as an Internet Draft so that I can explain to other users of the
> Tahoe data (many of whom use other languages than Python) what they have
> to do if they find invalid utf-8 in the data. Oh good, I just realized
> that Tahoe emits only utf-8, so all I have to do is point them to the
> utf-8b documents (such as they are) and explain that to read filenames
> produced by Tahoe they have to implement utf-8b. That's really good
> that they don't have to implement MvL's generalization of that trick to
> other encodings, since utf-8b is already understood by some folks.
>
> Okay, I find it surprisingly easy to make subtle errors in this encoding
> stuff, so please let me know if you spot one. Is it true that
> srcbytes.encode(srcencoding, 'python-escape').decode('utf-8',
> 'python-escape') will always produce srcbytes ? That is my Requirement
> 2.
>
No, but srcbytes.decode('utf-8', 'python-escape').encode('utf-8',
'python-escape') == srcbytes. Bytes are decoded first and then encoded,
and the encodings on both ends need to be the same for the bytes to come
back unchanged.

For example:

>>> b'\x80'.decode('windows-1252')
u'\u20ac'
>>> u'\u20ac'.encode('utf-8')
'\xe2\x82\xac'

Currently:

>>> b'\x80'.decode('utf-8')
Traceback (most recent call last):
  File "<pyshell#7>", line 1, in <module>
    b'\x80'.decode('utf-8')
  File "C:\Python26\lib\encodings\utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: unexpected code byte

But under this PEP:

>>> b'\x80'.decode('utf-8', 'python-escape')
u'\udc80'
>>> u'\udc80'.encode('utf-8', 'python-escape')
'\x80'
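
The two requirements discussed in this thread can be checked with a short script. This is only a sketch under stated assumptions: it targets Python 3.1 or later, where the PEP 383 error handler was released under the name 'surrogateescape' ('python-escape' was the name used in the draft quoted here), and the `transfer` helper is a hypothetical name made up for illustration, not part of Python or Tahoe.

```python
# Sketch only: assumes Python 3.1+, where the PEP 383 handler is named
# 'surrogateescape' ('python-escape' was the name in the draft).

def transfer(src_bytes, src_encoding, dst_encoding):
    """Hypothetical helper, not part of any library.

    Decode on the source system and encode on the target system,
    using the escape handler on both sides.
    """
    text = src_bytes.decode(src_encoding, "surrogateescape")
    return text.encode(dst_encoding, "surrogateescape")


# Requirement 1: the bytes are valid in the source encoding, so the
# *text* survives and the bytes change to the target encoding.
assert transfer(b"\x80", "windows-1252", "utf-8") == b"\xe2\x82\xac"

# Requirement 2: the bytes are not valid UTF-8, so each offending byte
# becomes a lone surrogate (0x80 -> U+DC80) and is restored on encode.
assert b"\x80".decode("utf-8", "surrogateescape") == "\udc80"
assert transfer(b"\x80", "utf-8", "utf-8") == b"\x80"
assert transfer(b"caf\xe9", "utf-8", "utf-8") == b"caf\xe9"
```

The first assertion is also why the answer to the quoted question is "No": when the byte string is valid in a source encoding other than UTF-8, the text is preserved and the bytes on the target side differ from the source bytes.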