Zooko O'Whielacronx wrote:
>> If you switch to iso8859-15 only in the presence of undecodable
>> UTF-8, then you have the same round-trip problem as the PEP: both
>> b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a
>> way to unambiguously recover the original file name.
>
> Why do you say that?  It seems to work as I expected here:
>
> >>> '\xff'.decode('iso-8859-15')
> u'\xff'
> >>> '\xc3\xbf'.decode('iso-8859-15')
> u'\xc3\xbf'

Here is what I mean by "switch to iso8859-15" only in the presence of
undecodable UTF-8:

def file_name_to_unicode(fn, encoding):
    try:
        return fn.decode(encoding)
    except UnicodeDecodeError:
        return fn.decode('iso-8859-15')

Now, assume a UTF-8 locale and try to use it on the provided example
file names.

>>> file_name_to_unicode(b'\xff', 'utf-8')
'ÿ'
>>> file_name_to_unicode(b'\xc3\xbf', 'utf-8')
'ÿ'

That is the ambiguity I was referring to -- two different byte
sequences result in the same unicode string.
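
To make the round-trip problem concrete, here is a minimal sketch that
restates the fallback function above for Python 3 and then tries to go
back from the decoded strings to bytes. The variable names (names,
decoded, utf8_round_trip, latin_round_trip) are mine, introduced only
for illustration.

```python
def file_name_to_unicode(fn: bytes, encoding: str) -> str:
    """Decode fn with `encoding`; fall back to ISO-8859-15 if that fails."""
    try:
        return fn.decode(encoding)
    except UnicodeDecodeError:
        return fn.decode("iso-8859-15")


# The two example file names: a lone 0xFF byte (invalid UTF-8) and the
# valid UTF-8 encoding of U+00FF.
names = [b"\xff", b"\xc3\xbf"]

# Assume a UTF-8 locale, as in the example above.
decoded = [file_name_to_unicode(n, "utf-8") for n in names]

# Both names decode to the same one-character string.
collision = decoded[0] == decoded[1]

# Encoding back with either codec yields a single byte string for both
# inputs, so at most one of the two original names can be recovered.
utf8_round_trip = [d.encode("utf-8") for d in decoded]
latin_round_trip = [d.encode("iso-8859-15") for d in decoded]

print(decoded, collision)
print(utf8_round_trip, latin_round_trip)
```

Whichever codec is chosen for the reverse direction, one of the two
original byte sequences is lost, because the decoded string carries no
record of which branch of the try/except produced it.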