Showing content from https://mail.python.org/pipermail/python-dev/2009-April/089192.html below:

[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces

Hrvoje Niksic hrvoje.niksic at avl.com
Wed Apr 29 10:29:32 CEST 2009
Zooko O'Whielacronx wrote:
>> If you switch to iso8859-15 only in the presence of undecodable  
>> UTF-8, then you have the same round-trip problem as the PEP: both  
>> b'\xff' and b'\xc3\xbf' will be converted to u'\u00ff' without a  
>> way to unambiguously recover the original file name.
> 
> Why do you say that?  It seems to work as I expected here:
> 
>  >>> '\xff'.decode('iso-8859-15')
> u'\xff'
>  >>> '\xc3\xbf'.decode('iso-8859-15')
> u'\xc3\xbf'

Here is what I mean by "switch to iso8859-15" only in the presence of 
undecodable UTF-8:

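def file_name_to_unicode(fn, encoding):
    # Try the locale encoding first; fall back to iso-8859-15 only
    # when the bytes are not valid in that encoding.
    try:
        return fn.decode(encoding)
    except UnicodeDecodeError:
        return fn.decode('iso-8859-15')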

Now, assume a UTF-8 locale and try to use it on the provided example 
file names.

 >>> file_name_to_unicode(b'\xff', 'utf-8')
'ÿ'
 >>> file_name_to_unicode(b'\xc3\xbf', 'utf-8')
'ÿ'

That is the ambiguity I was referring to -- two different byte sequences 
result in the same Unicode string.
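
To make the consequence concrete, here is a minimal Python 3 sketch of the 
round trip, assuming a UTF-8 locale and the fallback function above. The 
variable names (names, decoded, recovered) are only illustrative, and 
re-encoding with UTF-8 is my assumption about how one would try to get the 
bytes back:

```python
def file_name_to_unicode(fn, encoding):
    # Same fallback as above: locale encoding first, iso-8859-15 second.
    try:
        return fn.decode(encoding)
    except UnicodeDecodeError:
        return fn.decode('iso-8859-15')

# The two example file names from this thread.
names = [b'\xff', b'\xc3\xbf']
decoded = [file_name_to_unicode(n, 'utf-8') for n in names]

# Both names decode to the same one-character string, U+00FF.
collision = decoded[0] == decoded[1]

# Encoding back (assumed here: with the locale encoding) can yield only
# one byte sequence per string, so at most one original is recovered.
recovered = [d.encode('utf-8') for d in decoded]
lost = [n for n, r in zip(names, recovered) if n != r]
```

After running this, collision is true and lost contains b'\xff': once the 
string has been produced there is nothing left in it to say which of the 
two file names it came from.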