At 11:31 PM -0400 01-05-2000, Guido van Rossum wrote:
>Here's one usage scenario.
>
>A Japanese user is reading lines from a file encoded in ISO-2022-JP.
>The readline() method returns 8-bit strings in that encoding (the file
>object doesn't do any decoding).  She realizes that she wants to do
>some character-level processing on the file so she decides to convert
>the strings to Unicode.
>
>I believe that whether the default encoding is UTF-8 or Latin-1
>doesn't matter for here -- both are wrong, she needs to write explicit
>unicode(line, "iso-2022-jp") code anyway.  I would argue that UTF-8 is
>"better", because interpreting ISO-2022-JP data as UTF-8 will most
>likely give an exception (when a \300 range byte isn't followed by a
>\200 range byte) -- while interpreting it as Latin-1 will silently do
>the wrong thing.  (An explicit error is always better than silent
>failure.)

But then it's even better to *always* raise an exception, since it's
entirely possible a string contains valid utf-8 while not *being* utf-8.
I really think the exception argument is moot, since there can *always*
be situations that will pass silently. Encoding issues are silent by
nature -- eg. there's no way any system can tell that interpreting
MacRoman data as Latin-1 is wrong, maybe even fatal -- the user will
just have to deal with it.

You can argue what you want, but *any* multi-byte encoding stored in an
8-bit string is a buffer, not a string, for all the reasons Fredrik and
Paul have thrown at you, and right they are. Choosing such an encoding
as a default conversion to Unicode makes no sense at all.

Recap of the main arguments:

pro UTF-8: always reversible when going from Unicode to 8-bit
con UTF-8: not a string: confusing semantics

pro Latin-1: simpler semantics
con Latin-1: non-reversible, western-centric

Given the fact that very often *both* will be wrong, I'd go for the
simpler semantics.

Just
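
For illustration only, here is a minimal sketch of the two "passes silently" cases mentioned above. It assumes modern Python 3 (bytes and str with .decode()/.encode()), not the unicode() builtin under discussion in this thread, and the variable names are made up for the example.

```python
# Assumption: Python 3 bytes/str API, not the 2000-era unicode() builtin.

# Case 1: bytes that are perfectly good Latin-1 text can also happen to be
# valid UTF-8, so a UTF-8 decode raises nothing and just yields other text.
data = b"\xc3\xa9"
as_latin1 = data.decode("latin-1")   # two characters: U+00C3, U+00A9
as_utf8 = data.decode("utf-8")       # one character: U+00E9, no exception

# Case 2: MacRoman data read as Latin-1. Latin-1 maps every byte to a
# code point, so the decode can never fail, whatever the data really was.
mac_bytes = "\u00e9".encode("mac_roman")   # U+00E9 is the byte 0x8E in MacRoman
misread = mac_bytes.decode("latin-1")      # U+008E, a control character

# The "non-reversible" point for Latin-1: text outside its range cannot be
# encoded back to 8-bit at all, while UTF-8 round-trips any str.
text = "\u65e5\u672c\u8a9e"
try:
    text.encode("latin-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
utf8_roundtrip = text.encode("utf-8").decode("utf-8") == text
```

Neither decode in the first two cases raises, which is the sense in which the wrong interpretation goes unnoticed by the system.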