[MAL]
> > > u"..." currently interprets the characters it finds as Latin-1
> > > (this is by design, since the first 256 Unicode ordinals map to
> > > the Latin-1 characters).

[GvR]
> > Nice, except that now we seem to be ambiguous about the source
> > character encoding: it's Latin-1 for Unicode strings and UTF-8 for
> > 8-bit strings...!

[MAL]
> Noo... there is no definition for non-ASCII 8-bit strings in
> Python source code using the ordinal range 127-255. If you were
> to define Latin-1 as source code encoding, then we would have
> to change auto-coercion to make a Latin-1 assumption instead, but...
> I see the picture: people are getting pretty confused about what
> is going on.
>
> If you write u"xyz" then the ordinals of those characters are
> taken and stored directly as Unicode characters. If you live
> in a Latin-1 world, then you happen to be lucky: the Unicode
> characters match your input. If not, some totally different
> characters are likely to show if the string were written
> to a file and displayed using a Unicode aware editor.
>
> The same will happen to your normal 8-bit string literals.
> Nothing unusual so far... if you use Latin-1 strings and
> write them to a file, you get Latin-1. If you happen to
> program on DOS, you'll get the DOS ANSI encoding for the
> German umlauts.
>
> Now the key point where all this started was that
> u'ä' in 'äöü' will raise an error due to 'äöü' being
> *interpreted* as UTF-8 -- this doesn't mean that 'äöü'
> will be interpreted as UTF-8 elsewhere in your application.
>
> The UTF-8 assumption had to be made in order to get the two
> worlds to interoperate. We could have just as well chosen
> Latin-1, but then people currently using say a Russian
> encoding would get upset for the same reason.
>
> One way or another somebody is not going to like whatever
> we choose, I'm afraid... the simplest solution is to use
> Unicode for all strings which contain non-ASCII characters
> and then call .encode() as necessary.

I have a different view on this (except that I agree that it's pretty
confusing :-).

In my definition of a "source character encoding", string literals,
whether Unicode or 8-bit strings, are translated from the source
encoding to the corresponding run-time values.  If I had a C compiler
that read its source in EBCDIC but cross-compiled to a machine that
used ASCII, I would expect that 'a' in the source would have the
integer value 97 (ASCII 'a'), regardless of the EBCDIC value for 'a'.

If I type a non-ASCII Latin-1 character in a Unicode literal, it
generates the corresponding Unicode character.  This means to me that
the source character encoding is Latin-1.  But when I type the same
character in an 8-bit character literal, that literal is interpreted
as UTF-8 (e.g. when converting to Unicode using the default
conversions).  Thus, even though you can do whatever you want with
8-bit literals in your program, the most defensible view is that they
are UTF-8 encoded.

I would be much happier if all source code was encoded in the same
encoding, because otherwise there's no good way to view such code in
a general Unicode-aware text viewer!  My preference would be to
always use UTF-8.  This would mean no change for 8-bit literals, but
a big change for Unicode literals...  And a break with everyone who's
currently typing Latin-1 source code and using strings as Latin-1.
(Or Latin-7, or whatever.)

My next preference would be a pragma to define the source encoding,
but that's a 1.7 issue.  Maybe the whole thing is... :-(
--Guido van Rossum (home page: http://www.python.org/~guido/)
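A minimal sketch of the behaviour discussed above, assuming the
Python 2.0-era semantics described in the message (Unicode literals
take the Latin-1/Unicode ordinals of the source characters; 8-bit
strings are pushed through the interpreter's default codec when mixed
with Unicode).  The byte escapes, variable names, and file name are
illustrative only, and the code is Python 2, not Python 3:

    # Typing 'ä' (byte 0xE4 under Latin-1) inside u"..." stores the
    # ordinal directly, giving U+00E4 -- the Latin-1/Unicode
    # correspondence MAL describes.
    u_a = u"\xe4"                 # u'ä'

    # An 8-bit literal just keeps the raw bytes; here, 'äöü' as Latin-1.
    s = "\xe4\xf6\xfc"            # 'äöü'

    # Mixing the two coerces the 8-bit string through the default
    # encoding (UTF-8 in the proposal under discussion; released
    # Python 2.0 ended up using ASCII, which rejects these bytes for
    # the same reason).  "\xe4\xf6\xfc" is not valid UTF-8, hence the
    # error in the u'ä' in 'äöü' example.
    try:
        u_a in s
    except UnicodeError:
        print "8-bit literal could not be decoded with the default encoding"

    # MAL's suggested practice: keep non-ASCII text in Unicode strings
    # and encode explicitly at the boundaries.
    f = open("umlauts.txt", "wb")
    f.write(u"\xe4\xf6\xfc".encode("latin-1"))   # or .encode("utf-8")
    f.close()

Spelling the characters as \x escapes keeps the sketch independent of
the source-encoding question being argued here; typing the umlauts
literally is exactly the case where the Latin-1 versus UTF-8 ambiguity
bites.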