Tim Peters wrote:
>
> [/F, dripping with code]
> > ...
> > Note that the 'u' must be followed by four hexadecimal digits. If
> > fewer digits are given, the sequence is left in the resulting string
> > exactly as given.
>
> Yuck -- don't let probable error pass without comment. "must be" == "must
> be"!

I second that.

> [moving backwards]
> > \uxxxx -- Unicode character with hexadecimal value xxxx. The
> > character is stored using UTF-8 encoding, which means that this
> > sequence can result in up to three encoded characters.
>
> The code is fine, but I've gotten confused about what the intent is now.
> Expanding \uxxxx to its UTF-8 encoding made sense when MAL had UTF-8
> literals, but now he's got Unicode-escaped literals instead -- and you favor
> an internal 2-byte-per-char Unicode storage format. In that combination of
> worlds, is there any use in the *language* (as opposed to in a runtime
> module) for \uxxxx -> UTF-8 conversion?

No, no... :-)

I think it was a simple misunderstanding... \uXXXX is only to be
used within u'' strings and then gets expanded to *one* character
encoded in the internal Python format (which is heading towards UTF-16
without surrogates).

> And MAL, if you're listening, I'm not clear on what a Unicode-escaped
> literal means. When you had UTF-8 literals, the meaning of something like
>
>     u"a\340\341"
>
> was clear, since UTF-8 is defined as a byte stream and UTF-8 string literals
> were just a way of specifying a byte stream. As a Unicode-escaped string, I
> assume the "a" maps to the Unicode "a", but what of the rest? Are the octal
> escapes to be taken as two separate Latin-1 characters (in their role as a
> Unicode subset), or as an especially clumsy way to specify a single 16-bit
> Unicode character? I'm afraid I'd vote for the former. Same issue wrt \x
> escapes.

Good points.
The conversion goes as follows:

· for single characters (and this includes all \XXX sequences except
  \uXXXX), take the ordinal and interpret it as Unicode ordinal

· for \uXXXX sequences, insert the Unicode character with ordinal
  0xXXXX instead

> One other issue: are there "raw" Unicode strings too, as in ur"\u20ac"?
> There probably should be; and while Guido will hate this, a ur string should
> probably *not* leave \uxxxx escapes untouched. Nasties like this are why
> Java defines \uxxxx expansion as occurring in a preprocessing step.

Not sure whether we really need to make this even more complicated...
The \uXXXX strings look ugly, adding a few \\\\ for e.g. REs or
filenames won't hurt much in the context of those \uXXXX monsters :-)

> BTW, the meaning of \uxxxx in a non-Unicode string is now also unclear (or
> isn't \uxxxx allowed in a non-Unicode string? that's what I would do ...).

Right. \uXXXX will only be allowed in u'' strings, not in "normal"
strings.

BTW, if you want to type in UTF-8 strings and have them converted
to Unicode, you can use the standard:

    u = unicode('...string with UTF-8 encoded characters...','utf-8')

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 50 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
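
P.S. A minimal sketch of the two conversion rules above, for anyone who
wants to try them out. It is written for present-day Python 3 (where str
is already Unicode and chr() accepts any Unicode ordinal), not for the
interpreter under discussion. expand_unicode_literal is a hypothetical
helper name, not part of Python or of the proposal, and raising an error
for a \u with fewer than four hex digits is only one reading of the
"must be" point above.

```python
import re

# Hypothetical helper, not the actual tokenizer: it only illustrates the
# two conversion rules from the message. A bare \u (fewer than four hex
# digits following) is matched last so that it can be reported as an error.
_ESCAPE = re.compile(r"\\(u[0-9a-fA-F]{4}|x[0-9a-fA-F]{2}|[0-7]{1,3}|u)")


def expand_unicode_literal(body):
    """Expand escapes in the body of a u'' literal, one character each."""
    def replace(match):
        esc = match.group(1)
        if esc == "u":
            # Assumption: a malformed \u escape is an error rather than
            # being passed through unchanged.
            raise ValueError("\\u must be followed by four hexadecimal digits")
        if esc[0] == "u":
            # \uXXXX: insert the character with ordinal 0xXXXX.
            return chr(int(esc[1:], 16))
        if esc[0] == "x":
            # \xHH: take the ordinal and interpret it as a Unicode ordinal.
            return chr(int(esc[1:], 16))
        # Octal escape: same rule, one character per escape.
        return chr(int(esc, 8))
    return _ESCAPE.sub(replace, body)


# Octal escapes become two separate Latin-1 characters, not one 16-bit one.
text = expand_unicode_literal(r"a\340\341")
print(len(text), [hex(ord(c)) for c in text])

# Python 3 counterpart of unicode('...', 'utf-8'): decode UTF-8 bytes.
euro = b"\xe2\x82\xac".decode("utf-8")
print(euro == expand_unicode_literal(r"\u20ac"))
```

Only \uXXXX, \xHH and octal escapes are handled here; other backslash
sequences are left untouched to keep the sketch short.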