Tim Peters wrote:
> UTF-8 is a very nice encoding scheme, but is very hard for people "to do" by
> hand -- it breaks apart and rearranges bytes at the bit level, and
> everything other than 7-bit ASCII requires solid strings of "high-bit"
> characters.

unless you're using a UTF-8 aware editor, of course ;-)

(some days, I think we need some way to tell the compiler
what encoding we're using for the source file...)

> This is painful for people to enter manually on both counts --
> and no common reference gives the UTF-8 encoding of glyphs
> directly.  So, as discussed earlier, we should follow Java's lead
> and also introduce a \u escape sequence:
>
>   octet:           hexdigit hexdigit
>   unicodecode:     octet octet
>   unicode_escape:  "\\u" unicodecode
>
> Inside a u'' string, I guess this should expand to the UTF-8 encoding of the
> Unicode character at the unicodecode code position.  For consistency, then,
> it should probably expand the same way inside "regular strings" too.  Unlike
> Java does, I'd rather not give it a meaning outside string literals.

good idea.  and for some reason, patches for this are included
in the unicode distribution (see the attached str2utf.c).

> The other point is a nit:  The vast bulk of UTF-8 encodings encode
> characters in UCS-4 space outside of Unicode.  In good Pythonic fashion,
> those must either be explicitly outlawed, or explicitly defined.  I vote for
> outlawed, in the sense of detected error that raises an exception.  That
> leaves our future options open.

I vote for 'outlaw'.

</F>

/* A small code snippet that translates \uxxxx syntax to UTF-8 text.
   To be cut and pasted into Python/compile.c */

/* Written by Fredrik Lundh, January 1999. */

/* Documentation (for the language reference):

\uxxxx -- Unicode character with hexadecimal value xxxx.  The
character is stored using UTF-8 encoding, which means that this
sequence can result in up to three encoded characters (bytes).

Note that the 'u' must be followed by four hexadecimal digits.  If
fewer digits are given, the sequence is left in the resulting string
exactly as given.  If more digits are given, only the first four are
translated to Unicode, and the remaining digits are left in the
resulting string.

*/

#include <ctype.h>
#include <stdio.h>

/* stand-in for the macro in the Python sources; the cast keeps the
   <ctype.h> calls well defined for bytes >= 0x80 */
#define Py_CHARMASK(ch) ((unsigned char) (ch))

void
convert(const char *s, char *p)
{
    while (*s) {
        if (*s != '\\') {
            *p++ = *s++;
            continue;
        }
        s++;
        switch (*s++) {

/* -------------------------------------------------------------------- */
/* copy this section to the appropriate place in compile.c... */

        case 'u':
            /* \uxxxx => UTF-8 encoded unicode character */
            if (isxdigit(Py_CHARMASK(s[0])) &&
                isxdigit(Py_CHARMASK(s[1])) &&
                isxdigit(Py_CHARMASK(s[2])) &&
                isxdigit(Py_CHARMASK(s[3]))) {
                /* fetch hexadecimal character value */
                unsigned int n, ch = 0;
                for (n = 0; n < 4; n++) {
                    int c = Py_CHARMASK(*s);
                    s++;
                    ch = (ch << 4) & ~0xF;
                    if (isdigit(c))
                        ch += c - '0';
                    else if (islower(c))
                        ch += 10 + c - 'a';
                    else
                        ch += 10 + c - 'A';
                }
                /* store as UTF-8 */
                if (ch < 0x80)
                    *p++ = (char) ch;
                else {
                    if (ch < 0x800) {
                        *p++ = 0xc0 | (ch >> 6);
                        *p++ = 0x80 | (ch & 0x3f);
                    } else {
                        *p++ = 0xe0 | (ch >> 12);
                        *p++ = 0x80 | ((ch >> 6) & 0x3f);
                        *p++ = 0x80 | (ch & 0x3f);
                    }
                }
                break;
            } else
                goto bogus;

/* -------------------------------------------------------------------- */

        default:
        bogus:
            *p++ = '\\';
            if (s[-1] == '\0') {
                /* trailing backslash: don't step past the terminator */
                s--;
                break;
            }
            *p++ = s[-1];
            break;
        }
    }
    *p++ = '\0';
}

int
main(void)
{
    int i;
    unsigned char buffer[100];

    convert("Link\\u00f6ping", (char *) buffer);

    /* print non-printable and high-bit bytes as octal escapes */
    for (i = 0; buffer[i]; i++)
        if (buffer[i] < 0x20 || buffer[i] >= 0x80)
            printf("\\%03o", buffer[i]);
        else
            printf("%c", buffer[i]);

    return 0;
}
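
For readers who want to see the "store as UTF-8" step on its own, here is a minimal sketch of the same bit splitting as a standalone function. The name utf8_encode16 and its interface are made up for illustration and are not part of str2utf.c. As an assumption in the spirit of the 'outlaw' vote above, values beyond 16 bits are rejected by returning 0 rather than encoded. Like the snippet above, it does not treat surrogate values specially.

```c
#include <stddef.h>

/* Hypothetical helper, not part of str2utf.c: encode a 16-bit code
   point as UTF-8 into out[] and return the number of bytes written.
   Returns 0 for values above 0xffff (assumption: such values are
   "outlawed" instead of being encoded). */
static size_t
utf8_encode16(unsigned int ch, unsigned char out[3])
{
    if (ch > 0xffff)
        return 0;
    if (ch < 0x80) {
        /* 7-bit ASCII is stored unchanged */
        out[0] = (unsigned char) ch;
        return 1;
    }
    if (ch < 0x800) {
        /* 110xxxxx 10xxxxxx: top 5 bits, then low 6 bits */
        out[0] = (unsigned char) (0xc0 | (ch >> 6));
        out[1] = (unsigned char) (0x80 | (ch & 0x3f));
        return 2;
    }
    /* 1110xxxx 10xxxxxx 10xxxxxx: top 4 bits, middle 6, low 6 */
    out[0] = (unsigned char) (0xe0 | (ch >> 12));
    out[1] = (unsigned char) (0x80 | ((ch >> 6) & 0x3f));
    out[2] = (unsigned char) (0x80 | (ch & 0x3f));
    return 3;
}
```

For example, utf8_encode16(0x00f6, out) writes the two bytes 0xc3 0xb6, which are the \303\266 that the test program above prints for "Link\u00f6ping".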