> Marc-Andre Lemburg has a proposal for work that I'm asking him to do
> (under pressure from HP who want Python i18n badly and are willing to
> pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt

I can't make time for a close review now. Just one thing that hit my eye
early:

    Python should provide a built-in constructor for Unicode strings
    which is available through __builtins__:

      u = unicode(<encoded Python string>[,<encoding name>=
                  <default encoding>])

      u = u'<utf-8 encoded Python string>'

Two points on the Unicode literals (u'abc'):

UTF-8 is a very nice encoding scheme, but is very hard for people "to do"
by hand -- it breaks apart and rearranges bytes at the bit level, and
everything other than 7-bit ASCII requires solid strings of "high-bit"
characters. This is painful for people to enter manually on both counts --
and no common reference gives the UTF-8 encoding of glyphs directly. So, as
discussed earlier, we should follow Java's lead and also introduce a \u
escape sequence:

    octet:           hexdigit hexdigit
    unicodecode:     octet octet
    unicode_escape:  "\\u" unicodecode

Inside a u'' string, I guess this should expand to the UTF-8 encoding of the
Unicode character at the unicodecode code position. For consistency, then,
it should probably expand the same way inside "regular strings" too. Unlike
Java does, I'd rather not give it a meaning outside string literals.

The other point is a nit: The vast bulk of UTF-8 encodings encode
characters in UCS-4 space outside of Unicode. In good Pythonic fashion,
those must either be explicitly outlawed, or explicitly defined. I vote for
outlawed, in the sense of detected error that raises an exception. That
leaves our future options open.

BTW, is ord(unicode_char) defined? And as what? And does ord have an
inverse in the Unicode world? Both seem essential.

international-in-spite-of-himself-ly y'rs - tim
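For readers following along, here is a minimal sketch of the \u expansion described above, written against present-day Python 3 rather than anything in the proposal itself. The helper names `expand_u_escapes` and `decode_strict` are hypothetical, the four-hex-digit pattern is taken from the grammar in the message, and the U+10FFFF upper bound is the modern Unicode limit, which is an assumption not stated in the message. It does not handle an escaped backslash before the `u`, so treat it as an illustration only.

```python
import re

# Grammar from the message: unicode_escape is "\u" followed by two octets,
# i.e. exactly four hex digits.
_U_ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def expand_u_escapes(literal_body: str) -> bytes:
    # Hypothetical helper: replace each four-digit escape with the character
    # at that code position, then return the UTF-8 encoding of the result.
    def _to_char(match):
        return chr(int(match.group(1), 16))

    return _U_ESCAPE.sub(_to_char, literal_body).encode("utf-8")


def decode_strict(data: bytes) -> str:
    # Hypothetical helper: Python 3's UTF-8 codec already treats sequences
    # that would encode values beyond U+10FFFF as a detected error and
    # raises UnicodeDecodeError.
    return data.decode("utf-8")


# ord() and chr() act as inverses over code points in Python 3.
euro = chr(0x20AC)
assert ord(euro) == 0x20AC

print(expand_u_escapes(r"caf\u00e9"))
```

In this sketch the "outlawed" case surfaces as an exception: passing a five-byte sequence such as `b"\xf8\x88\x80\x80\x80"` to `decode_strict` raises `UnicodeDecodeError` instead of yielding a character.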