Tim Peters wrote:
>
> > Marc-Andre Lemburg has a proposal for work that I'm asking him to do
> > (under pressure from HP who want Python i18n badly and are willing to
> > pay!): http://starship.skyport.net/~lemburg/unicode-proposal.txt
>
> I can't make time for a close review now.  Just one thing that hit my eye
> early:
>
>     Python should provide a built-in constructor for Unicode strings
>     which is available through __builtins__:
>
>       u = unicode(<encoded Python string>[,<encoding name>=
>                    <default encoding>])
>
>       u = u'<utf-8 encoded Python string>'
>
> Two points on the Unicode literals (u'abc'):
>
> UTF-8 is a very nice encoding scheme, but is very hard for people "to do"
> by hand -- it breaks apart and rearranges bytes at the bit level, and
> everything other than 7-bit ASCII requires solid strings of "high-bit"
> characters.  This is painful for people to enter manually on both
> counts -- and no common reference gives the UTF-8 encoding of glyphs
> directly.  So, as discussed earlier, we should follow Java's lead and
> also introduce a \u escape sequence:
>
>     octet:          hexdigit hexdigit
>     unicodecode:    octet octet
>     unicode_escape: "\\u" unicodecode
>
> Inside a u'' string, I guess this should expand to the UTF-8 encoding of
> the Unicode character at the unicodecode code position.  For consistency,
> then, it should probably expand the same way inside "regular strings" too.
> Unlike Java does, I'd rather not give it a meaning outside string literals.

It would be more consistent to use the Unicode ordinal (instead of
interpreting the number as its UTF-8 encoding), e.g. \u03C0 for Pi. The
codes are easy to look up in the standard's UnicodeData.txt file, or in
the Unicode book for that matter.

> The other point is a nit:  The vast bulk of UTF-8 encodings encode
> characters in UCS-4 space outside of Unicode.  In good Pythonic fashion,
> those must either be explicitly outlawed, or explicitly defined.  I vote
> for outlawed, in the sense of detected error that raises an exception.
> That leaves our future options open.

See my other post for a discussion of UCS-4 vs. UTF-16 vs. UCS-2.

Perhaps we could add a flag to Unicode objects stating whether the
characters can be treated as UCS-4 limited to the lower 16 bits (UCS-4
and UTF-16 are the same in most ranges). This flag could then be used
to choose optimized algorithms for scanning the strings. Fredrik's
implementation currently uses UCS-2, BTW.

> BTW, is ord(unicode_char) defined?  And as what?  And does ord have an
> inverse in the Unicode world?  Both seem essential.

Good points. How about:

    uniord(u[:1]) --> Unicode ordinal number (32-bit)

    unichr(i)     --> Unicode object for character i (provided it is
                      32-bit); ValueError otherwise

They are inverses of each other, but note that Unicode allows private
encodings too, which will of course not necessarily make it across
platforms or even from one PC to the next (see Andy Robinson's
interesting case study).

I've uploaded a new version of the proposal (0.3) to the URL:

    http://starship.skyport.net/~lemburg/unicode-proposal.txt

Thanks,
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000:                                                    51 days left
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
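
As a quick sketch of the intended round trip (the names follow the
proposal above; the bodies assume chr()/ord() built-ins that already
handle full Unicode ordinals, and the range check and error behaviour
are illustrative assumptions, not the actual implementation):

    def uniord(u):
        # Unicode ordinal of the first character of u; error on empty input.
        if not u:
            raise ValueError("uniord() expects a non-empty Unicode string")
        return ord(u[:1])

    def unichr(i):
        # One-character Unicode string for ordinal i.  The proposal speaks
        # of 32-bit ordinals; chr() itself stops at the Unicode ceiling of
        # 0x10FFFF, so the sketch checks against that.
        if not 0 <= i <= 0x10FFFF:
            raise ValueError("ordinal out of range: %r" % (i,))
        return chr(i)

    # \u03C0 read as an ordinal is GREEK SMALL LETTER PI (0x03C0).
    pi = u'\u03C0'
    assert uniord(pi) == 0x03C0
    assert unichr(0x03C0) == pi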