> > The \u escape takes up to 4 bytes
>
> Not in Java: it requires exactly 4 hex characters after == exactly 2 bytes,
> and it's an error if it's followed by fewer than 4 hex characters. That's a
> good rule (simple!), while ANSI C's is too clumsy to live with if people
> want to take Unicode seriously.
>
> So what does it mean for a Unicode escape to appear in a non-L string?

my suggestion is to store it as UTF-8; see the patches
included in the unicode package for details.

this also means that a u-string literal (L-string, whatever)
could be stored as an 8-bit string internally. and that the
following two are equivalent:

    string = u"foo"
    string = unicode("foo")

also note that:

    unicode(str(u"whatever")) == u"whatever"

...

on the other hand, this means that we have at least four
major "arrays of bytes or characters" thingies mapped on
two data types:

the old string type is used for:

-- plain old 8-bit strings (ascii, iso-latin-1, whatever)
-- byte buffers containing arbitrary data
-- unicode strings stored as 8-bit characters, using
   the UTF-8 encoding.

and the unicode string type is used for:

-- unicode strings stored as 16-bit characters

is this reasonable?

...

yet another question is how to deal with source code. is a
python 1.6 source file written in ASCII, ISO Latin 1, or
UTF-8?

speaking from a non-us standpoint, it would be really cool
if you could write Python sources in UTF-8...

</F>
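
the round trip claimed above (unicode text stored as UTF-8 in an 8-bit string, then converted back without loss) can be sketched as follows. this is an illustration in present-day Python 3, not the 1.6-era unicode()/str() API the message talks about, and utf8_round_trip is a hypothetical helper name made up for the example.

```python
# Sketch in present-day Python 3, not the Python 1.6 API discussed above.
# utf8_round_trip is a hypothetical helper name used only for illustration.

def utf8_round_trip(text):
    """Encode text as UTF-8 bytes, then decode those bytes back to text."""
    encoded = text.encode("utf-8")      # the "8-bit string" representation
    decoded = encoded.decode("utf-8")   # back to a Unicode string
    return encoded, decoded


if __name__ == "__main__":
    # \u is followed by exactly four hex digits, here U+00E9.
    sample = "caf\u00e9"
    encoded, decoded = utf8_round_trip(sample)
    # The character count and the byte count differ for non-ASCII text.
    print(len(sample), len(encoded))
    # The round trip gives back the original text.
    print(decoded == sample)
```

for pure ASCII input the encoded bytes match the original characters one for one, which is why a literal like "foo" looks the same in either representation.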