[me]
> > To summarize, here's the "character encoding guidelines" for
> > Python 2.0:
> >
> > In Unicode context, 8-bit strings contain ASCII. Characters
> > in the 0x80-0xFF range are invalid. 16-bit strings contain
> > UCS-2. Characters in the 0xD800-0xDFFF range are invalid.
> >
> > If you want to use any other encoding, use the codecs
> > provided by the Unicode subsystem. If you need to use Unicode
> > characters that cannot be represented as UCS-2, you cannot
> > use Python 2.0's Unicode subsystem.
> >
> > Anything else is just a hack.

[guido]
> I wouldn't go so far as raising an exception when a comparison
> involves 0xD800-0xDFFF; after all we don't raise an exception when an
> ASCII string contains 0x80-0xFF either (except when converting to
> Unicode).

that's what the "unicode context" qualifier means: 8-bit strings
can contain anything, unless you're using them as unicode strings.

> The invalidity of 0xD800-0xDFFF means that these aren't valid Unicode
> code points; it doesn't mean that we should trap all attempts to use
> these values. That way, apps that need UTF-16 awareness can code it
> themselves.
>
> Why? Because I don't want to proliferate code that explicitly traps
> 0xD800-0xDFFF throughout the code.

you only need to check "on the way in", and leave it to the decoders
to make sure they generate UCS-2 only.

the original unicode implementation did just that, but Bill and
Marc-Andre recently lowered the shields: the UTF-8 decoder now
generates UTF-16 encoded data.

(so does \N{}, but that's a non-issue)

(oddly enough, the UTF-16 decoder won't accept anything that isn't
UCS-2 ;-)

my proposal is to tighten this up in 2.0: ifdef out the UTF-16 code
in the UTF-8 decoder, and ifdef out the UTF-16 stuff in the compare
function.

let's wait until 2.1 before we support the full unicode character
set (and I'm pretty sure "the right way" is UCS-4 storage and a
unified string implementation, but let's discuss that later).

patch coming.

</F>
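
A minimal sketch of the "check on the way in" idea from the message above, written in present-day Python rather than the C of the actual decoders. The names `is_ucs2` and `reject_surrogates` are hypothetical, and 16-bit code units are modeled as plain integers; this illustrates the rule under those assumptions and is not the patch the message refers to.

```python
# Surrogate range that the guidelines above treat as invalid in UCS-2 data.
SURROGATE_MIN = 0xD800
SURROGATE_MAX = 0xDFFF


def is_ucs2(code_units):
    """Return True if every unit fits in 16 bits and none is a surrogate."""
    return all(
        0 <= unit <= 0xFFFF and not (SURROGATE_MIN <= unit <= SURROGATE_MAX)
        for unit in code_units
    )


def reject_surrogates(code_units):
    """Validate decoder output once, at the boundary (hypothetical helper).

    Raises ValueError on the first unit that is out of the 16-bit range
    or inside 0xD800-0xDFFF; otherwise returns the units as a list.
    """
    checked = []
    for index, unit in enumerate(code_units):
        if not 0 <= unit <= 0xFFFF:
            raise ValueError(f"unit {unit:#x} at index {index} is not 16-bit")
        if SURROGATE_MIN <= unit <= SURROGATE_MAX:
            raise ValueError(f"surrogate {unit:#06x} at index {index}")
        checked.append(unit)
    return checked
```

With a check like this at the decoder boundary, code further in (comparison, for instance) would not need its own 0xD800-0xDFFF traps, which is the point being argued.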