Fredrik Lundh wrote:
>
> with the locale aware encoding mechanisms switched off
> (sys.getdefaultencoding() == "ascii"), I stumbled upon some
> interesting behaviour:
>
> first something that makes sense:
>
>     >>> u"abc" == "abc"
>     1
>
>     >>> u"åäö" == "abc"
>     0
>
> but how about this one:
>
>     >>> u"abc" == "åäö"
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     UnicodeError: ASCII decoding error: ordinal not in range(128)
>
> or this one:
>
>     >>> u"åäö" == "åäö"
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in ?
>     UnicodeError: ASCII decoding error: ordinal not in range(128)
>
> ignoring implementation details for a moment, is this really the
> best we can do?

This is merely due to the fact that on your Latin-1 platform, "ä" and
u"ä" map to the same ordinals. The unicode-escape codec (which is
used by the Python parser) takes single characters in the whole 8-bit
range as Unicode ordinals, so u"ä" really maps to unichr(ord("ä")).

The alternative would be forcing usage of escapes for non-ASCII
Unicode character literals and issuing an error for all non-ASCII
ones.

BTW, I have a feeling that we should mask the decoding errors during
compares in favour of returning 0...

...otherwise the current dictionary would bomb (it doesn't do any
compare error checking !) in case a Unicode string happens to have
the same hash value as an 8-bit string key. (Can't test this right
now, but this is what should happen according to the C sources.)

--
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
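
As a rough illustration of the two points above (the shared Latin-1 ordinals, and masking decoding errors during compares), here is a minimal sketch. It is written for a modern Python 3 interpreter, not the interpreter discussed in this thread, so the 8-bit string is modelled as a bytes object and the ASCII default encoding is passed explicitly. The helper name equal_masking_decode_errors and its encoding parameter are hypothetical and exist only for this example; this is not how the C implementation does it.

```python
# Sketch only: Python 3 stand-in for the 8-bit string / Unicode comparison.
# Escapes are used for the non-ASCII characters: \xe5 \xe4 \xf6 = "åäö".
text = "\xe5\xe4\xf6"
latin1_bytes = text.encode("latin-1")

# On a Latin-1 platform each byte value equals the Unicode ordinal of the
# character it stands for, which is why the two literals look identical.
same_ordinals = [ord(ch) for ch in text] == list(latin1_bytes)


def equal_masking_decode_errors(unicode_text, raw, encoding="ascii"):
    """Compare text with raw bytes; treat a decoding failure as 'not equal'.

    Hypothetical helper: it mirrors the suggestion to return 0 from the
    comparison instead of letting the decoding error propagate.
    """
    try:
        return unicode_text == raw.decode(encoding)
    except UnicodeDecodeError:
        return False


print(same_ordinals)
print(equal_masking_decode_errors("abc", b"abc"))
print(equal_masking_decode_errors("abc", latin1_bytes))
print(equal_masking_decode_errors(text, latin1_bytes))
print(equal_masking_decode_errors(text, latin1_bytes, encoding="latin-1"))
```

With the error masked, a dictionary lookup that ends up comparing a Unicode key against an undecodable 8-bit key would simply see "not equal" and move on, at the cost of silently hiding the fact that the two strings could not be compared at all.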