Ka-Ping Yee wrote:
>
> On Tue, 4 Jul 2000, M.-A. Lemburg wrote:
> >
> > The reasoning at the time was that dictionaries should accept
> > Unicode objects as keys which match their string equivalents
> > as the same key, e.g. 'abc' works just as well as u'abc'.
> [...]
> > According to the docs, objects comparing equal should have the
> > same hash value, yet this would require the hash value to be
> > calculated using the default encoding and that
> > would not only cause huge performance problems, but could
> > effectively render Unicode useless,
>
> Given the new 7-bit-ASCII-as-default-encoding-for-8-bit-strings
> convention, shouldn't just hashing the character values work
> fine?  That is, hash('abc') should == hash(u'abc'), no conversion
> required.

Yes, and it does so already for pure ASCII values. The problem
comes from the fact that the default encoding can be changed to
a locale specific value (site.py does the lookup for you), e.g.
given you have defined LANG to be us_en, Python will default
to Latin-1 as default encoding.

This results in 'äöü' == u'äöü', but hash('äöü') != hash(u'äöü'),
which is in conflict with the general rule about objects having
the same hash value if they compare equal.

Now, I could go and change the internal cache buffer to hold
the default encoding instead of UTF-8, but this would affect
not only hash(), but also the 's' and 't' parser markers, etc.

... I wonder why compiling "print u'\uD800'" causes the
hash value to be computed ...

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
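
To make the rule above concrete (objects that compare equal should
have the same hash value), here is a minimal sketch of what goes wrong
for dictionary lookups when the rule is broken. It is written for
Python 3, where the 8-bit string / Unicode split discussed in this
thread no longer exists, so it uses two hypothetical stand-in classes,
BadKey and GoodKey, which are assumptions for illustration and not
anything from the interpreter sources.

```python
# Hypothetical illustration: BadKey compares equal to a str but hashes
# differently, mimicking 'äöü' == u'äöü' with unequal hash values.

class BadKey:
    """Equal to its text, but with a hash that never matches the text's."""

    def __init__(self, text):
        self.text = text

    def __eq__(self, other):
        if isinstance(other, BadKey):
            return self.text == other.text
        return self.text == other

    def __hash__(self):
        # Bitwise NOT guarantees a value different from hash(self.text).
        return ~hash(self.text)


class GoodKey(BadKey):
    """Same equality, but the hash agrees with the equal str."""

    def __hash__(self):
        return hash(self.text)


table = {"äöü": 1}

# Equal by ==, yet the dictionary cannot find it: lookup compares the
# stored hash first and never gets as far as calling __eq__.
bad_equal = BadKey("äöü") == "äöü"
bad_found = BadKey("äöü") in table

# With matching hashes the equal object is found as expected.
good_found = GoodKey("äöü") in table

print(bad_equal, bad_found, good_found)
```

With a key like BadKey, a dictionary silently treats equal objects as
different keys, which is the situation described above for non-ASCII
strings under a changed default encoding.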