Bill Tutt wrote:
>
> > MAL wrote:
> >> Bill wrote:
> >> u'\ud800' causes the interpreter to crash
> >> example:
> >> print u'\ud800'
> >> What happens:
> >> The code fails to compile because while adding the constant, the unicode_hash
> >> function is called which for some reason requires the UTF-8 string format.
> >
> > The reasoning at the time was that dictionaries should accept
> > Unicode objects as keys which match their string equivalents
> > as the same key, e.g. 'abc' works just as well as u'abc'.
> >
> > UTF-8 was the default encoding back then. I'm not sure how
> > to fix the hash value given the new strategy w/r to the
> > default encoding...
> >
> > According to the docs, objects comparing equal should have the
> > same hash value, yet this would require the hash value to be
> > calculated using the default encoding and that
> > would not only cause huge performance problems, but could
> > effectively render Unicode useless, because not all default
> > encodings are lossless (ok, one could work around this by
> > falling back to some other way of calculating the hash
> > value in case the conversion fails).
>
> Yeah, yeah, yeah. I know all that, just never liked it. :)
> The current problem is that the UTF-8 can't round trip surrogate characters
> atm.
> This is easy to fix, so I'm doing a patch to fix this oversight, unless you
> beat me to it.
>
> Anything else is slightly more annoying to fix.

I left out surrogates in the UTF-8 codec on purpose: the Python
implementation currently doesn't have support for these, e.g.
slicing doesn't care about UTF-16 surrogates, so I made sure
that people using these get errors ;-)

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
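For anyone reading this thread in the archive, here is a minimal sketch of the surrogate round-trip issue discussed above. It assumes a present-day Python 3 interpreter, not the interpreter the thread is about, so it only illustrates the idea: a strict UTF-8 codec refuses a lone surrogate such as U+D800, while an opt-in error handler lets it round-trip. The helper names are hypothetical and exist only for this illustration.

```python
# Assumption: Python 3 semantics, NOT the interpreter discussed in the thread.
# Helper names below are hypothetical and only serve this illustration.

LONE_SURROGATE = "\ud800"  # an unpaired high surrogate, as in the bug report


def strict_utf8_rejects(text):
    """Return True if the default (strict) UTF-8 codec refuses to encode text."""
    try:
        text.encode("utf-8")
    except UnicodeEncodeError:
        return True
    return False


def round_trips_with_surrogatepass(text):
    """Encode and decode with the 'surrogatepass' handler and compare."""
    raw = text.encode("utf-8", "surrogatepass")
    return raw.decode("utf-8", "surrogatepass") == text


if __name__ == "__main__":
    print(strict_utf8_rejects(LONE_SURROGATE))
    print(round_trips_with_surrogatepass(LONE_SURROGATE))
```

In this sketch the strict codec plays the role of the "people using these get errors" behaviour, and `surrogatepass` plays the role of a codec that can round-trip surrogates; neither is claimed to match the patch mentioned in the thread.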