Nick Coghlan, 24.08.2011 15:06:
> On Wed, Aug 24, 2011 at 10:46 AM, Terry Reedy wrote:
>> In utf16.py, attached to http://bugs.python.org/issue12729
>> I propose for consideration a prototype of different solution to the 'mostly
>> BMP chars, few non-BMP chars' case. Rather than expand every character from
>> 2 bytes to 4, attach an array cpdex of character (ie code point, not code
>> unit) indexes. Then for indexing and slicing, the correction is simple,
>> simpler than I first expected:
>> code-unit-index = char-index + bisect.bisect_left(cpdex, char_index)
>> where code-unit-index is the adjusted index into the full underlying
>> double-byte array. This adds a time penalty of log2(len(cpdex)), but avoids
>> most of the space penalty and the consequent time penalty of moving more
>> bytes around and increasing cache misses.
>
> Interesting idea, but putting on my C programmer hat, I say -1.
>
> Non-uniform cell size = not a C array = standard C array manipulation
> idioms don't work = pain (no matter how simple the index correction
> happens to be).
>
> The nice thing about PEP 383 is that it gives us the smallest storage
> array that is both an ordinary C array and has sufficiently large
> individual elements to handle every character in the string.

+1

Stefan
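
For readers following the thread, here is a minimal sketch of the index correction Terry describes. It is not the code from utf16.py attached to the issue. The helper names (encode_utf16_units, code_unit_index, char_at) are made up for illustration, and it assumes cpdex holds the sorted character indexes of the non-BMP characters, which is how the quoted formula reads.

```python
import bisect


def encode_utf16_units(text):
    """Return (units, cpdex) for text.

    units is a list of UTF-16 code units; cpdex is the sorted list of
    character (code point) indexes at which a non-BMP character sits.
    Both names are illustrative assumptions, not taken from utf16.py.
    """
    units = []
    cpdex = []
    for char_index, ch in enumerate(text):
        cp = ord(ch)
        if cp > 0xFFFF:
            # Non-BMP: remember where it is and store a surrogate pair.
            cpdex.append(char_index)
            cp -= 0x10000
            units.append(0xD800 + (cp >> 10))
            units.append(0xDC00 + (cp & 0x3FF))
        else:
            units.append(cp)
    return units, cpdex


def code_unit_index(cpdex, char_index):
    """Apply the correction from the post: each non-BMP character located
    strictly before char_index takes one extra code unit."""
    return char_index + bisect.bisect_left(cpdex, char_index)


def char_at(units, cpdex, char_index):
    """Fetch the character at char_index from the code-unit list."""
    pos = bisect.bisect_left(cpdex, char_index)
    i = char_index + pos
    if pos < len(cpdex) and cpdex[pos] == char_index:
        # This character is recorded as non-BMP: decode the surrogate pair.
        high, low = units[i], units[i + 1]
        return chr(0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00))
    return chr(units[i])


text = "a\U0001F600b\U00010000c"
units, cpdex = encode_utf16_units(text)
recovered = "".join(char_at(units, cpdex, i) for i in range(len(text)))
```

Each lookup costs one bisect over cpdex, which is the log2(len(cpdex)) penalty mentioned in the post. The units list has variable-width cells, which is exactly the property Nick objects to from the C side.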