On 8/23/2011 6:20 AM, "Martin v. Löwis" wrote:
> On 23.08.2011 11:46, Xavier Morel wrote:
>> Mostly ascii is pretty common for western-european languages (French, for
>> instance, is probably 90 to 95% ascii). It's also a risk in english, when
>> the writer "correctly" spells foreign words (résumé and the like).
>
> I know - I still question whether it is "extremely common" (so much as
> to justify a special case). I.e. on what application with what dataset
> would you gain what speedup, at the expense of what amount of extra
> lines, and potential slow-down for other datasets?
[snip]
> In the PEP 393 approach, if the string has a two-byte representation,
> each character needs to be widened to two bytes, and likewise for four
> bytes. So three separate copies of the unrolled loop would be needed,
> one for each target size.

I fully support the declared purpose of the PEP, which I understand to be
to have a full, correct Unicode implementation on all new Python releases
without paying unnecessary space (and consequent time) penalties. I think
the erroneous length, iteration, indexing, and slicing for strings with
non-BMP chars in narrow builds need to be fixed for future versions. I
think we should at least consider alternatives to the PEP 393 solution of
doubling or quadrupling the space whenever even one char requires it.

In utf16.py, attached to http://bugs.python.org/issue12729, I propose for
consideration a prototype of a different solution to the 'mostly BMP
chars, few non-BMP chars' case. Rather than expand every character from 2
bytes to 4, attach an array cpdex of character (i.e. code point, not code
unit) indexes. Then for indexing and slicing, the correction is simple,
simpler than I first expected:

  code-unit-index = char-index + bisect.bisect_left(cpdex, char-index)

where code-unit-index is the adjusted index into the full underlying
double-byte array. This adds a time penalty of log2(len(cpdex)), but
avoids most of the space penalty and the consequent time penalty of moving
more bytes around and increasing cache misses.

I believe the same idea would work for utf8 and the mostly-ascii case. The
main difference is that non-ascii chars have various byte sizes, rather
than the 1 extra double-byte of non-BMP chars in UCS-2 builds. So the
offset correction would not simply be the bisect_left return value but
would require another lookup:

  byte-index = char-index + offsets[bisect_left(cpdex, char-index)]

If possible, I would have the with-index-array versions be separate
subtypes, as in utf16.py. I believe either index-array implementation
might benefit from a subtype for strings with a single multi-unit char, as
a single non-ASCII or non-BMP char does not need an auxiliary [0] array
and a senseless lookup therein, but does need its length fixed at 1
instead of the number of base array units.

--
Terry Jan Reedy
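
For illustration, here is a minimal sketch of the bisect-based index
correction described in the message above. It is not the utf16.py
prototype itself: the class name WideString and all of its internals are
invented for the example, and it covers only construction, len(), and
single-character indexing.

import bisect

class WideString:
    """Hypothetical wrapper: UTF-16 code units plus an index of non-BMP chars."""

    def __init__(self, text):
        self.units = []   # one entry per UTF-16 code unit (surrogate pairs use 2)
        self.cpdex = []   # character indexes of chars that need 2 code units
        for char_index, ch in enumerate(text):
            cp = ord(ch)
            if cp > 0xFFFF:
                self.cpdex.append(char_index)
                cp -= 0x10000
                self.units.append(0xD800 + (cp >> 10))    # high surrogate
                self.units.append(0xDC00 + (cp & 0x3FF))  # low surrogate
            else:
                self.units.append(cp)

    def __len__(self):
        # character count = code units minus one per surrogate pair
        return len(self.units) - len(self.cpdex)

    def _unit_index(self, char_index):
        # code-unit-index = char-index + bisect_left(cpdex, char-index)
        return char_index + bisect.bisect_left(self.cpdex, char_index)

    def __getitem__(self, char_index):
        i = self._unit_index(char_index)
        unit = self.units[i]
        if 0xD800 <= unit < 0xDC00:   # high surrogate: recombine the pair
            low = self.units[i + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)

s = WideString('a\U0001F40D b\U0001F600')   # two non-BMP chars
print(len(s), s[1], s[4])                   # 5, then the two non-BMP chars

The mostly-ascii UTF-8 variant would look much the same, except that the
correction added to char-index would come from an offsets table indexed by
the bisect result, since multi-byte chars vary in length.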