On 6/4/2014 2:28 PM, Chris Angelico wrote:
> On Thu, Jun 5, 2014 at 6:50 AM, Glenn Linderman <v+python at g.nevcal.com> wrote:
>> 8) (Content specific variable size caches) Index each codepoint that is a
>> different byte size than the previous codepoint, allowing indexing to be
>> used in the intervals. Worst case size is like 2, best case size is a single
>> entry for the end, when all code points are represented by the same number
>> of bytes.
> Conceptually interesting, and I'd love to know how well that'd perform
> in real-world usage.

So would I :)

> Would do very nicely on blocks of text that are
> all from the same range of codepoints, but if you intersperse high and
> low codepoints it'll be like 2 but with significantly more complicated
> lookups (imagine a "name=value\nname=value\n" stream where the names
> and values are all in the same language - you'll have a lot of
> transitions).

Lookup would be a binary search on code point index, or a search for the
same in some tree structure, I would think.

"like 2 but ..." well, the data structure would be bigger than for 2,
but your example shows 4-5 high codepoints per low codepoint (for some
languages).

I did just think of another refinement to this technique (my list was
not intended to be all-inclusive... just a bunch of variations I thought
of then).

10) (Content specific variable size caches) Like 8, but the last
character in a run is allowed (but not required) to be a different
number of bytes than prior characters, because the offset calculation
will still work for the first character of a different size.

So #10 would halve the size of the index for your imagined stream that
intersperses one low-byte character with each sequence of high-byte
characters.
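To make the lookup concrete, here is a rough Python sketch of variations 8 and 10 as I read them. It is an illustration under stated assumptions, not a proposal for the actual implementation: the text is assumed to be stored as UTF-8, each index entry is assumed to hold (first code point index, byte offset, byte width) for one run, and the names build_runs, byte_offset, and absorb_last are made up for this example.

```python
from bisect import bisect_right


def utf8_width(ch):
    """Number of bytes one code point occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def build_runs(text, absorb_last=False):
    """Build the run index as parallel lists (starts, offsets, widths).

    absorb_last=False: variation 8, a new entry at every width change.
    absorb_last=True:  variation 10, the first odd-sized character is kept
                       as the last character of the current run.
    (Hypothetical layout; lone surrogates are not handled.)
    """
    starts, offsets, widths = [], [], []
    byte_pos = 0
    closed = True  # True when the next character must open a new run
    for i, ch in enumerate(text):
        w = utf8_width(ch)
        if closed:
            starts.append(i)
            offsets.append(byte_pos)
            widths.append(w)
            closed = False
        elif w != widths[-1]:
            if absorb_last:
                # Its offset is still start offset + k * width, because every
                # character before it in the run has the run's width.
                closed = True
            else:
                starts.append(i)
                offsets.append(byte_pos)
                widths.append(w)
        byte_pos += w
    return starts, offsets, widths


def byte_offset(runs, index):
    """Byte offset of code point `index`: binary search, then arithmetic."""
    starts, offsets, widths = runs
    r = bisect_right(starts, index) - 1  # run containing `index`
    return offsets[r] + (index - starts[r]) * widths[r]


# The "name=value\n" stream with 2-byte names and values (Cyrillic here).
text = "имя=значение\nимя=значение\n"
runs8 = build_runs(text)
runs10 = build_runs(text, absorb_last=True)
print(len(runs8[0]), len(runs10[0]))
```

On this particular stream, the variation 10 index comes out at half the number of entries of the variation 8 index, since every single low-byte separator is absorbed into the run before it. The lookup code is the same for both variations.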