[/F]
> here's the description:

Thanks.

> From: "Fredrik Lundh" <effbot@telia.com>
> Date: Sun, 16 Jul 2000 20:40:46 +0200
>
> /.../
>
> The unicodenames database consists of two parts: a name
> database which maps character codes to names, and a code
> database, mapping names to codes.
>
> * The Name Database (getname)
>
> First, the 10538 text strings are split into 42193 words,
> and combined into a 4949-word lexicon (a 29k array).

I only added a word to the lexicon if it was used more than once and
if its length was larger than the lexicon index. I ended up with 1385
entries in the lexicon (a 7k array).

> Each word is given a unique index number (common words get
> lower numbers), and there's a "lexicon offset" table mapping
> from numbers to words (10k).

My lexicon offset table is 3k, and I also use 4k on a perfect hash of
the words.

> To get back to the original text strings, I use a "phrase
> book". For each original string, the phrase book stores a
> list of word numbers. Numbers 0-127 are stored in one byte,
> higher numbers (less common words) use two bytes. At this
> time, about 65% of the words can be represented by a single
> byte. The result is a 56k array.

Because not all words are looked up in the lexicon, I used the values
0-38 for the letters and digits, 39-250 for a one-byte lexicon index,
and 251-255 combined with the following byte to form a two-byte
index. This also results in a 57k array.

So far these are only minor variations.

> The final data structure is an offset table, which maps code
> points to phrase book offsets. Instead of using one big
> table, I split each code point into a "page number" and a
> "line number" on that page.
>
>     offset = line[ (page[code>>SHIFT]<<SHIFT) + (code&MASK) ]
>
> Since the unicode space is sparsely populated, it's possible
> to split the code so that lots of pages get no contents. I
> use a brute force search to find the optimal SHIFT value.
>
> In the current database, the page table has 1024 entries
> (SHIFT is 6), and there are 199 unique pages in the line
> table. The total size of the offset table is 26k.
>
> * The code database (getcode)
>
> For the code table, I use a straightforward hash table to store
> name to code mappings. It's basically the same implementation
> as in Python's dictionary type, but with a different hash
> algorithm. The table lookup loop simply uses the name database
> to check for hits.
>
> In the current database, the hash table is 32k.

I chose to split a unicode name into words even when looking one up.
Each word is hashed to a lexicon index, and a "phrase book string" is
created from the indexes. The sorted phrase book is then searched
with a binary search among the 858 entries that can be addressed
directly, followed by a sequential search among 12 entries. The
phrase book search index is 8k, and a table that maps phrase book
indexes to codepoints is another 20k.

The searching I do makes jython slower than the direct calculation
you do. I'll take another look at this after jython 2.0 to see if I
can improve performance with your page/line number scheme and a total
hashing of all the unicode names.

regards,
finn
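
As an illustration of the phrase book Fredrik describes, here is a
minimal Python sketch of decoding one entry. The one-byte/two-byte
split follows the description above, but the bit layout of the
two-byte word numbers, the explicit length argument, and all table
contents are assumptions made for the example:

    # Decode one phrase book entry back into a name. The lexicon and
    # the encoded bytes are invented; only the "0-127 in one byte,
    # higher word numbers in two bytes" rule comes from the message.
    lexicon = ["LATIN", "LETTER", "SMALL", "CAPITAL", "A"]

    def decode_name(phrasebook, offset, length):
        words = []
        end = offset + length
        while offset < end:
            word = phrasebook[offset]
            offset += 1
            if word >= 128:
                # assumed layout: low 7 bits are the high byte of a
                # two-byte word number, next byte is the low byte
                word = ((word & 0x7f) << 8) | phrasebook[offset]
                offset += 1
            words.append(lexicon[word])
        return " ".join(words)

    # "LATIN CAPITAL LETTER A" as four one-byte word numbers:
    print(decode_name(bytes([0, 3, 1, 4]), 0, 4))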
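
Finn's variant spells rare words with literal characters instead of
lexicon entries. Below is a hypothetical decoder for the byte ranges
he gives (0-38, 39-250, 251-255); the literal-character alphabet, the
index arithmetic for the two-byte case, and the table contents are
all guesses, only the ranges come from the message:

    # Byte ranges, per the message above:
    #   0-38    literal characters (assumed character set)
    #   39-250  one-byte lexicon index (212 values)
    #   251-255 first byte of a two-byte lexicon index
    ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789-("  # 39 chars, assumed
    lexicon = ["LATIN", "LETTER", "SMALL"]                # invented

    def decode_name(data):
        parts = []
        i = 0
        while i < len(data):
            b = data[i]
            i += 1
            if b <= 38:
                parts.append(ALPHABET[b])        # literal character
            elif b <= 250:
                parts.append(lexicon[b - 39])    # one-byte index
            else:
                # assumed: two-byte indexes continue where the
                # one-byte range ends (212 + 5*256 = 1492 possible
                # words, comfortably covering the 1385 entries)
                index = 212 + (b - 251) * 256 + data[i]
                i += 1
                parts.append(lexicon[index])
        return "".join(parts)

    # "LATIN SMALL LETTER A", with the spaces and the trailing "A"
    # stored as literal characters:
    print(decode_name(bytes([39, 0, 41, 0, 40, 0, 1])))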
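
The two-level offset table is easy to show in miniature. Here is a
Python sketch of the lookup formula quoted above; SHIFT matches the
value Fredrik reports, but the tiny page and line tables are
invented:

    SHIFT = 6
    MASK = (1 << SHIFT) - 1

    # page[] maps the high bits of a code point to a page number.
    # Sparse regions can all share one empty page, which is what
    # keeps the tables small. These contents are made up.
    page = [0, 1, 0, 2]
    line = [0] * (3 << SHIFT)
    line[(1 << SHIFT) + 5] = 1234  # pretend U+0045 has phrase offset 1234

    def phrasebook_offset(code):
        return line[(page[code >> SHIFT] << SHIFT) + (code & MASK)]

    assert phrasebook_offset(0x45) == 1234
    assert phrasebook_offset(0x05) == 0

The brute-force search for SHIFT then presumably just tries each
candidate value and keeps the one that minimizes the combined size of
the page and line tables.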
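
Finally, a sketch of the name-to-code lookup Finn describes: encode
the query the same way the phrase book entries are encoded, then
binary-search the sorted entries. Here bisect stands in for the
hand-rolled search, a plain dict stands in for the perfect word hash,
and the two-level 858-entry/12-entry structure is collapsed into one
sorted list; all data is invented:

    import bisect

    word_index = {"LATIN": 0, "LETTER": 1, "SMALL": 2,
                  "CAPITAL": 3, "A": 4}              # invented

    # (encoded name, code point) pairs, sorted by the encoded form;
    # encoding uses the one-byte 39-250 range from the sketch above
    # (word separators and literal characters omitted for brevity)
    entries = sorted([
        (bytes([39, 42, 40, 43]), 0x41),  # LATIN CAPITAL LETTER A
        (bytes([39, 41, 40, 43]), 0x61),  # LATIN SMALL LETTER A
    ])
    keys = [key for key, code in entries]

    def getcode(name):
        try:
            key = bytes(39 + word_index[w] for w in name.split())
        except KeyError:
            return None
        i = bisect.bisect_left(keys, key)
        if i < len(keys) and keys[i] == key:
            return entries[i][1]
        return None

    print(hex(getcode("LATIN SMALL LETTER A")))  # 0x61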