Hi,

I have spent the last four days on compressing the Unicode database.

With little decoding effort, I can bring the data down to 25kb. This would still be very fast, since codes are randomly accessible, although there are some simple shifts and masks.

With a bit more effort, this can be squeezed down to 15kb by some more aggressive techniques like common prefix elimination. Speed would be *slightly* worse, since a small loop (average 8 cycles) is performed to obtain a character from a packed nybble.

This is just all the data which is in Marc's unicodedatabase.c file. I checked efficiency by creating a delimited file like the original database text file with only these columns and ran PkZip over it. The result was 40kb. This says that I found a lot of correlations which automatic compressors cannot see.

Now, before generating the final C code, I'd like to ask some questions:

What is more desirable: Low compression and blinding speed? Or high compression and less speed, since we always want to unpack a whole code page?

Then, what about the other database columns? There are a couple of extra attributes which I find coded as switch statements elsewhere. Should I try to pack these codes into my squeezy database, too?

And last: There are also two quite elaborate columns with textual descriptions of the codes (the uppercase blah version of character x). Do we want these at all? And if so, should I try to compress them as well? Should these perhaps go into a different source file as a dynamic module, since they will not be used so often?

waiting for directives - ly y'rs - chris

--
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home
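
For readers who want a concrete picture of the "randomly accessible, simple shifts and masks" variant mentioned above, here is a minimal sketch in C. It is not the actual layout of the compressed database, which the message does not describe. The two-stage table, the block size, the toy data, and all names (`block_index`, `packed_blocks`, `lookup_nybble`) are assumptions made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy layout: 64 code points, one 4-bit value each. */
#define BLOCK_SHIFT 4u                          /* 16 code points per block */
#define BLOCK_MASK  ((1u << BLOCK_SHIFT) - 1u)
#define NUM_CODES   64u

/* Stage 1: block number -> index of a packed block.
 * Identical blocks are stored once and shared (here blocks 0/2 and 1/3). */
static const uint8_t block_index[NUM_CODES >> BLOCK_SHIFT] = { 0, 1, 0, 1 };

/* Stage 2: 16 nybbles per block, two per byte.
 * Even offsets live in the low nybble, odd offsets in the high nybble. */
static const uint8_t packed_blocks[2][8] = {
    { 0x10, 0x32, 0x54, 0x76, 0x98, 0xBA, 0xDC, 0xFE }, /* values 0..15 */
    { 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55, 0x55 }  /* all values 5 */
};

/* Random access to one value: two table reads, a few shifts and masks. */
static unsigned lookup_nybble(unsigned code)
{
    assert(code < NUM_CODES);
    unsigned block  = block_index[code >> BLOCK_SHIFT];
    unsigned offset = code & BLOCK_MASK;
    uint8_t  byte   = packed_blocks[block][offset >> 1];
    return (offset & 1u) ? (unsigned)(byte >> 4) : (unsigned)(byte & 0x0Fu);
}
```

In this toy layout, sharing identical blocks through the first-stage index is where the space saving comes from, and every lookup costs the same small, fixed amount of work. The more aggressive 15kb variant with common prefix elimination would presumably need a short decoding loop instead, which this sketch does not attempt to show.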