RetroSearch Browse

Home - News ( United States | United Kingdom | Italy | Germany ) - Football scores

Showing content from https://mail.python.org/pipermail/python-dev/2000-March/002697.html below:

[Python-Dev] Unicode Database Compression

[Python-Dev] Unicode Database CompressionChristian Tismer tismer@tismer.com
Tue, 21 Mar 2000 21:13:38 +0100

Previous message: [Python-Dev] Unicode and Windows
Next message: [Python-Dev] Re: Unicode Database Compression
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

Hi,

I have spent the last four days on compressing the
Unicode database.

With little decoding effort, I can bring the data down to 25kb.
This would still be very fast, since codes are randomly
accessible, although there are some simple shifts and masks.

With a bit more effort, this can be squeezed down to 15kb
by some more aggressive techniques like common prefix
elimination. Speed would be *slightly* worse, since a small
loop (average 8 cycles) is performed to obtain a character
from a packed nybble.

This is just all the data which is in Marc's unicodedatabase.c
file. I checked efficiency by creating a delimited file like
the original database text file with only these columns and
ran PkZip over it. The result was 40kb. This says that I found
a lot of correlations which automatic compressors cannot see.

Now, before generating the final C code, I'd like to ask some
questions:

What is more desirable: Low compression and blinding speed?
Or high compression and less speed, since we always want to
unpack a whole code page?

Then, what about the other database columns?
There are a couple of extra atrributes which I find coded
as switch statements elsewhere. Should I try to pack these
codes into my squeezy database, too?

And last: There are also two quite elaborated columns with
textual descriptions of the codes (the uppercase blah version
of character x). Do we want these at all? And if so, should
I try to compress them as well? Should these perhaps go
into a different source file as a dynamic module, since they
will not be used so often?

waiting for directives - ly y'rs - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home

Previous message: [Python-Dev] Unicode and Windows
Next message: [Python-Dev] Re: Unicode Database Compression
Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]

RetroSearch is an open source project built by @garambo | Open a GitHub Issue

Search and Browse the WWW like it's 1997 | Search results from DuckDuckGo

HTML: 3.2 | Encoding: UTF-8 | Version: 0.7.4