"M.-A. Lemburg" wrote:
>
> "Andrew M. Kuchling" wrote:
> >
> > Paul Prescod writes:
> > >The new \N escape interpolates named characters within strings. For
> > >example, "Hi! \N{WHITE SMILING FACE}" evaluates to a string with a
> > >unicode smiley face at the end.
> >
> > Cute idea, and it certainly means you can avoid looking up Unicode
> > numbers. (You can look up names instead. :) ) Note that this means the
> > Unicode database is no longer optional if this is done; it has to be
> > around at code-parsing time. Python could import it automatically, as
> > exceptions.py is imported. Christian's work on compressing
> > unicodedatabase.c is therefore really important. (Is Perl5.6 actually
> > dragging around the Unicode database in the binary, or is it read out
> > of some external file or data structure?)
>
> Sorry to disappoint you guys, but the Unicode name and comments
> are *not* included in the unicodedatabase.c file Christian
> is currently working on. The reason is simple: it would add
> huge amounts of string data to the file. So this is a no-no
> for the core distribution...

This is not settled; it is still an open question.
What I have for non-textual data:
  25 kb with dumb compression
  15 kb with enhanced compression

What amounts of data am I talking about?
- The whole unicode database text file has size 632 kb.
- With PkZip this goes down to 96 kb.

Now, I produced another text file with just the currently
used data in it, and it looks like this:
- the stripped unicode text file has size 216 kb.
- PkZip melts this down to 40 kb.

Please compare that to my results above: I can do at least
twice as well. I hope I can compete for the text sections
as well (since this is something zip is *good* at),
but just let me try.
Let's target 60 kb for the whole crap, and I'd be very pleased.

Then, there is still the question of where to put the data.
Having one file in the dll and another externally would
be an option. I could also imagine using a binary external
file all the time, with maximum possible compression.
On loading, this structure would be partially expanded
to make it fast.
An advantage is that the compressed Unicode database
could become a stand-alone product. The size is in fact
so crazy small that I'd like to make this available
to any other language.

> Still, the above is easily possible by inventing a new
> encoding, say unicode-with-smileys, which then reads in
> a file containing the Unicode names and applies the necessary
> magic to decode/encode data as Paul described above.

That sounds reasonable. Compression makes sense here as well,
since the expanded data takes up quite a few kb relative to
what it is "worth", compared to, say, the Python dll.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer@appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaunstr. 26                  :    *Starship* http://starship.python.net
14163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home
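
For anyone who wants to poke at the two technical points in this message, here is a minimal sketch, not anything taken from the thread itself. It assumes a current Python 3, where the \N escape and the unicodedata module are available; the names `sizes` and `ratio` are made up for illustration, and pairing the 15 kb figure with the 216 kb stripped file is only one reading of the numbers quoted above.

```python
import unicodedata

# The \N{...} escape is resolved by character name when the literal is compiled.
smiley = "Hi! \N{WHITE SMILING FACE}"

# The same name table can be queried at run time through unicodedata.
assert smiley[-1] == unicodedata.lookup("WHITE SMILING FACE")
assert unicodedata.name(smiley[-1]) == "WHITE SMILING FACE"

# Sizes quoted in the message, in kb: (original, compressed).
# Assumption: the 15 kb result is compared against the 216 kb stripped file.
sizes = {
    "full database, PkZip": (632, 96),
    "stripped database, PkZip": (216, 40),
    "stripped database, enhanced compression": (216, 15),
}


def ratio(original_kb, compressed_kb):
    """Return the compressed size as a fraction of the original size."""
    return compressed_kb / original_kb


for label, (original_kb, compressed_kb) in sizes.items():
    print(f"{label}: {ratio(original_kb, compressed_kb):.1%} of original")
```

Under that reading, the enhanced compression result is well under half the PkZip size for the same stripped data, which is consistent with the "at least twice" claim in the message.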