Christian Tismer wrote:
>
> "M.-A. Lemburg" wrote:
> >
> > "Andrew M. Kuchling" wrote:
> > >
> > > Paul Prescod writes:
> > > >The new \N escape interpolates named characters within strings. For
> > > >example, "Hi! \N{WHITE SMILING FACE}" evaluates to a string with a
> > > >unicode smiley face at the end.
> > >
> > > Cute idea, and it certainly means you can avoid looking up Unicode
> > > numbers. (You can look up names instead. :) ) Note that this means the
> > > Unicode database is no longer optional if this is done; it has to be
> > > around at code-parsing time. Python could import it automatically, as
> > > exceptions.py is imported. Christian's work on compressing
> > > unicodedatabase.c is therefore really important. (Is Perl5.6 actually
> > > dragging around the Unicode database in the binary, or is it read out
> > > of some external file or data structure?)
> >
> > Sorry to disappoint you guys, but the Unicode name and comments
> > are *not* included in the unicodedatabase.c file Christian
> > is currently working on. The reason is simple: it would add
> > huge amounts of string data to the file. So this is a no-no
> > for the core distribution...
>
> This is not settled, still an open question.

Well, ok, depends on how much you can squeeze out of the text columns ;-)
I still think that it's better to leave these gimmicks out of the core
and put them into some add-on, though.

> What I have for non-textual data:
> 25 kb with dumb compression
> 15 kb with enhanced compression

Looks good :-) With these sizes I think we could even integrate
the unicodedatabase.c + API into the core interpreter and only
have the unicodedata module to access the database from within Python.

> What amounts of data am I talking about?
> - The whole unicode database text file has size
>   632 kb.
> - With PkZip this goes down to
>   96 kb.
>
> Now, I produced another text file with just the currently
> used data in it, and this sounds so:
> - the stripped unicode text file has size
>   216 kb.
> - PkZip melts this down to
>   40 kb.
>
> Please compare that to my results above: I can do at least
> twice as good. I hope I can compete for the text sections
> as well (since this is something where zip is *good* at),
> but just let me try.
> Let's target 60 kb for the whole crap, and I'd be very pleased.
>
> Then, there is still the question where to put the data.
> Having one file in the dll and another externally would
> be an option. I could also imagine to use a binary external
> file all the time, with maximum possible compression.
> By loading this structure, this would be partially expanded
> to make it fast.
> An advantage is that the compressed Unicode database
> could become a stand-alone product. The size is in fact
> so crazy small, that I'd like to make this available
> to any other language.

You could take the unicodedatabase.c file (+ header file)
and use it everywhere... I don't think it needs to contain
any Python-specific code. The API names would have to follow
the Python naming schemes, though.

> > Still, the above is easily possible by inventing a new
> > encoding, say unicode-with-smileys, which then reads in
> > a file containing the Unicode names and applies the necessary
> > magic to decode/encode data as Paul described above.
>
> That sounds reasonable. Compression makes sense as well here,
> since the expanded stuff makes quite an amount of kb, compared
> to what it is "worth", compared to, say, the Python dll.

With 25 kB for the non-text columns, I'd suggest simply adding the
file to the core. Text columns could then go into a separate
module.

-- 
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
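
For illustration, the decode-time "magic" discussed above (turning
\N{NAME} into the named character by consulting a table of Unicode
names) can be sketched roughly as follows. This is only a sketch under
assumptions: it relies on a present-day Python 3 whose standard
unicodedata module carries the character names, and the helper name
expand_named_escapes and the regular expression are hypothetical, not
something proposed in this thread.

```python
import re
import unicodedata

# Matches \N{SOME CHARACTER NAME}. The exact pattern is an assumption
# made for this sketch, not something specified in the thread.
_NAMED_ESCAPE = re.compile(r"\\N\{([^}]+)\}")


def expand_named_escapes(text):
    """Replace each \\N{NAME} in text with the character of that name."""
    def _replace(match):
        # unicodedata.lookup() raises KeyError for an unknown name.
        return unicodedata.lookup(match.group(1))
    return _NAMED_ESCAPE.sub(_replace, text)


# A raw string keeps the escape as literal text, so the helper has
# something to expand.
raw = r"Hi! \N{WHITE SMILING FACE}"
print(expand_named_escapes(raw) == "Hi! \u263a")
```

The name table lives entirely behind unicodedata.lookup() here, which
mirrors the suggestion of keeping the text columns in a separate module
rather than in the core.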