MAL wrote:
>"Andrew M. Kuchling" wrote:
>>
>> Paul Prescod writes:
>>>The new \N escape interpolates named characters within strings. For
>>>example, "Hi! \N{WHITE SMILING FACE}" evaluates to a string with a
>>>unicode smiley face at the end.
>>
>> Cute idea, and it certainly means you can avoid looking up Unicode
>> numbers. (You can look up names instead. :) ) Note that this means the
>> Unicode database is no longer optional if this is done; it has to be
>> around at code-parsing time. Python could import it automatically, as
>> exceptions.py is imported. Christian's work on compressing
>> unicodedatabase.c is therefore really important. (Is Perl5.6 actually
>> dragging around the Unicode database in the binary, or is it read out
>> of some external file or data structure?)
>
> Sorry to disappoint you guys, but the Unicode name and comments
> are *not* included in the unicodedatabase.c file Christian
> is currently working on. The reason is simple: it would add
> huge amounts of string data to the file. So this is a no-no
> for the core distribution...
>

Ok, now you're just being silly. It's possible to put the character
names in a separate structure so that they don't automatically get
paged in with the normal Unicode character property data. If you never
use it, it won't get paged in, it's that simple....

Looking up the Unicode code value from the Unicode character name
smells like a good time to use gperf to generate a perfect hash
function for the character names. Esp. for the Unicode 3.0 character
namespace. Then you can just store the hashkey -> Unicode character
mapping, and hardly ever need to page in the actual full character
name string itself.

I haven't looked at what the comment field contains, so I have no idea
how useful that info is.

*waits while gperf crunches through the ~10,550 Unicode characters
where this would be useful*

Bill
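
For illustration only, here is a rough Python sketch of the "store the hashkey -> Unicode character mapping" idea. It is not gperf and not anything from unicodedatabase.c: it assumes a modern Python 3 with the unicodedata module, and it stands in for a generated perfect hash by trying salts on a truncated BLAKE2 digest until no two names collide. The helper names (name_hash, build_table, lookup) and the range(0x3000) subset are made up for the example.

```python
import hashlib
import unicodedata


def name_hash(name, salt):
    """Stable 32-bit hash of a character name, varied by an integer salt."""
    digest = hashlib.blake2b(
        name.encode("ascii"), digest_size=4, salt=salt.to_bytes(2, "big")
    ).digest()
    return int.from_bytes(digest, "big")


def build_table(code_points):
    """Search for a collision-free salt; return (salt, {hash: code point}).

    This brute-force search is a stand-in for what gperf does properly.
    """
    names = []
    for cp in code_points:
        name = unicodedata.name(chr(cp), None)  # None for unnamed characters
        if name is not None:
            names.append((name, cp))
    for salt in range(1000):
        table = {name_hash(name, salt): cp for name, cp in names}
        if len(table) == len(names):  # no two names shared a hash
            return salt, table
    raise RuntimeError("no collision-free salt found")


def lookup(name, salt, table):
    """Map a name to its character using only the hash table."""
    cp = table.get(name_hash(name, salt))
    if cp is None:
        raise KeyError(name)
    return chr(cp)


# Hypothetical subset: only code points below U+3000, to keep the demo quick.
SALT, TABLE = build_table(range(0x3000))
print(lookup("WHITE SMILING FACE", SALT, TABLE))
```

Note that the table holds no name strings at all, so a misspelled name whose hash happens to land on an occupied slot would be accepted. Catching that case is when the full name string would still have to be consulted.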