Christian Tismer wrote:
>
> Fredrik Lundh wrote:
> >
> > CT:
> > > How do I build a dist that doesn't need to change a lot of
> > > stuff in the user's installation?
> >
> > somewhere in this thread, Guido wrote:
> >
> > > BTW, I added a tag "pre-unicode" to the CVS tree to the revisions
> > > before the Unicode changes were made.
> >
> > maybe you could base SLP on that one?
>
> I have no idea how this works. Would this mean that I cannot
> get patches which come after unicode?
>
> Meanwhile, I've looked into the sources. It is easy for me
> to get rid of the problem by supplying my own unicodedata.c,
> where I replace all functions by some unimplemented exception.

No need (see my other posting): simply disable the module
altogether... this shouldn't hurt any part of the interpreter,
as the module is a user-land-only module.

> Furthermore, I wondered about the data format. Is the unicode
> database used in your package as well? Otherwise, I see
> only references from unicodedata.c, and that means the data
> structure can be massively enhanced.
> At the moment, that baby is 64k entries long, with four bytes
> and an optional string.
> This is a big waste. The strings are almost all some distinct
> <xxx> prefixes, together with a list of hex smallwords. This
> is done as strings; probably this makes 80 percent of the space.

I have made no attempt to optimize the structure (due to lack
of time, mostly)... the current implementation is really not
much different from a rewrite of the UnicodeData.txt file
available at the unicode.org site.

If you want to, I can mail you the marshalled Python dict
version of that database to play with.

> The only function that uses the "decomposition" field (namely
> the string) is unicodedata_decomposition. It does nothing
> more than wrap it in a PyObject.
> We can do a little better here. I guess I can bring it down
> to a third of this space without much effort, just by using
> - binary encoding for the <xxx> tags as an enumeration
> - binary encoding of the hexed entries
> - omission of the spaces
> Instead of 64k of structures which contain pointers anyway,
> I can use a 64k pointer array with offsets into one packed
> table.
>
> The unicodedata access functions would change *slightly*,
> just building some hex strings and so on. I guess this
> is not a time-critical section?

It may be if these functions are used in codecs, so you should
pay attention to speed too...

> Should I try this evening? :-)

Sure :-) go ahead...

--
Marc-Andre Lemburg
______________________________________________________________________
Business:                                      http://www.lemburg.com/
Python Pages:                           http://www.lemburg.com/python/
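
To make the proposal concrete, here is a minimal Python sketch of the
packing scheme Christian describes: one byte for the <xxx> tag (as an
enumeration), one byte for the element count, binary 16-bit code points
with the spaces omitted, and a 64k offset array pointing into one packed
table. The tag list, function names, and sample data below are
illustrative assumptions, not the actual unicodedata.c layout.

    # Minimal sketch of the packed decomposition table -- illustrative
    # names and data, not the actual unicodedata.c implementation.
    import struct

    # Enumerating the "<xxx>" prefixes makes each tag cost one byte.
    # Only a few of the UnicodeData.txt tags are listed here.
    TAGS = ["", "<compat>", "<font>", "<noBreak>", "<super>", "<sub>"]

    def pack(decompositions):
        """Pack {codepoint: decomposition string} into a 64k offset
        array plus one shared packed byte table."""
        offsets = [0] * 65536        # offset 0 means "no decomposition"
        table = bytearray(b"\0")     # dummy byte so real offsets are nonzero
        for cp, decomp in decompositions.items():
            fields = decomp.split()
            tag = 0
            if fields and fields[0].startswith("<"):
                tag = TAGS.index(fields.pop(0))
            offsets[cp] = len(table)
            table.append(tag)            # one byte: enumerated <xxx> tag
            table.append(len(fields))    # one byte: number of code points
            for field in fields:         # each hex word -> two binary bytes
                table += struct.pack(">H", int(field, 16))
        return offsets, bytes(table)

    def decomposition(offsets, table, cp):
        """Rebuild the original decomposition string on demand."""
        off = offsets[cp]
        if off == 0:
            return ""
        tag, count = table[off], table[off + 1]
        parts = [TAGS[tag]] if tag else []
        for i in range(count):
            (value,) = struct.unpack_from(">H", table, off + 2 + 2 * i)
            parts.append("%04X" % value)
        return " ".join(parts)

    # Two real entries from UnicodeData.txt as a smoke test.
    offsets, table = pack({0x00C0: "0041 0300",            # A-grave
                           0x00A8: "<compat> 0020 0308"})  # diaeresis
    assert decomposition(offsets, table, 0x00C0) == "0041 0300"
    assert decomposition(offsets, table, 0x00A8) == "<compat> 0020 0308"

The two sample strings shrink from 9 and 18 characters to 6 bytes of
packed data each (plus the offset entry), which is in line with the
"third of this space" estimate; the cost is a little string-building
work per lookup, which is why speed in the codec path matters.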