Brian Kelley wrote:

> I am using a dictionary to hold
>
>     {string: [list of integers]}
>     key:     values
>
> The dictionary is randomly accessed to add an integer to the list
> referenced by the string.
>
> I have been exploring several different caching strategies, ranging
> from bsddb3 to the metakit toolkit, for caching/retrieving these to
> disk.  One of the problems that I am facing is that because the
> dictionary is randomly accessed, once the database gets to a certain
> size performance goes WAY down with every strategy that I have tried.
> (Note that integers are never removed, if that helps.)  I have had
> success speed-wise with MySQL, but I simply cannot deliver that in my
> application; it's way too costly to administer.
>
> Has anyone examined a similar problem?  Thanks in advance.
>
> Brian Kelley

Here I go, shooting from the hip without knowing all of the facts. (To give a proper answer I really should know how many bytes each key/pointer pair takes, how many records the database will contain, and at what point you hit the performance wall.)

Assuming that data access is truly random, the least expensive way of storing and accessing (assuming you are going to roll your own and not go with an existing product) is to add all new records at the end of the file, mark -- but do not remove -- deleted records, and build indexes to access the records.

The trick is building and maintaining the indexes. The solution is to build a multi-level index structure that is not fully populated. Each index entry consists of three fields: the unique key for the record, the byte offset of the record in the file, and a one-byte indicator telling whether the entry addresses a lower-level index block or a data record. The data and each level of the index should live in separate files.

Assume that each such entry is 32 bytes long and that your OS block size is 8K. Rather than fully populating each block with 256 entries, put only 250 entries in each index block.

To build the index, create a sorted list for the entire database and populate the bottom level of the index from it. Next, build a sorted list consisting of the first entry in each block of that lower level and use it to build the second-level index. (Note that in this and any higher-level index block the indicator must be set to show that the entry points to an index block, not a data block.)

At this point consider the following: each block in the second-level index points to 250 blocks in the first-level index, each of which in turn points to 250 data records. If the second-level index is held in memory, each second-level block therefore lets you reach 250 x 250 = 62,500 records with no more than two disk accesses. If a third level of index is needed, you will be able to access any one of 250^3 = 15,625,000 records with no more than three disk accesses.

The reason for not fully populating the database and the indexes is to allow for expansion without total reorganization. The weak spot of the scheme is that once an index block's six spare slots are used up, the next insertion forces you to split the block -- that is, create a new index block and move half of the entries of the full block into it. In general this should only happen at the first level, but you must still make provision for either splitting or total index reorganization.

Hope this helps. Also, if you do implement it please contribute the code; to get you started, rough sketches of the build, the lookup, and the split follow below.

--
Jerry Gitomer
Once I learned how to spell DBA, I became one
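Since the original question was about Python, here is a minimal sketch of the build-and-lookup half of the scheme in that language. The 23-byte key / 8-byte offset / 1-byte indicator split is just one way to fill the 32-byte entry, and all the file names and helper names here are mine, not part of any existing library:

    import struct
    from bisect import bisect_right

    BLOCK_SIZE = 8192                  # the 8K OS block size assumed above
    ENTRY_SIZE = 32
    FILL = 250                         # 256 slots per block, 6 left free

    KEY_LEN = 23                       # 23 + 8 + 1 = 32-byte entries
    TO_DATA, TO_INDEX = 0, 1           # the one-byte indicator field
    ENTRY_FMT = "<%dsQB" % KEY_LEN     # key, byte offset, indicator

    def pack_entry(key, offset, indicator):
        # keys longer than KEY_LEN bytes would be silently truncated here
        return struct.pack(ENTRY_FMT, key.ljust(KEY_LEN, b"\0"),
                           offset, indicator)

    def unpack_block(raw):
        entries = []
        for i in range(0, len(raw), ENTRY_SIZE):
            key, offset, ind = struct.unpack_from(ENTRY_FMT, raw, i)
            if key == b"\xff" * KEY_LEN:       # hit the block's padding
                break
            entries.append((key.rstrip(b"\0"), offset, ind))
        return entries

    def build_level(children, indicator, path):
        """Write one index level, FILL entries per 8K block, and return
        the (first key, block offset) pairs for the level above."""
        parents = []
        with open(path, "wb") as f:
            for i in range(0, len(children), FILL):
                chunk = children[i:i + FILL]
                parents.append((chunk[0][0], f.tell()))
                block = b"".join(pack_entry(k, o, indicator)
                                 for k, o in chunk)
                f.write(block.ljust(BLOCK_SIZE, b"\xff"))   # pad to 8K
        return parents

    def build_index(sorted_records, basename):
        """sorted_records: (key, data-file offset) pairs sorted by key.
        Builds basename.0 (bottom level) upward; returns the level file
        names plus the small top level, which you hold in memory.
        (The sketch assumes at least one on-disk level results.)"""
        paths, entries, indicator = [], list(sorted_records), TO_DATA
        while len(entries) > FILL:
            path = "%s.%d" % (basename, len(paths))
            entries = build_level(entries, indicator, path)
            paths.append(path)
            indicator = TO_INDEX       # higher levels point at index blocks
        return paths, entries

    def lookup(key, top_entries, level_paths):
        """One disk read per on-disk index level, as promised above.
        Returns the record's byte offset in the data file, or None."""
        pos = bisect_right([k for k, _ in top_entries], key) - 1
        if pos < 0:
            return None                # sorts before every key we have
        offset = top_entries[pos][1]
        for path in reversed(level_paths):     # walk top level first
            with open(path, "rb") as f:
                f.seek(offset)
                entries = unpack_block(f.read(BLOCK_SIZE))
            pos = bisect_right([k for k, _, _ in entries], key) - 1
            k, offset, ind = entries[pos]
            if ind == TO_DATA:
                return offset if k == key else None
        return None

For 100,000 records this yields two levels on disk plus a two-entry top level in memory, so any record is reached in two disk reads:

    records = sorted((b"key%06d" % i, i * 64) for i in range(100000))
    paths, top = build_index(records, "kelley.idx")
    print(lookup(b"key001234", top, paths))    # -> 78976, two disk reads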
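The split that the last paragraph warns about is simple once an index block lives in memory as a sorted list of (key, offset, indicator) tuples. A sketch of just that piece -- the caller still has to write the new block out to the level file and post its first key to the parent block, which I leave out:

    from bisect import insort

    SLOTS_PER_BLOCK = 8192 // 32       # 256 slots in an 8K block

    def insert_entry(block, entry):
        """Insert (key, offset, indicator) into a sorted block; on
        overflow -- the six spare slots left by the 250/256 fill are
        gone -- keep the lower half here and hand back the upper half
        as the new block. Returns None when no split was needed."""
        insort(block, entry)           # tuples compare by key first
        if len(block) <= SLOTS_PER_BLOCK:
            return None
        mid = len(block) // 2
        new_block = block[mid:]        # half of the entries move out
        del block[mid:]
        return new_block

    block = [(b"k%04d" % i, i, 0) for i in range(0, 512, 2)]  # 256: full
    spill = insert_entry(block, (b"k0111", 111, 0))
    # spill holds the upper half; its first key goes into the parent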