Brian Kelley wrote:

> I am using a dictionary to hold
>
>     {string: [list of integers]}
>     key:     values
>
> The dictionary is randomly accessed to add an integer to the list
> referenced by the string.
>
> I have been exploring several different caching strategies, ranging
> from bsddb3 to the metakit toolkit, for caching/retrieving these to
> disk.  One of the problems that I am facing is that because the
> dictionary is randomly accessed, once the database gets to a certain
> size performance goes WAY down with every strategy that I have tried.
> (Note that integers are never removed, if that helps.)  I have had
> success speed-wise with MySQL, but I simply cannot deliver that in my
> application; it's way too costly to administer.
>
> Has anyone examined a similar problem?  Thanks in advance.
>
> Brian Kelley

Here I go, shooting from the hip without knowing all of the facts. (To give a proper answer I really should know how many bytes each key/pointer pair takes, how many records the database will contain, and at what point you hit the performance wall.)

Assuming that data access is truly random, the least expensive way of storing and accessing (assuming you are going to roll your own and not go with an existing product) is to add all new records at the end of the file, mark -- but do not remove -- deleted records, and build indexes to access the records.

The trick is building and maintaining the indexes. The solution is to build a multi-level index structure that is not fully populated. Each index entry consists of three fields: the unique key for the record, the byte offset of the record in the file, and a one-byte indicator telling whether the entry addresses a lower-level index block or a data record. The data and each level of the index should live in separate files.

Assume that each such entry is 32 bytes long and that your OS block size is 8K. Rather than fully populating each block with 256 entries, put only 250 entries in each index block.

To build the index, create a sorted list for the entire database and populate the bottom level of the index from it. Next, build a sorted list consisting of the first entry in each block of that lower level and use it to build the second-level index. (Note that in this and any higher-level index block the indicator must be set to show that the entry points to an index block, not a data block.)

At this point consider the following: each block in the second-level index points to 250 blocks in the first-level index, each of which in turn points to 250 data records. If the second-level index is held in memory, each second-level block therefore lets you reach 250 x 250 = 62,500 records with no more than two disk accesses. If a third level of index is needed, you will be able to access any one of 250^3 = 15,625,000 records with no more than three disk accesses.

The reason for not fully populating the database and the indexes is to allow for expansion without total reorganization. The weak spot of the scheme is that once an index block's six spare slots are used up, the next insertion forces you to split the block -- that is, create a new index block and move half of the entries of the full block into it. In general this should only happen at the first level, but you must still make provision for either splitting or total index reorganization.

Hope this helps. Also, if you do implement it please contribute the code; to get you started, rough sketches of the build, the lookup, and the split follow below.

--
Jerry Gitomer
Once I learned how to spell DBA, I became one
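Since the original question was about Python, here is a minimal sketch of the build-and-lookup half of the scheme in that language. The 23-byte key / 8-byte offset / 1-byte indicator split is just one way to fill the 32-byte entry, and all the file names and helper names here are mine, not part of any existing library:

    import struct
    from bisect import bisect_right

    BLOCK_SIZE = 8192                  # the 8K OS block size assumed above
    ENTRY_SIZE = 32
    FILL = 250                         # 256 slots per block, 6 left free

    KEY_LEN = 23                       # 23 + 8 + 1 = 32-byte entries
    TO_DATA, TO_INDEX = 0, 1           # the one-byte indicator field
    ENTRY_FMT = "<%dsQB" % KEY_LEN     # key, byte offset, indicator

    def pack_entry(key, offset, indicator):
        # keys longer than KEY_LEN bytes would be silently truncated here
        return struct.pack(ENTRY_FMT, key.ljust(KEY_LEN, b"\0"),
                           offset, indicator)

    def unpack_block(raw):
        entries = []
        for i in range(0, len(raw), ENTRY_SIZE):
            key, offset, ind = struct.unpack_from(ENTRY_FMT, raw, i)
            if key == b"\xff" * KEY_LEN:       # hit the block's padding
                break
            entries.append((key.rstrip(b"\0"), offset, ind))
        return entries

    def build_level(children, indicator, path):
        """Write one index level, FILL entries per 8K block, and return
        the (first key, block offset) pairs for the level above."""
        parents = []
        with open(path, "wb") as f:
            for i in range(0, len(children), FILL):
                chunk = children[i:i + FILL]
                parents.append((chunk[0][0], f.tell()))
                block = b"".join(pack_entry(k, o, indicator)
                                 for k, o in chunk)
                f.write(block.ljust(BLOCK_SIZE, b"\xff"))   # pad to 8K
        return parents

    def build_index(sorted_records, basename):
        """sorted_records: (key, data-file offset) pairs sorted by key.
        Builds basename.0 (bottom level) upward; returns the level file
        names plus the small top level, which you hold in memory.
        (The sketch assumes at least one on-disk level results.)"""
        paths, entries, indicator = [], list(sorted_records), TO_DATA
        while len(entries) > FILL:
            path = "%s.%d" % (basename, len(paths))
            entries = build_level(entries, indicator, path)
            paths.append(path)
            indicator = TO_INDEX       # higher levels point at index blocks
        return paths, entries

    def lookup(key, top_entries, level_paths):
        """One disk read per on-disk index level, as promised above.
        Returns the record's byte offset in the data file, or None."""
        pos = bisect_right([k for k, _ in top_entries], key) - 1
        if pos < 0:
            return None                # sorts before every key we have
        offset = top_entries[pos][1]
        for path in reversed(level_paths):     # walk top level first
            with open(path, "rb") as f:
                f.seek(offset)
                entries = unpack_block(f.read(BLOCK_SIZE))
            pos = bisect_right([k for k, _, _ in entries], key) - 1
            k, offset, ind = entries[pos]
            if ind == TO_DATA:
                return offset if k == key else None
        return None

For 100,000 records this yields two levels on disk plus a two-entry top level in memory, so any record is reached in two disk reads:

    records = sorted((b"key%06d" % i, i * 64) for i in range(100000))
    paths, top = build_index(records, "kelley.idx")
    print(lookup(b"key001234", top, paths))    # -> 78976, two disk reads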
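The split that the last paragraph warns about is simple once an index block lives in memory as a sorted list of (key, offset, indicator) tuples. A sketch of just that piece -- the caller still has to write the new block out to the level file and post its first key to the parent block, which I leave out:

    from bisect import insort

    SLOTS_PER_BLOCK = 8192 // 32       # 256 slots in an 8K block

    def insert_entry(block, entry):
        """Insert (key, offset, indicator) into a sorted block; on
        overflow -- the six spare slots left by the 250/256 fill are
        gone -- keep the lower half here and hand back the upper half
        as the new block. Returns None when no split was needed."""
        insort(block, entry)           # tuples compare by key first
        if len(block) <= SLOTS_PER_BLOCK:
            return None
        mid = len(block) // 2
        new_block = block[mid:]        # half of the entries move out
        del block[mid:]
        return new_block

    block = [(b"k%04d" % i, i, 0) for i in range(0, 512, 2)]  # 256: full
    spill = insert_entry(block, (b"k0111", 111, 0))
    # spill holds the upper half; its first key goes into the parent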