[rucksack-devel] Re: Fwd: State of the nation and heap patch

Cyrus Harmon ch-rucksack at bobobeach.com
Mon Feb 11 18:22:29 UTC 2008


Hi Arthur,

Thanks for these changes. My initial tests suggest that the new
version is faster, but not overwhelmingly so. I'll try to do some more
rigorous benchmarking (and profiling) and see what I can come up with.

My initial profiling efforts suggest that we're spending too much time
looking up indexed entries in the b-tree caches. I would have thought
that the time spent looking in the b-tree caches would be less than
the time spent actually consing up the object and doing whatever else
PCL needs to do, BICBW. More later today or tomorrow.

Thanks again,

Cyrus

On Feb 11, 2008, at 5:48 AM, Arthur Lemmens wrote:

> Cyrus Harmon wrote:
>
>> Yeah, the biggest performance problem I have is importing items from
>> the NCBI taxonomy database which consists of organism name, id, etc...
>> arranged into a tree of a million or so objects. I'll package up some
>> sort of release of this and circulate the URL to the list. Right now
>> it takes a few hours to import a million objects or so. It would be
>> nice to get this down to a few minutes.
>
> I did some work on improving Rucksack performance last week, using
> various ways of importing a 43 MB XML file (Jim Breen's Japanese
> dictionary at http://ftp.cc.monash.edu.au/pub/nihongo/JMdict.gz,
> which is basically also "a tree of a million or so objects") as
> test cases.
>
> The most important changes are that p-btrees don't use persistent
> conses anymore to represent bindings, that the default cache now
> doesn't use a queue to keep track of most-recent-use information
> and that Rucksack looks at the basic slot index information before
> it even starts digging into the btrees.
>
> These changes improved the overall performance and the maximum memory
> usage for my test cases by factors varying between 2 and 20.  To give
> you a rough idea: one representative test case (with class indexes on
> the 5 most frequently used classes and string indexes on 4 slots,
> resulting in a 300 MB rucksack) now takes 18 minutes on my machine.
>
> I'd be interested to know what kind of performance improvements you
> see with the new version (0.1.16).
>
> By the way: turning off the Rucksack garbage collector when importing
> large amounts of data is also a good idea.  But you knew that already...
>
> Arthur
>



