[rucksack-devel] Re: Fwd: State of the nation and heap patch

Mon Feb 11 13:48:10 UTC 2008

Cyrus Harmon wrote:

> Yeah, the biggest performance problem I have is importing items from
> the NCBI taxonomy database which consists of organism name, id, etc...
> arranged into a tree of a million or so objects. I'll package up some
> sort of release of this and circulate the URL to the list. Right now
> it takes a few hours to import a million objects or so. It would be
> nice to get this down to a few minutes.

I did some work on improving Rucksack performance last week, using
various ways of importing a 43 MB XML file (Jim Breen's Japanese
dictionary at http://ftp.cc.monash.edu.au/pub/nihongo/JMdict.gz,
which is basically also "a tree of a million or so objects") as
test cases.

The most important changes are that p-btrees no longer use persistent
conses to represent bindings, that the default cache no longer uses a
queue to keep track of most-recent-use information, and that Rucksack
now looks at the basic slot index information before it even starts
digging into the btrees.
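
To illustrate the cache change in general terms: one common way to track
recency without a queue is to stamp each entry with a counter on every
access and to scan for the smallest stamp only when something has to be
evicted.  The sketch below is a generic Python illustration under that
assumption.  It is not Rucksack's actual code, and the class and method
names (StampedCache, get, put) are made up for this example.

```python
class StampedCache:
    """Bounded cache that records recency as a per-entry tick, not a queue."""

    def __init__(self, capacity):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._tick = 0
        self._entries = {}  # key -> [value, last_used_tick]

    def _next_tick(self):
        # Monotonically increasing counter used as a cheap timestamp.
        self._tick += 1
        return self._tick

    def get(self, key, default=None):
        entry = self._entries.get(key)
        if entry is None:
            return default
        # Touching an entry is a single assignment; nothing is reordered.
        entry[1] = self._next_tick()
        return entry[0]

    def put(self, key, value):
        if key not in self._entries and len(self._entries) >= self.capacity:
            self._evict_oldest()
        self._entries[key] = [value, self._next_tick()]

    def _evict_oldest(self):
        # Linear scan for the smallest tick, paid only when the cache is full.
        oldest = min(self._entries, key=lambda k: self._entries[k][1])
        del self._entries[oldest]

    def __contains__(self, key):
        return key in self._entries

    def __len__(self):
        return len(self._entries)
```

The trade-off in this sketch is that every hit costs one assignment,
while eviction costs a scan over the cached entries.  Whether that is a
win depends on how often the cache fills up compared to how often it is
read.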

These changes improved both overall performance and peak memory usage
for my test cases by factors ranging from 2 to 20.  To give
you a rough idea: one representative test case (with class indexes on
the 5 most frequently used classes and string indexes on 4 slots,
resulting in a 300 MB rucksack) now takes 18 minutes on my machine.

I'd be interested to know what kind of performance improvements you
see with the new version (0.1.16).

By the way: turning off the Rucksack garbage collector when importing
large amounts of data is also a good idea.  But you knew that already...

Arthur