[elephant-devel] working with many millions of objects

Robert L. Read read at robertlread.net
Thu Oct 12 01:14:36 UTC 2006


On Wed, 2006-10-11 at 17:57 -0700, Red Daly wrote:

> I was importing into SleepyCat using standard Elephant routines.  I am 
> not aware of an 'import mode' for SleepyCat, but I will look into that 
> when I have a chance. 

I do not think there is any way to import things directly and deal with
Elephant's serialization at the same time.  I think this is a dead end.

>  Another consideration when using SleepyCat is that using BTrees with a 
> large working set demands large amounts of memory relative to a Hash 
> representation.  I am unfamiliar with the internals of Elephant and 
> SleepyCat, but it feels like the basic access method is restricting 
> performance, which seems to be described here:
> http://www.sleepycat.com/docs/gsg/C/accessmethods.html#BTreeVSHash
> 
> My problem so far has been importing the data, which goes very fast 
> until SleepyCat requires extensive disk access.  The in-memory rate is 
> reasonable and would complete in a few hours.  However, once disk 
> operations begin, the import speed suggests it would take many days to 
> complete.  I have yet to perform extensive benchmarks, but I estimate 
> the instantiation rate shifts from 1800 persistent-class instantiations 
> per second to 120 per second.
> 
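
(For what it's worth, the arithmetic bears that out: at 120 instances
per second, 100 million objects would take 100,000,000 / 120, or about
830,000 seconds, roughly ten days.)
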
> Here are the two approaches that I hypothesize may help performance.  I 
> am admittedly unaware of the innards of the two systems in question, so 
> you expert developers will know best.  If either sounds appropriate, or 
> you envision another possibility for allowing this kind of scaling, I 
> will look into implementing such a system.
> 
> 1.  Decreasing the size of the working set is one possibility for 
> decreasing run-time memory requirements and disk access.  I'm not sure 
> how the concept of a 'working set' translates from the SleepyCat world 
> to the Elephant world, but perhaps you do.

Elephant keeps very little in memory (unless you used my DCM module in
the contrib directory).  So I have to admit I don't quite understand why
an import operation would suddenly shift to more disk access.  Now,
doing a simple import of a lot of data WOULD tend to leave objects in
memory that could not be garbage collected, if you weren't careful.
That could be solved either by making SURE you don't keep a reference to
the objects, or by doing your import in large batches (restarting the
Lisp process in between).  But I am just guessing --- do you feel
comfortable posting your code, or pseudocode, for how you do the import?
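
To make the batching idea concrete, here is a rough, untested sketch of
the shape I mean.  The class, slot, and batch size are invented for
illustration; the point is only that each batch commits in its own
transaction and that the new instances are never stored anywhere
reachable, so they can be garbage collected:

(use-package :elephant)

;; A made-up persistent class, just for the sketch.
(defclass imported-record ()
  ((payload :initarg :payload))
  (:metaclass persistent-metaclass))

(defun import-in-batches (records &key (batch-size 10000))
  "Create one persistent instance per element of the list RECORDS,
committing every BATCH-SIZE creations so no single transaction
holds too many locks."
  (loop while records
        do (with-transaction ()
             (loop repeat batch-size
                   while records
                   ;; Deliberately discard the instance: holding on to
                   ;; the return value is what prevents GC.
                   do (make-instance 'imported-record
                                     :payload (pop records))))))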

> 
> 2.  Using a Hash instead of a BTree in the primary database?  I am 
> unsure what this means for Elephant.

I don't think you would make progress with that.  Elephant depends very
deeply on the BTree and, as far as I know, so does SleepyCat.
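
One way to see why: Elephant's indices are built on ordered traversal
and range queries, which the BTree access method provides and a Hash
does not.  A tiny, untested sketch, assuming the standard Elephant
btree API (MAKE-BTREE, GET-VALUE, MAP-BTREE) and an open store:

(use-package :elephant)

(let ((bt (make-btree)))
  (setf (get-value 3 bt) "three"
        (get-value 1 bt) "one"
        (get-value 2 bt) "two")
  ;; MAP-BTREE visits entries in key order -- the property that
  ;; range queries such as GET-INSTANCES-BY-RANGE depend on, and
  ;; one a Hash representation cannot offer.
  (map-btree (lambda (k v) (format t "~a -> ~a~%" k v)) bt))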

> 
> In the meantime I will depart from the every-class-is-persistent 
> approach and also use more traditional data structures.
> 
> Thanks again,
> Red Daly
> 
> 
> 
> Robert L. Read wrote:
> > Yes, it's amusing.
> >
> > In my own work I use the Postgres backend; I know very little about 
> > SleepyCat.  It seems to me that this is more of a SleepyCat issue 
> > than an Elephant issue.  Perhaps you should ask the SleepyCat list?
> >
> > Are you importing things into SleepyCat directly, in the correct 
> > serialization format, so that the records can be read by Elephant?  
> > If so, I assume it is just a question of solving the SleepyCat 
> > problems.
> >
> > An alternative would be to use the SQL-based backend.  However, I 
> > doubt this will solve your problem, since at present we (well, I 
> > wrote it) use a very inefficient serialization scheme for the 
> > SQL-based backend that base64-encodes everything.  This has the 
> > advantage of working trouble-free with different database backends, 
> > but could clearly be improved upon.  Still, it is more than efficient 
> > enough for all my work, and at present nobody is clamoring to have it 
> > improved.
> >
> > Is your problem importing the data, or using it once it is imported?  
> > It's hard for me to imagine a dataset so large that even the import 
> > time is a problem --- suppose it takes 24 hours --- can you not 
> > afford to pay that?
> >
> > A drastic and potentially expensive measure would be to switch to a 
> > 64-bit architecture with a huge memory.  I intend to do that when 
> > forced by performance issues in my own work.
> >
> >
> >
> > On Tue, 2006-10-10 at 00:46 -0700, Red Daly wrote:
> >> I will be running experiments in informatics and modeling in the 
> >> future that may contain (tens or hundreds of) millions of objects.  
> >> Given the ease of use of Elephant so far, it would be great to use 
> >> it as the persistent store and avoid creating too many custom data 
> >> structures.
> >>
> >> I have recently run up against some performance bottlenecks when 
> >> using Elephant to work with very large datasets (in the hundreds of 
> >> millions of objects).  Using SleepyCat, I am able to import data 
> >> very quickly given a DB_CONFIG file with the following contents:
> >>
> >> set_lk_max_locks 500000
> >> set_lk_max_objects 500000
> >> set_lk_max_lockers 500000
> >> set_cachesize 1 0 0
> >>
> >> I can import data very quickly until the 1 GB cache is too small to 
> >> allow complete in-memory access to the database.  At this point it 
> >> seems that disk IO makes additional writes happen much more slowly.  
> >> (I have also tried increasing the 1 GB cache size, but the database 
> >> fails to open if it is too large, e.g. 2 GB.  I have 1.25 GB of 
> >> physical memory and 4 GB of swap, so the constraint seems to be 
> >> physical memory.)  The max_lock etc. lines allow transactions to 
> >> contain hundreds of thousands of individual locks, limiting the 
> >> transaction-throughput bottleneck.
> >>
> >> What are the technical restrictions on writing several million objects 
> >> to the datastore?  Is it feasible to create a batch import feature to 
> >> allow large datasets to be imported using reasonable amounts of memory 
> >> for a desktop computer?
> >>
> >> I hope this email is at least amusing!
> >>
> >> Thanks again,
> >> red daly
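
P.S.  On the cache that fails to open at 2 GB: I know very little about
SleepyCat, but if I read its documentation correctly, the third argument
to set_cachesize is the number of cache regions, and a cache too large
to allocate contiguously on a 32-bit system must be split across more
than one region.  So something like

set_cachesize 2 0 2

might at least let the environment open.  Though with 1.25 GB of
physical memory, a cache larger than RAM will just page to disk, so I
would not expect it to help throughput.  Treat this as a guess to check
against the SleepyCat docs, not a fix.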