[elephant-devel] working with many millions of objects

Ian Eslick eslick at csail.mit.edu
Thu Oct 12 01:12:48 UTC 2006


Can you tell me a little bit about what the import operations look like
- that is to say, how many objects are created per transaction, how many
slots per created object, etc.?  Things are only cached in memory during
a transaction.  To ensure ACID properties (unless you've turned off
synchronization), every transaction should flush to disk just prior to
completion.  It sounds almost like you're doing one giant transaction, or
perhaps I have the scale wrong and it's the BTree's cached index memory
that is eating up all your working memory.
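
For what it's worth, here is a minimal, untested sketch of the kind of
batched import I have in mind.  It assumes a store already opened with
something like (elephant:open-store '(:BDB "/path/to/db")); the class
IMPORT-RECORD, its slots, and the shape of the ROWS argument are
hypothetical stand-ins for your actual data:

;; Create persistent objects in fixed-size batches, one transaction
;; per batch, so each commit flushes a bounded amount of work to disk
;; instead of holding the whole import in a single transaction.
(defclass import-record ()
  ((key   :initarg :key   :accessor record-key)
   (value :initarg :value :accessor record-value))
  (:metaclass elephant:persistent-metaclass))

(defun import-rows (rows &key (batch-size 1000))
  "Instantiate one IMPORT-RECORD per element of ROWS, committing a
transaction after every BATCH-SIZE creations."
  (loop while rows
        do (elephant:with-transaction ()
             (loop repeat batch-size
                   while rows
                   do (let ((row (pop rows)))
                        (make-instance 'import-record
                                       :key   (first row)
                                       :value (second row)))))))

Something shaped like that keeps each transaction's working set (and
lock count) bounded by BATCH-SIZE rather than by the whole dataset.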

The working set is the number of distinct 'pages' touched during a
transaction (or set of transactions).  In Elephant, every unique slot
access can hit a different page, but slot accesses that are nearby in
the BTree index may share storage.

However, import by its very nature is a linear operation: there is
(roughly) no locality, since every record is new, so you'll be allocating
lots of new pages and rebalancing the BTrees quite a bit.  Until I have
a better sense of how you are using transactions it's hard to be more
helpful.  My own DB is about 6 GB, but I've built it up over a long time
with a lot of large records.

Thanks,
Ian



Red Daly wrote:
> I was importing into SleepyCat using standard Elephant routines.  I am
> not aware of an 'import mode' for SleepyCat, but I will look into that
> when I have a chance.  Another consideration with SleepyCat is that
> using BTrees with a large working set demands large amounts of memory
> relative to a Hash representation.  I am unfamiliar with the internals
> of Elephant and SleepyCat, but it feels like the basic access method
> is restricting performance, which seems to be described here:
> http://www.sleepycat.com/docs/gsg/C/accessmethods.html#BTreeVSHash
>
> My problem so far has been importing the data, which goes very fast
> until SleepyCat requires extensive disk access.  The in-memory rate is
> reasonable and would let the import complete in a few hours.  However,
> once disk operations begin, the import speed suggests it would take
> many days to complete.  I have yet to perform extensive benchmarks, but
> I estimate the instantiation rate drops from about 1800 persistent-class
> instantiations per second to 120 per second.
>
> Here are the two approaches that I hypothesize may help performance.
> I am admittedly unaware of the innards of the two systems in question,
> so you expert developers will know best.  If either sounds appropriate,
> or you envision another possibility for allowing this kind of scaling,
> I will look into implementing such a system.
>
> 1.  Decreasing the size of the working set is one possibility for
> reducing run-time memory requirements and disk access.  I'm not sure
> how the concept of a 'working set' translates from the SleepyCat world
> to the Elephant world, but perhaps you do.
>
> 2.  Using a Hash instead of a BTree in the primary database?  I am
> unsure what this means for Elephant.
>
> In the meantime I will depart from the every-class-is-persistent
> approach and also use more traditional data structures.
>
> Thanks again,
> Red Daly
>
>
>
> Robert L. Read wrote:
>> Yes, it's amusing.
>>
>> In my own work I use the Postgres backend; I know very little about
>> SleepyCat.  It seems to me that this is more of a SleepyCat issue
>> than an Elephant issue.  Perhaps you should ask the SleepyCat list?
>>
>> Are you importing things into SleepyCat directly, in the correct
>> serialization format so that they can be read by Elephant?  If so,
>> I assume it is just a question of solving the SleepyCat problems.
>>
>> An alternative would be to use the SQL-based backend.  However, I
>> doubt this will solve your problem, since at present we (well, I
>> wrote it) use a very inefficient serialization scheme for the
>> SQL-based backend that base64-encodes everything.  This has the
>> advantage of working trouble-free with different database backends,
>> but it could clearly be improved upon.  It is more than efficient
>> enough for all my work, though, and at present nobody is clamoring
>> to have it improved.
>>
>> Is your problem importing the data or using it once it is imported?
>> It's hard for me to imagine a dataset so large that even the import
>> time is a problem.  Suppose it takes 24 hours: can you not afford to
>> pay that?
>>
>> A drastic and potentially expensive measure would be to switch to a
>> 64-bit architecture with a huge memory.  I intend to do that when
>> forced by performance issues in my own work.
>>
>>
>>
>> On Tue, 2006-10-10 at 00:46 -0700, Red Daly wrote:
>>> I will be running experiments in informatics and modeling in the
>>> future that may involve (tens or hundreds of) millions of objects.
>>> Given the ease of use of Elephant so far, it would be great to use
>>> it as the persistent store and avoid creating too many custom data
>>> structures.
>>>
>>> I have recently run up against some performance bottlenecks when
>>> using Elephant to work with very large datasets (in the hundreds of
>>> millions of objects).  Using SleepyCat, I am able to import data
>>> very quickly with a DB_CONFIG file with the following contents:
>>>
>>> set_lk_max_locks 500000
>>> set_lk_max_objects 500000
>>> set_lk_max_lockers 500000
>>> set_cachesize 1 0 0
>>>
>>> I can import data very quickly until the 1 GB cache is too small to
>>> allow complete in-memory access to the database.  At that point it
>>> seems that disk I/O makes additional writes happen much more slowly.
>>> (I have also tried increasing the 1 GB cache size, but the database
>>> fails to open if the cache is too large, e.g. 2 GB.  I have 1.25 GB
>>> of physical memory and 4 GB of swap, so the constraint seems to be
>>> physical memory.)  The set_lk_max_* lines allow transactions to
>>> contain hundreds of thousands of individual locks, easing the
>>> transaction-throughput bottleneck.
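>>>
>>> (As an untested aside, more a sketch than a recommendation: Berkeley
>>> DB's set_cachesize takes three values, gigabytes, bytes, and a number
>>> of cache regions, so a DB_CONFIG line such as
>>>
>>> set_cachesize 2 0 4
>>>
>>> would ask for a 2 GB cache split into four regions, which can succeed
>>> where a single contiguous region cannot be allocated.  With only
>>> 1.25 GB of physical memory it may well still fail.)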
>>>
>>> What are the technical restrictions on writing several million
>>> objects to the datastore?  Is it feasible to create a batch import
>>> feature to allow large datasets to be imported using reasonable
>>> amounts of memory for a desktop computer?
>>>
>>> I hope this email is at least amusing!
>>>
>>> Thanks again,
>>> red daly
>
> _______________________________________________
> elephant-devel site list
> elephant-devel at common-lisp.net
> http://common-lisp.net/mailman/listinfo/elephant-devel


