[elephant-devel] Representational Question
Ian Eslick
eslick at media.mit.edu
Fri Mar 7 03:32:03 UTC 2008
Robert makes an excellent point. For datasets that fit in memory,
caching objects and slot values in memory makes the use of lisp as a
query language really easy.
Another (unreleased) prevalence-like facility in Elephant:
In src/contrib/eslick/snapshot-set.lisp there is a simple object caching
model that works for non-persistent objects. It allows you to register
objects with a special hash as 'root' objects. This hash can be saved
and restored, and it stores the root objects and all objects 'reachable'
from the root set. The notion of reachable can be overloaded, but for
now it's defined recursively: any standard object or hash table stored
in a slot of a reachable object is itself reachable. The whole
snapshot-set concept is about 300 lines of code, so it's pretty easy to
read as an example.
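To give a feel for what 'reachable' means here, the walk looks roughly
like the following (just an illustrative sketch, not the actual
snapshot-set code; it assumes closer-mop is loaded for portable slot
introspection):

;; Illustrative sketch only -- not the code in snapshot-set.lisp.
;; Starting from the root objects, recurse into every slot value that
;; is itself a standard-object or a hash-table.
(defun collect-reachable (roots)
  (let ((seen (make-hash-table :test 'eq)))
    (labels ((walk (obj)
               (typecase obj
                 (hash-table
                  (unless (gethash obj seen)
                    (setf (gethash obj seen) t)
                    (maphash (lambda (k v) (walk k) (walk v)) obj)))
                 (standard-object
                  (unless (gethash obj seen)
                    (setf (gethash obj seen) t)
                    (dolist (slot (closer-mop:class-slots (class-of obj)))
                      (let ((name (closer-mop:slot-definition-name slot)))
                        (when (slot-boundp obj name)
                          (walk (slot-value obj name))))))))))
      (mapc #'walk roots)
      seen)))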
A potential proposal:
It's also fairly easy to add a special cached-persistent-slot which
caches its values in memory and implements a write-through policy. This
lets you keep all your slot accesses in memory (making object-based
search very efficient) while still exploiting the on-disk BTrees for
indexing when you need to.
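To make the write-through idea concrete, here's a rough sketch with
made-up class and accessor names, written as plain accessors over a
transient slot plus a persistent slot rather than the MOP-level slot
type I'm proposing:

;; Sketch of the write-through policy, not a real cached-persistent-slot.
(ele:defpclass cached-widget ()
  ((name-cache :accessor %name-cache :initform nil :transient t) ; in-memory copy
   (name       :accessor %name       :initform nil)))            ; persistent slot

(defmethod widget-name ((w cached-widget))
  ;; Serve reads from the in-memory cache, filling it from the store on
  ;; the first access (NIL is treated as 'not cached yet', fine for a sketch).
  (or (%name-cache w)
      (setf (%name-cache w) (%name w))))

(defmethod (setf widget-name) (value (w cached-widget))
  ;; Write-through: update the cache and the persistent slot together.
  (setf (%name-cache w) value
        (%name w) value)
  value)

A real version would push this into the metaclass so every slot access
goes through the cache automatically.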
You'd have to think through the implications of this strategy, though.
It works great if your data is read-only or only operated on in one
thread. If your read-oriented algorithms can tolerate some incoherence
(the slot value may be changed by another thread at any time), then you
can ignore threading issues.
(Hmmm... one hack might be to force a database read of cached slots
when you are inside a transaction, so you can guarantee that any write
to that page in a parallel transaction results in a restart. If you are
just doing auto-commit, the read goes to the cached value.)
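A rough sketch of that reader, with the caveat that the in-transaction
test is an assumption about Elephant internals (*current-transaction*
may not be an exported, documented interface in your version):

;; Variation on the reader above: inside a transaction, force a store
;; read so the page access is visible to the conflict/restart machinery;
;; outside one, serve the cached value.  ELE::*CURRENT-TRANSACTION* is an
;; assumption about internals, not a documented interface.
(defmethod widget-name ((w cached-widget))
  (if ele::*current-transaction*
      (setf (%name-cache w) (%name w))   ; forced read, refresh the cache
      (or (%name-cache w)
          (setf (%name-cache w) (%name w)))))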
Ian
On Mar 6, 2008, at 10:02 PM, Robert L. Read wrote:
> On Thu, 2008-03-06 at 10:10 -0500, Ian Eslick wrote:
>> I agree with Robert. The best way to start is to use lisp as a query
>> language and essentially do a search/match over the object graph.
>>
>> The rub comes when you start looking at performance. A linear scan of
>
> I neglected to mention that in my use of Elephant, when I was
> attempting to run a commercial website, I was using the Data
> Collection Management (DCM) stuff that you can find in the
> contrib/rread directory of the project.
>
> This system provides strategy-based directors. That is, there is a
> basic factory object for each collection of objects that implements
> basic Create, Read, Update, Delete operations.
>
> When you initialize a director, you specify a storage strategy:
>
> *) In-memory hash, (no persistence, for transient objects)
> *) Elephant (no caching)
> *) Cache backed by Elephant (read in memory, with writes immediately
> flushed to the store)
> *) Generational system, in which each generation can have its own
> storage strategy.
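
[Sketch of the director/strategy shape Robert describes -- made-up
names, not DCM's actual API: one director per collection, with the CRUD
operations dispatched through a pluggable strategy object.]

(defclass director ()
  ((strategy :initarg :strategy :accessor director-strategy)))

;; CRUD protocol, dispatched on the strategy.
(defgeneric store-object  (strategy key obj))
(defgeneric lookup-object (strategy key))
(defgeneric delete-object (strategy key))

;; The simplest strategy: an in-memory hash, no persistence.
(defclass hash-strategy ()
  ((table :initform (make-hash-table :test 'equal) :reader strategy-table)))

(defmethod store-object ((s hash-strategy) key obj)
  (setf (gethash key (strategy-table s)) obj))

(defmethod lookup-object ((s hash-strategy) key)
  (gethash key (strategy-table s)))

(defmethod delete-object ((s hash-strategy) key)
  (remhash key (strategy-table s)))

;; A write-through strategy backed by Elephant would implement the same
;; three generics against a persistent BTree plus a cache.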
>
> Everything Ian wrote in the last email about scanning and locality of
> reference makes perfect sense, but assumes that you don't have every
> object cached. That approach is therefore not very "Prevalence"-like
> in its performance, but is very "Prevalence"-like in its convenience.
> Using DCM, or any other caching scheme where most of the objects are
> cached, tends to give you the performance described in the IBM article
> on Prevalence that I referenced.
>
> However, DCM was written BEFORE Ian got the class indexing and
> persistence working. DCM is not nearly as pretty and clean as the
> persistent classes. You end up having to make storage decisions
> yourself.
>
> A perfect system might be persistent classes with really excellent
> control over the caching/write-updating policy.
>
> For any application, I would recommend using Ian's persistent classes
> in the beginning stages of a project, and then, when your performance
> tests reveal you have a problem, consider at that point whether to add
> indexes, move to explicitly keeping a class in memory, or some other
> solution.
>
>