[elephant-devel] Data Collection Management/Prevalence.

Robert L. Read read at robertlread.net
Sat May 17 01:37:52 UTC 2008


On Fri, 2008-05-16 at 11:26 -0400, Ian Eslick wrote:
> Prevalence is much, much faster because you don't have to flush data  
> structures on each commit, so cl-prevalence performance with  
> Elephant's data and transaction abstractions would be a really nice  
> design point.  I wonder if we would get some of this benefit from a  
> Rucksack adaptation? 

I'd like to take this opportunity to explain something.

We already have a "prevalence-ish" auxiliary system in the
contrib/rread/dcm directory.

This implements what you might call "prevalence"-style caching ---
everything is stored in a hashtable, and any writes or additions are
written immediately to Elephant (which ends up doing the write I/O)
before control is returned.  It's also thread-safe, I think.
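As a sketch of that write-through idea (with illustrative names, not
DCM's actual API, and a plain hash table standing in for the Elephant
btree that really does the persistent writes):

```lisp
;; Illustrative sketch of write-through caching, not DCM's real code.
;; A second hash table stands in for the Elephant btree backing store.
(defclass write-through-cache ()
  ((table :initform (make-hash-table :test 'equal) :accessor cache-table)
   (store :initform (make-hash-table :test 'equal) :accessor cache-store)))

(defmethod lookup ((cache write-through-cache) key)
  "Reads are served entirely from the in-memory table."
  (gethash key (cache-table cache)))

(defmethod store-value ((cache write-through-cache) key value)
  "Writes hit the backing store before control returns, then the table."
  (setf (gethash key (cache-store cache)) value)   ; the slow, persistent write
  (setf (gethash key (cache-table cache)) value))  ; the in-memory copy
```

The point is only the ordering: the persistent write completes before
control is returned, so the cache can never claim data the store lacks.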

I wrote it right after I got the CL-SQL backend working.  That is in
fact part of the reason that I never worried about making the CL-SQL
backend faster -- the caching took care of almost all of my needs.  I
didn't mind paying for the writes, since each one typically was in
response to a human being clicking a browser button, and the writes were
certainly faster than that.

I called this system "DCM" for Data Collection Management.  In fact it
implements what you might call a Tier 2 or "Business Object" cache.  It
writes objects directly to btrees, and creates its own keys.  In a way,
it does what Ian's class-based persistence does.

However, I haven't touted it because it isn't very good.  It has the
following drawbacks:

1)  I was relatively new to Lisp when I wrote it,
2)  It makes no use of Ian's new stuff (which is newer than DCM),
3)  It uses SBCL-specific locking,
4)  It does not have a lot of tests,
5)  It does not allow limits on the cache size --- it assumes you have
enough memory for those classes for which you use the in-memory
caching strategy,
6)  I think its use of btrees could be pretty bad,
7)  It might not run now, as it has not been under test for a while.

However, the one thing it does really well, that I don't think we have
at the current level, is object-level caching.  (If I am wrong about
that, please enlighten me.)

I haven't looked at Rucksack yet, but I would venture that if we wanted
to write a Prevalence-style system, we could examine DCM and either
improve it or take some of its approach as a starting point.

One way in which this would differ from Prevalence is that this is a
"write-through cache", not a "journaling transaction system."  A true
journaling system would have less I/O and would allow more control over
the checkpointing strategy.

One stylistic point that I'm undecided on is which of these styles is
better:

1)  You have an explicit "manager" or "director" that is responsible for
Create, Read, Update, Delete (CRUD) operations on a class of objects.
The managed class is not itself inherently persistent; it is persisted
when you call "register" on the managed object via the "manager".  When
you instantiate a manager, you specify a caching strategy by subclassing
a manager class that follows a particular strategy.
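A minimal sketch of that manager style (the names here are
illustrative, not DCM's actual exports, and a real persistent manager
subclass would write to a btree where this one only keeps a table):

```lisp
;; Illustrative "manager" pattern: the managed class is not itself
;; persistent; the manager assigns keys and performs the CRUD operations.
(defclass manager ()
  ((next-key :initform 0 :accessor next-key)
   (objects  :initform (make-hash-table) :accessor objects)))

(defmethod register ((mgr manager) object)
  "Assign a fresh key and take ownership of OBJECT; return the key.
A persistent subclass would also write OBJECT to its btree here."
  (let ((key (incf (next-key mgr))))
    (setf (gethash key (objects mgr)) object)
    key))

(defmethod lookup-object ((mgr manager) key)
  (gethash key (objects mgr)))
```

Choosing a caching strategy then amounts to choosing which manager
subclass you instantiate, rather than redefining the managed class.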

2)  You use the MOP to make a class really intelligent, let it be
re-defclassed with different settings, and implement lots of slot
keywords to say which slots are transient, persistent, etc.  In
general you think of the "class", when treated as an actual data value
as it is in Lisp, as responsible for caching and persistence, thus
doing the same job that the "manager" does in the other model.
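For comparison, the class-centric style looks roughly like Elephant's
defpclass, where the class definition itself marks slots as persistent
or transient (this is a sketch from memory, so check the slot-option
syntax against the manual):

```lisp
;; Sketch of the class-centric style: the slot options themselves
;; declare persistence.  Syntax from memory; verify against Elephant.
(defpclass user ()
  ((name    :initarg :name    :accessor user-name)   ; persistent by default
   (scratch :initarg :scratch :accessor user-scratch
            :transient t)))                          ; in-memory only
```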

What Elephant has now is the latter; DCM is the former.  DCM will be
familiar to your typical Java/C# software engineer.

Given all of the discussion around performance, it is hard for me to
personally sort out how important performance really is, and how best to
get it, not because it is particularly hard, but because we don't seem
to have anybody with an urgent use-case, and because object-level
caching is so effective (at least for me).

So I'll go out on a limb and say that offering object-level caching is
the single biggest performance enhancement we could make for the most
common cases.

If we agree with that, then we can start to imagine how we would most
like to implement this.

I personally don't have any objection to the "manager" pattern as a
discrete object.  However, I think most of our users are happier with
the "defpclass" approach.  So I think the ideal situation would be to
expand the "defpclass" macro to allow one to specify a caching strategy,
and the parameters that control such a strategy.  Whether this should be
slot-based or object-based or some combination, I'm not sure.
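Purely as a strawman for what such an expanded defpclass might accept
(none of this exists today; the :cache class option and its parameters
are invented here for illustration):

```lisp
;; Hypothetical syntax only: a :cache class option naming a caching
;; strategy and its parameters.  Nothing like this exists in Elephant.
(defpclass session ()
  ((user  :initarg :user  :accessor session-user)
   (state :initarg :state :accessor session-state))
  (:cache :write-through :max-objects 10000))
```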

It ought to be quite simple to create some performance tests that
clarify the performance of this approach.

However, I don't know if this is more important than a native-lisp
backend, or a query-language.  For the next year at least I am working
at a job rather than working on my lisp application; and even then I was
happy with the performance I was getting out of DCM.  So I personally
don't have a performance need that drives anything.  I wish I knew how
many new users we would have from better performance vs. a native-lisp
backend vs. a query-language, or what our existing users would prefer.


More information about the elephant-devel mailing list