[elephant-devel] Rucksack and Elephant
Ian Eslick
eslick at csail.mit.edu
Sun Jun 4 06:26:34 UTC 2006
I distracted myself this afternoon by writing a cached binary file and
buffer library with serializer as a potential step towards a native
backend for Elephant. As I was contemplating some design decisions, I
was curious how Arthur Lemmens made similar trade-offs in Rucksack,
motivating me to give his code a good read. That experience prompted
the following comparison.
(Rucksack is described in detail here:
http://weitz.de/eclm2006/rucksack-eclm2006.txt)
At present Elephant is fully functional and has been tested and used
extensively in several demanding applications. Rucksack is not yet
operational, but has a critical mass of code written for all
functionality and has some architectural features worth keeping an eye
on. The most exciting feature, of course, is that Rucksack is written
entirely in mostly portable Common Lisp!
Serialization:
Both systems take a similar approach to binary serialization and
should perform similarly.
Persistent object storage:
Rucksack and Elephant handle persistent objects very differently. In
Elephant, every slot has a serialized descriptor (oid:class:slotname)
that is used as a key to store all slot values in one large BDB BTree.
The object oid is stored in class instances and used, along with class
and slot names to index into the on-disk BTree to retrieve or overwrite
a value.
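To make the slot-keyed layout concrete, here is a minimal sketch. It is
not Elephant's code or its actual key encoding: a Python dict stands in
for the BDB BTree, pickle stands in for the serializer, and the helper
names (slot_key, write_slot, read_slot) are made up for illustration.

```python
# Illustrative only: a dict stands in for the single large BDB BTree,
# and pickle stands in for Elephant's binary serializer.
import pickle

store = {}  # hypothetical stand-in for the on-disk BTree


def slot_key(oid, class_name, slot_name):
    """Build an oid:class:slotname descriptor (exact format assumed)."""
    return f"{oid}:{class_name}:{slot_name}".encode()


def write_slot(oid, class_name, slot_name, value):
    # Only this one slot value is serialized and overwritten.
    store[slot_key(oid, class_name, slot_name)] = pickle.dumps(value)


def read_slot(oid, class_name, slot_name):
    return pickle.loads(store[slot_key(oid, class_name, slot_name)])


write_slot(42, "person", "name", "Ian")
write_slot(42, "person", "age", 30)
write_slot(42, "person", "age", 31)  # overwrite in place, one record touched
```

The point of the sketch is that a slot write touches exactly one
record, and the object itself is never serialized as a unit.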
In Rucksack, object OIDs index a large vector which contains the current
on-disk location of the serialized objects. On slot-writes, a new
instance of the object is written to disk. On transaction commit, the
vector pointer is updated. This requires Rucksack to commit to garbage
collection in order to reclaim stored objects (something Elephant
doesn't do as BDB handles transaction logging differently and does
writes in place). However, the Rucksack choice provides a convenient
way to handle transaction logging and rollbacks without a separate
logging mechanism.
This means that Rucksack has to serialize all dirty objects when it
commits a transaction. This involves more writing to disk and more
total disk access than Elephant, which only writes changed slot values.
Within a transaction Rucksack provides an in-memory object cache of
dirty objects and maintains a cache of committed objects as well so that
future transactions don't need to re-serialize objects.
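The object-table scheme can be sketched roughly as follows. This is my
simplification, not Rucksack's code: a list stands in for the on-disk
heap, a dict for the OID vector, objects are plain dicts, and all dirty
objects are serialized at commit time. The Transaction class and its
method names are hypothetical.

```python
# Illustrative only: copy-on-write objects behind an OID table.
import pickle

heap = []          # append-only list standing in for the on-disk heap
object_table = {}  # oid -> heap index of the current committed version


class Transaction:
    def __init__(self):
        self.dirty = {}  # oid -> modified in-memory copy

    def load(self, oid):
        if oid in self.dirty:
            return self.dirty[oid]
        # Deserializing yields a private copy of the committed version.
        return pickle.loads(heap[object_table[oid]])

    def write_slot(self, oid, slot, value):
        if oid not in self.dirty:
            # Copy the committed version, or start empty for a new object.
            self.dirty[oid] = self.load(oid) if oid in object_table else {}
        self.dirty[oid][slot] = value

    def commit(self):
        for oid, obj in self.dirty.items():
            heap.append(pickle.dumps(obj))     # whole object is rewritten
            object_table[oid] = len(heap) - 1  # swing the table pointer
        self.dirty.clear()

    def rollback(self):
        # Nothing was published, so dropping the copies is enough.
        self.dirty.clear()


t1 = Transaction()
t1.write_slot(1, "name", "Ian")
t1.commit()

t2 = Transaction()
t2.write_slot(1, "age", 30)
t2.rollback()

t3 = Transaction()
t3.write_slot(1, "age", 31)
t3.commit()
```

After these three transactions the heap holds two versions of object 1.
Only the second is reachable from the table; the first is exactly the
kind of garbage a collector has to reclaim.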
MOP:
The metaobject protocol support for persistent objects is similar,
although Rucksack's is simpler in part because it makes more commitment
to object level storage instead of slot-level storage. Both Elephant
and Rucksack support schema evolution, the ability to redefine objects
at runtime and have the persistent instances updated as in
UPDATE-INSTANCE-FOR-REDEFINED-CLASS. Rucksack saves prior schemas so
old instances can be loaded and then updated. Elephant effectively does
the same by storing slot names so that the new schema can pick old
values stored in the same name, then run the loaded instance through the
update function. There are some potential pitfalls here in Elephant and
I was intending to fix them in a similar way to Rucksack as part of a
serializer enhancement to avoid writing slot names all the time.
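A rough sketch of the name-based approach, with stored instances
modelled as plain dicts of slot name to value. The migrate function is
hypothetical and only mirrors the shape of the protocol: surviving
slots keep their values, added slots get defaults, and discarded
values are handed back so an update function can use them.

```python
# Illustrative only: carry old slot values into a new schema by name.
def migrate(stored_slots, new_slot_defaults):
    """Return (new_instance, discarded) for a redefined class."""
    instance = {
        name: stored_slots.get(name, default)
        for name, default in new_slot_defaults.items()
    }
    discarded = {
        name: value
        for name, value in stored_slots.items()
        if name not in new_slot_defaults
    }
    return instance, discarded


old_record = {"name": "Ian", "phone": "555-0100"}
new_schema = {"name": None, "email": None}
instance, discarded = migrate(old_record, new_schema)
```

Note that a renamed slot looks like one discarded slot plus one added
slot, which is one of the places where matching purely by name can
lose information unless the update function handles it.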
Garbage collection:
Rucksack has a full incremental mark-and-sweep collector. Elephant only
has a poor-man's stop-and-copy via the repository migration interface
(support for doing this automatically is not built in and it's
expensive). Enough said.
ACID:
Rucksack has an elegant solution to ACID properties by copy-on-write
for persistent objects so that each parallel transaction has its own set
of live objects. This avoids conflicts but also delays rollbacks. When
a transaction has to abort because of a conflict, it just throws away
the live objects in memory and restarts. This does mean that rollbacks
are caused by object level write conflicts instead of slot conflicts.
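The difference in conflict granularity can be shown with a small
sketch. It is an idealization under my own assumptions: each
transaction is reduced to the set of (oid, slot) pairs it wrote, and
the conflicts function is hypothetical. Real slot-level behaviour also
depends on how the underlying store locks records and pages.

```python
# Illustrative only: object-level versus slot-level write conflicts.
def conflicts(writes_a, writes_b, granularity):
    """writes_* are sets of (oid, slot) pairs written by a transaction."""
    if granularity == "object":
        oids_a = {oid for oid, _ in writes_a}
        oids_b = {oid for oid, _ in writes_b}
        return bool(oids_a & oids_b)
    if granularity == "slot":
        return bool(writes_a & writes_b)
    raise ValueError(granularity)


# Two transactions touch different slots of the same object.
tx_a = {(7, "name")}
tx_b = {(7, "email")}

object_level = conflicts(tx_a, tx_b, "object")  # same object: conflict
slot_level = conflicts(tx_a, tx_b, "slot")      # disjoint slots: none
```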
Summary:
Rucksack is an elegant approach to persisting objects in Common Lisp.
Its interface and Elephant's are very similar but they take a number of
different and incompatible approaches to handling persistent slots,
transactions, locking, etc. I don't foresee significant performance
advantages on either side, but the serializer in Rucksack seems more
efficient for standard objects at the cost of some robustness on class
redefinition. I imagine I will be surprised by real-world benchmarks
later. For example, I suspect that transaction performance will vary
greatly based on workload. Typical website models should work the same
on either as there are far fewer possible transaction collisions.
Unfortunately Rucksack isn't easily re-targeted as a native lisp backend
for Elephant because of the greatly differing assumptions behind
persistent objects. There may be some code and design ideas that can
be lifted, however, such as the heap and btree implementations.
There are some smart ideas in the serializer and in schema evolution
that I've considered already so it's nice to have a reference
implementation to refer to.
Notable differences:
- Rucksack is a reasonably compact, easy-to-understand system written
entirely in Common Lisp. Elephant has complex dependencies between
Lisp, C and the architectural commitments of BDB. Elephant performs
poorly on SQL today so BDB is the high performance backend. BDB has
license issues for even small scale commercial deployment.
- Rucksack has full support for garbage collection; Elephant has minimal
off-line support for storage reclamation.
- Elephant will allow multiple lisp processes to use the same persistent
store concurrently, a Rucksack store is locked to a single lisp
instance. Elephant can be configured with BDB replication, allowing for
larger-scale deployment.
- Elephant is much more mature and its disk storage is much more likely
to be reliable so it will be some time until Rucksack is sufficiently
mature for prime time.
- Rucksack performs object-level collision detection; Elephant performs
record-based collision detection in a paged storage system. This has
different implications for how classes should be designed (slot values
with large arrays, for instance, should be wrapped in their own
persistent class so that writes to other slots do not result in
multiple copies of that array).
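The large-array point can be illustrated with a back-of-the-envelope
measurement. This is only a sketch: pickle stands in for the
serializer, objects are plain dicts, and the OID value 99 is arbitrary.
It compares how many bytes are re-serialized when a small counter slot
changes and the whole object is rewritten.

```python
# Illustrative only: cost of rewriting an object that holds a big array.
import pickle

big = list(range(10_000))

# Layout A: the array lives directly in a slot of the object.
inline_obj = {"counter": 0, "samples": big}

# Layout B: the array is wrapped in its own persistent object and the
# owner only stores a reference to it (an OID).
wrapper_obj = {"data": big}
wrapped_obj = {"counter": 0, "samples_oid": 99}

# Bytes written when only `counter` changes under object-level writes.
inline_cost = len(pickle.dumps(inline_obj))
wrapped_cost = len(pickle.dumps(wrapped_obj))

print(inline_cost, wrapped_cost)
```

With the wrapper, updating the counter rewrites only the small owner
object; the array is copied again only when the array itself changes.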
This review has been somewhat rambling, but I hope it makes people look
forward to playing with Rucksack, produces some good ideas for Elephant
and emphasizes that Elephant is ready for real-world (although probably
non-critical) applications today.
Ian