[elephant-devel] Lisp Btrees: design considerations

Tue May 13 19:46:10 UTC 2008

I think they key decision was what serialization format we're going to  
use for btree nodes, log entries, etc and how that relates to caching  
data during transactions, maintaining empty lists, etc.

The current serializer produces a byte sequence.  If we continue with  
that model, how do we write/read this stuff from disk?  How and where  
do we store it prior to committing a transaction?

When we create a new key or value as a binary stream within a  
transaction, how is it stored in memory?  If we want a multi-process,  
but non-socket based approach, we need to figure out how to store data  
in shared memory, etc.

For example, in BDB, the primitive is the page.  BTree nodes are layed  
out in one or more pages, each page has some binary metadata  
explaining it's type and organization (free list, etc).  A short key  
value is written directly into the page, a long one is written into an  
overflow page, etc.  Lots of details to deal with in managing variable  
sized data on disk.  Pages that are dirty are kept in memory (which is  
why BDB can run out of transaction space; the pages overflow the max  
cache size when you are writing lots of data).

However, to get started, the easiest thing is to reuse the existing  
memutils serializer, not worry about multi-process operation and not  
worry about fragmentation, sharing space and maintaining free lists  
(except perhaps for btree nodes).

Something like:
- btree nodes only keep pointers to variable sized keys stored  
elsewhere in the file
- new keys and values of differing or less length are written in  
place, otherwise new
   space is allocated at the end of the file.
- btree nodes are a fixed size page on-disk and keep some free-list  
information so we can reuse them.
- transactions simply keep track of the primitive operations on the  
database and the associated data in a memory queue and write those ops  
to disk as part of the txn commit process.  The pages and key/value  
pairs that will be touched in that operation are also stored in that  
txn log.
- when a transaction commits, it replays the log to write everything  
to disk appropriately.  The list of touched data is then passed up the  
commit chain to invalidate any pending transactions that have a  
conflict.  Everything is speculative in this case, but we don't have  
to deal with locking.

This is a nice balance between some lisp-sexp serialization format  
that performs poorly, and a highly-optimized low-level implementation  
which is blindingly fast.

A big decision is:
- Use cffi/uffi and do much of the serialization & btree  
implementation in C/static memory
   or do all of this in pools of static arrays and write a new  
serializer to operate on lisp data.

I lean towards using cffi to manipulate static data, just because it's  
going to be easier to get performance via that method and it's also  
going to be much easier to do a multi-process implementation (by  
operating on static memory and primitive locks in a shared memory  
region).

Predicated on that decision, getting started on the simplest possible  
btree/dup-btree implementation is the next, most valuable and  
educational step.

The key pieces for a single-process lisp backend:
- btrees and dup-btrees (indices can be built from these two easily  
enough)
   - the binary pages could be stored in static data and the  
primitives btree ops
     could directly manipulate data within the page?  We pass a C  
function that
     directly sorts binary sequences rather than having to deserialize  
to sort.  We'd
     need to write that in lisp to operate on static data or on lisp  
arrays.  Deserializing
     on each key comparison is too expensive.
- a set of transaction records (lisp structs and consts)
   - simply keeps tuples of (op {rd | wr} {btree+page-offset | value- 
offset}  [values])
     in a memory queue.  Could use static memory for this to reduce  
load on GC
- a blocking primitive library that serializes txn commits
   (i.e. write log to disk, write data to disk, write 'commit done' to  
log,
    invalidate pending/conflicting txns)

A nice auxiliary hack would be:
- rewrite memutils to entirely use uffi/cffi to manipulate static data  
rather
   than calling out to C to do it.  Maintains efficiency but removes  
the compilation
   build step except for users of BDB

So what do people think about the cffi->static-data vs. lisp->array- 
pool decision?

Ian

On May 13, 2008, at 2:03 PM, Leslie P. Polzer wrote:

>
> I suppose the "binary paging" approach mentioned in the design  
> considerations
> document means the problem of organizing the data efficiently on disk
> right from the start. Is this correct?
>
> Do you think it would make good sense to start working on the btree  
> library
> without thinking much about on-disk efficiency, leaving this part  
> for later?
>
> I'm not sure a btree where on-disk storage organization is separeted  
> from the
> rest like that can achieve enough efficiency...
>
>  Thanks,
>
>    Leslie
>
> _______________________________________________
> elephant-devel site list
> elephant-devel at common-lisp.net
> http://common-lisp.net/mailman/listinfo/elephant-devel