[elephant-devel] Lisp Btrees: design considerations

Ian Eslick eslick at media.mit.edu
Wed May 14 13:40:40 UTC 2008


On May 14, 2008, at 3:28 AM, Leslie P. Polzer wrote:

>
>> For example, in BDB, the primitive is the page.  BTree nodes are  
>> layed
>> out in one or more pages, each page has some binary metadata
>> explaining it's type and organization (free list, etc).  A short key
>> value is written directly into the page, a long one is written into  
>> an
>> overflow page, etc.
>
> InnoDB also uses this approach.

There is a massive body of work and many variations on ways to lay out  
indexing structures and data on disk.  The tradeoffs depend on your  
specific needs.  My recommendation is we don't get ambitious and stick  
with a simple BTree or perhaps B+Tree for now.  If we abstract  
properly, we should be able to replace the underlying page storage  
mechanism later.

>
>> - new keys and values of differing or less length are written in
>> place, otherwise new
>>   space is allocated at the end of the file.
>
> Okay, but maybe let the user reserve extent space.


>
>> - transactions simply keep track of the primitive operations on the
>> database and the associated data in a memory queue and write those  
>> ops
>> to disk as part of the txn commit process.  The pages and key/value
>> pairs that will be touched in that operation are also stored in that
>> txn log.
>> - when a transaction commits, it replays the log to write everything
>> to disk appropriately.  The list of touched data is then passed up  
>> the
>> commit chain to invalidate any pending transactions that have a
>> conflict.  Everything is speculative in this case, but we don't have
>> to deal with locking.
>
> I like this approach.
>
>
>> A big decision is:
>> - Use cffi/uffi and do much of the serialization & btree
>> implementation in C/static memory
>
> IME this FFI stuff can be quite hard to debug.

That's true, but the CFFI operations would basically be the primitive  
set that are already used in memutils to implement buffer streams.

>
>> I lean towards using cffi to manipulate static data, just because  
>> it's
>> going to be easier to get performance via that method and it's also
>> going to be much easier to do a multi-process implementation (by
>> operating on static memory and primitive locks in a shared memory
>> region).
>
> I cannot estimate the performance trade-offs involved here,
> but in general I'm in favor of a Lisp-based approach...

I think if we abstract correctly with a more direct approach in mind,  
we could always go back and do it the other way later.  Probably best  
to start with lisp, it just may mean more work in the meantime to  
rewrite the functionality in memutils and serializer2.

>
>>   - the binary pages could be stored in static data
>
> I don't think I understand this. What does "static" mean here
> (and above)?

Static data just means that allocated from the C heap, not the lisp  
garbage collector.  Lisp data structures

>
>> and the primitives btree ops
>>     could directly manipulate data within the page?  We pass a C
>> function that
>>     directly sorts binary sequences rather than having to deserialize
>> to sort.  We'd
>>     need to write that in lisp to operate on static data or on lisp
>> arrays.  Deserializing
>>     on each key comparison is too expensive.
>
> Yes.
>
>
>> A nice auxiliary hack would be:
>> - rewrite memutils to entirely use uffi/cffi to manipulate static  
>> data
>> rather
>>   than calling out to C to do it.  Maintains efficiency but removes
>> the compilation
>>   build step except for users of BDB
>
> I have looked at memutils/the serializer, but it's very hard for me
> to replace it by something else, because I'm not sure what would be
> required of a replacement.
>
> The partial conclusion to which I came was that memutils just models
> a bivalent stream so the backend can communicate with the serializer.

That's essentially correct, although memutils was written to implement  
buffer streams which are then used by the serializer - in effect it's  
the serializer that uses memutils to send serialized buffers to the  
data stores.  It was originally designed for BDB and has the benefit  
of avoiding an extra copy step when talking to the BDB API.  If you  
are serializing structures into a lisp array, to pass them to C you  
need to copy them to a C array, and BDB then copies that array into  
the appropriate cached page and ultimately writes that page to disk.

Using memutils as-is is actually more expensive for a lisp-only  
solution if we're storing pages in lisp arrays (which hopefully will  
quickly become tenured and stop being copied around by the  
collector).  We would then write into a C array with memutils, copy  
into a lisp array and write that to disk rather than writing directly  
to the lisp array.

> The data format is hard to figure out for me, however, because of
> all the UFFI stuff involved...

The serializer is probably fine as it is from a functional standpoint,  
but to do it in lisp we'll need to change out all the primitives it  
uses (buffer-write-int32, buffer-write-byte, etc).

buffers streams are wrappers around C arrays

(defstruct buffer-stream
   "A stream-like interface to foreign (alien) char buffers."
   (buffer (allocate-foreign-object :unsigned-char 10) :type array-or- 
pointer-char)
   (size 0 :type fixnum)
   (position 0 :type fixnum)
   (length 10 :type fixnum))

We allocate that object directly from the C heap (just like malloc)  
using the allocate-foreign-object call, this is currently in uffi and  
using the uffi compat layer from cffi leads to errors.

Size is the current size of valid data in the array (a write  
pointer).  The position is the current read pointer and the length is  
the total size of the allocated region.

If I write two int32's into the stream the values above are:
size 8, position 0, length 10


If the BTree node binary data is stored in a lisp octet array, and we  
don't want to deserialize to compare keys, then we need to write the  
procedure that you can find in libberkeley-db.c in the src/db-bdb  
file.  It performs a content-sensitive comparison on a byte-by-byte  
basis.  This has huge benefits of allowing us to do key comparisons  
'in-place' without creating lisp objects.  We pass this C function to  
BDB which uses it to compare keys and values directly in the binary  
file pages.

I tell you what, if we wrap an abstraction around a lisp paged btree  
approach, we can cheat with memutils for the time being and replace it  
later.  Every field written to a binary page should have an extent  
associated with it.  To compare keys, we copy that extent to a buffer  
stream, run the comparison operation in the existing memutils, etc.   
To serialize/deserialize, we just copy to/from the lisp binary page  
and the buffer stream abstraction.

To lookup data in a page you would do something like:

serialize key to a buffer stream (bs1)
database file => read fixed sized page into an array
page => copy serialized key to buffer stream (bs2)
compare bs1 and bs2
look at btree page to decide what next key to compare or,
copy the value to a buffer stream bs3 and run the deserializer.

Later this should be something like:
serialize key to a lisp array
database file => read fixed sized page into an array
byte-by-byte comparison of key in lisp array to a field at offset +  
length in the binary page
on success, deserialize from the offset + length field in the binary  
page.

I'm sorry if this is confusing.  I have some code dealing with fixed  
pools of binary pages and some attempts at a btree and log  
implementation in src/contrib/db-lisp.  Rucksack is also an excellent  
source of ideas.  the Rucksack persistence model is too different from  
elephant for me to be comfortable adapting it, but it has a much more  
elegant serializer and an all-in-lisp implementation of btrees (as I  
recall).  That is another great place to start getting ideas.

Ian


>  Leslie
>
> _______________________________________________
> elephant-devel site list
> elephant-devel at common-lisp.net
> http://common-lisp.net/mailman/listinfo/elephant-devel




More information about the elephant-devel mailing list