[elephant-devel] Lisp Btrees: design considerations

Ian Eslick eslick at media.mit.edu
Wed May 14 14:05:35 UTC 2008


Here is an interesting discussion of using lisp for a large,
high-performance system and the reasons to use static/C data.



I wasn't advocating using C at all; in fact, I was advocating removing
all dependency on it for the lisp-only backend.  With cffi or uffi you
can get direct access to memory primitives, like allocating data and
reading/writing pointers, without needing a C compiler or library.  The
only difference is that you aren't using lisp heap data structures.
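
To make that concrete, here is a minimal sketch of the kind of thing I
mean, assuming only that cffi is loaded (none of it needs a C toolchain):

  (cffi:with-foreign-object (buf :uint8 4096)
    ;; write a 32-bit word and a single byte into static memory
    (setf (cffi:mem-ref buf :uint32 0) 42
          (cffi:mem-aref buf :uint8 4) #xFF)
    ;; and read them back
    (values (cffi:mem-ref buf :uint32 0)
            (cffi:mem-aref buf :uint8 4)))

The allocation, the pointer arithmetic and the reads/writes all go
through the FFI; no foreign code is compiled or linked.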

I'm all for performance improvements, and I agree that there is
reasonable compressibility in much of the data we generate.  But disk
space is virtually free these days, and compression trades time, perhaps
significantly, to save space: you now have to decompress each key before
doing an ordering comparison, unless you are very format-sensitive
(e.g. you keep a global tokenization of language strings).  I'm
supportive, but skeptical.  :)

Perhaps there is a way to hook into the comparison function and give
hints to the serializer, so that it generates a tagged, compressed
format that a custom comparison operator can dispatch on; the user or
system hacker could then write these domain- or format-specific
extensions of the serializer.
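
As a rough sketch of the dispatch part (the tag-byte layout and the
names here are purely my invention):

  ;; every serialized key starts with a tag byte saying how it was encoded
  (defgeneric compare-tagged (tag a b)
    (:documentation "Order two serialized keys without fully deserializing them."))

  ;; default method: lexicographic comparison of the raw octets after the tag
  (defmethod compare-tagged ((tag t) (a vector) (b vector))
    (let ((la (length a)) (lb (length b)))
      (loop for i from 1 below (min la lb)
            unless (= (aref a i) (aref b i))
              return (if (< (aref a i) (aref b i)) -1 1)
            finally (return (cond ((< la lb) -1) ((> la lb) 1) (t 0))))))

  (defun compare-keys (a b)
    (compare-tagged (aref a 0) a b))

A domain-specific extension would then just be another method, say one
specialized on (eql +huffman-string-tag+), that knows how to order that
particular compressed representation without expanding it.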

One big point from all this is that we can't do this on other backends,
because BDB does the comparisons internally and needs a C function to
perform the ordering comparison.  We don't want to artificially impose
that restriction on an all-lisp backend.

Ian

On May 14, 2008, at 9:53 AM, Robert L. Read wrote:

> I am certainly in favor of using fixed-size pages as a basic model.  I
> think this is very common.  It has the advantage of making page-based
> caching easier.  One can even contemplate things like RAID-style
> striping across machines on a LAN or WAN.
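>
> Just to make "page" concrete, I am picturing something as simple as
> this (sizes and slot names made up):
>
>   (defconstant +page-size+ 8192)
>
>   (defstruct page
>     (number 0 :type fixnum)   ; position of the page within the file
>     (dirty-p nil)             ; must be written back before eviction?
>     (bytes (make-array +page-size+ :element-type '(unsigned-byte 8))))
>
>   (defun read-page (stream n)
>     "Fetch page N from an open (unsigned-byte 8) file stream."
>     (let ((p (make-page :number n)))
>       (file-position stream (* n +page-size+))
>       (read-sequence (page-bytes p) stream)
>       p))
>
> The cache is then just a hash table from page number to page, and
> eviction writes dirty pages back at (* number +page-size+).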
>
> I think we should create a really pure-lisp solution that doesn't depend
> on C in any way.  Certainly we should do that first.
>
> However, I have a personal reason for preferring a completely LISP-based
> serializer: I think there is a lot of opportunity for major performance
> enhancements by making the serializer more efficient---by which I mean
> producing very compressed representations of the serialized data.  For
> example, if we could produce a 10-fold reduction in the size of the
> serialized data, we would have a game-changing improvement, which would
> probably make us much faster than BDB or any relational system.  If that
> is true, then almost any amount of CPU time spent in the codec is
> warranted.
>
> A 10-fold improvement may seem unlikely until one stops to consider the
> applicability of Huffman coding of English strings, prefix-forward-index
> encoding of ordered strings, Huffman encoding of our little
> type-designators, the possibility of keeping a dictionary in a cached
> page for LZW-style compression, the possibility of detecting a Poisson
> distribution of integer values by declaration or statistics, etc.
> However, I think this is a very "researchy" line of thought.  I would
> like to have the ability to explore these ideas by writing them up in
> LISP code, even though they would definitely represent premature
> optimization until we have a basic B-Tree working.
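>
> To pick just one of those, prefix-forward encoding of ordered strings
> is nearly trivial (a toy sketch, nothing more):
>
>   (defun prefix-encode (prev key)
>     "Return (shared-prefix-length . suffix) for KEY relative to PREV."
>     (let ((n (or (mismatch prev key) (min (length prev) (length key)))))
>       (cons n (subseq key n))))
>
>   ;; (prefix-encode "elephant-devel" "elephant-user") => (9 . "user")
>
> Runs of adjacent keys in a sorted node then collapse to a small integer
> plus a short suffix.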
>
> It's not 100% clear to me that C would offer a more performant solution
> in any case.  It is probably true that a high-performance serializer
> would have to be tuned to various LISP implementations.
>
>
>
> On Tue, 2008-05-13 at 15:46 -0400, Ian Eslick wrote:
>> I think the key decision is what serialization format we're going to
>> use for btree nodes, log entries, etc., and how that relates to caching
>> data during transactions, maintaining free lists, etc.
>>
>> The current serializer produces a byte sequence.  If we continue with
>> that model, how do we write/read this stuff from disk?  How and where
>> do we store it prior to committing a transaction?
>>
>> When we create a new key or value as a binary stream within a
>> transaction, how is it stored in memory?  If we want a multi-process
>> but non-socket-based approach, we need to figure out how to store data
>> in shared memory, etc.
>>
>> For example, in BDB the primitive is the page.  BTree nodes are laid
>> out in one or more pages, and each page has some binary metadata
>> explaining its type and organization (free list, etc.).  A short key or
>> value is written directly into the page, a long one is written into an
>> overflow page, and so on.  There are lots of details to deal with in
>> managing variable-sized data on disk.  Dirty pages are kept in memory
>> (which is why BDB can run out of transaction space: the pages overflow
>> the maximum cache size when you are writing lots of data).
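>>
>> In other words, each page starts with a small header, something like
>> the following (field names invented, just to show the shape of it):
>>
>>   (defstruct (node-header (:conc-name nh-))
>>     (page-type   :btree-leaf)   ; or :btree-internal, :overflow, :free
>>     (entry-count 0)             ; number of key/value slots in use
>>     (free-offset 0)             ; where unused space in this page begins
>>     (next-page   nil))          ; sibling link for leaf scans
>>
>> serialized into the first few bytes of the page, with short key/value
>> pairs written inline after it and long ones replaced by a pointer to an
>> overflow page.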
>>
>>
>> However, to get started, the easiest thing is to reuse the existing
>> memutils serializer, not worry about multi-process operation and not
>> worry about fragmentation, sharing space and maintaining free lists
>> (except perhaps for btree nodes).
>>
>> Something like:
>> - btree nodes only keep pointers to variable-sized keys stored
>> elsewhere in the file
>> - new keys and values of equal or smaller size are written in place;
>> otherwise new space is allocated at the end of the file
>> - btree nodes are a fixed-size page on disk and keep some free-list
>> information so we can reuse them
>> - transactions simply keep track of the primitive operations on the
>> database and the associated data in a memory queue, and write those ops
>> to disk as part of the txn commit process.  The pages and key/value
>> pairs that will be touched by those operations are also stored in the
>> txn log.
>> - when a transaction commits, it replays the log to write everything
>> to disk appropriately (see the sketch after this list).  The list of
>> touched data is then passed up the commit chain to invalidate any
>> pending transactions that have a conflict.  Everything is speculative
>> in this case, but we don't have to deal with locking.
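>>
>> As a strawman for the last two points (all names invented; write-op,
>> apply-op, write-commit-record and invalidate-txn are stubs to be filled
>> in, and there is no locking or error handling):
>>
>>   (defstruct txn
>>     (ops '())       ; queued primitive operations, newest first
>>     (touched '()))  ; pages and key/value slots this txn will write
>>
>>   (defstruct op kind page offset bytes)  ; kind is :write-key, :write-value, ...
>>
>>   (defun txn-commit (txn log-stream data-file pending-txns)
>>     ;; 1. append the whole op queue to the log
>>     (dolist (op (reverse (txn-ops txn)))
>>       (write-op op log-stream))               ; hypothetical stub
>>     ;; 2. replay it against the data file
>>     (dolist (op (reverse (txn-ops txn)))
>>       (apply-op op data-file))                ; hypothetical stub
>>     ;; 3. mark the commit done, then invalidate conflicting txns
>>     (write-commit-record log-stream)          ; hypothetical stub
>>     (dolist (other pending-txns)
>>       (when (intersection (txn-touched txn) (txn-touched other)
>>                           :test #'equal)
>>         (invalidate-txn other))))             ; hypothetical stub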
>>
>> This is a nice balance between some lisp-sexp serialization format
>> that performs poorly, and a highly-optimized low-level implementation
>> which is blindingly fast.
>>
>> A big decision is whether to:
>> - use cffi/uffi and do much of the serialization & btree implementation
>> in C/static memory, or
>> - do all of this in pools of static lisp arrays and write a new
>> serializer to operate on lisp data.
>>
>> I lean towards using cffi to manipulate static data, just because it's
>> going to be easier to get performance via that method, and it's also
>> going to be much easier to do a multi-process implementation (by
>> operating on static memory and primitive locks in a shared memory
>> region).
>>
>> Predicated on that decision, getting started on the simplest possible
>> btree/dup-btree implementation is the next step, and the most valuable
>> and educational one.
>>
>> The key pieces for a single-process lisp backend:
>> - btrees and dup-btrees (indices can be built from these two easily
>> enough)
>>   - the binary pages could be stored in static data, and the primitive
>> btree ops could directly manipulate data within the page.  With BDB we
>> pass a C function that sorts the binary sequences directly rather than
>> deserializing to sort; we'd need to write the equivalent in lisp,
>> operating on static data or on lisp arrays (see the sketch after this
>> list).  Deserializing on each key comparison is too expensive.
>> - a set of transaction records (lisp structs and consts)
>>   - simply keeps tuples of (op {rd | wr} {btree+page-offset |
>> value-offset} [values]) in a memory queue.  Could use static memory for
>> this to reduce load on the GC.
>> - a blocking primitive library that serializes txn commits
>>   (i.e. write the log to disk, write the data to disk, write 'commit
>> done' to the log, invalidate pending/conflicting txns)
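>>
>> For the comparison piece, the lisp-on-static-data version might look
>> something like this (a sketch only; it assumes cffi and a 4-byte length
>> prefix on each key, which is my invention, not the current memutils
>> format):
>>
>>   (defun compare-static-keys (ptr1 off1 ptr2 off2)
>>     "Compare two length-prefixed byte strings in foreign memory."
>>     (let ((len1 (cffi:mem-ref ptr1 :uint32 off1))
>>           (len2 (cffi:mem-ref ptr2 :uint32 off2)))
>>       (loop for i from 0 below (min len1 len2)
>>             ;; read the i-th byte of each key, skipping the length word
>>             do (let ((b1 (cffi:mem-aref ptr1 :uint8 (+ off1 4 i)))
>>                      (b2 (cffi:mem-aref ptr2 :uint8 (+ off2 4 i))))
>>                  (unless (= b1 b2)
>>                    (return (if (< b1 b2) -1 1))))
>>             finally (return (cond ((< len1 len2) -1)
>>                                   ((> len1 len2) 1)
>>                                   (t 0))))))
>>
>> No deserialization happens; the ordering is computed straight off the
>> bytes in the page.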
>>
>> A nice auxiliary hack would be:
>> - rewrite memutils to use uffi/cffi entirely to manipulate static data
>> rather than calling out to C to do it.  This maintains efficiency but
>> removes the compilation step from the build except for users of BDB.
>>
>>
>> So what do people think about the cffi->static-data vs. lisp->array-pool
>> decision?
>>
>>
>> Ian
>>
>> On May 13, 2008, at 2:03 PM, Leslie P. Polzer wrote:
>>
>>>
>>> I suppose the "binary paging" approach mentioned in the design
>>> considerations document refers to the problem of organizing the data
>>> efficiently on disk right from the start.  Is this correct?
>>>
>>> Do you think it would make good sense to start working on the btree
>>> library without thinking much about on-disk efficiency, leaving that
>>> part for later?
>>>
>>> I'm not sure a btree where the on-disk storage organization is
>>> separated from the rest like that can achieve enough efficiency...
>>>
>>> Thanks,
>>>
>>>   Leslie
>>>



