[elephant-devel] Lisp Btrees: design considerations

Wed May 14 14:13:14 UTC 2008

Here is the reference:  http://www.paulgraham.com/carl.html

I was quite wrong-headed on the performance point.  The biggest time  
cost in database operations is time spent moving data to/from disk, in  
comparison CPU time is quite small.  I was thinking about the internal  
copying overhead, which is still quite small.

Oracle and others have had good success with compression.  To get the  
juices flowing, this is a very clear (although a bit dated)  
articulation of the ideas: http://web.cecs.pdx.edu/~len/compression.pdf

Ian

On May 14, 2008, at 9:53 AM, Robert L. Read wrote:

> I am certainly in favor of using fixed-size pages as a basic model.  I
> think this is very common.  It has the advantage of allowing page- 
> based
> caching more easily.  One can even contemplate things like a RAID- 
> style
> striping across machines on a LAN or WAN.
>
> I think we should create a really pure-lisp solution, that doesn't  
> on C
> in any way.  Certainly we should do that first.
>
> However, I have personal reason for preferring a completely LISP-based
> serializer:  I think there is a lot of opportunity for major  
> performance
> enhancements by making the serializer more efficient---by which I mean
> producing very compressed representations of the data serialized.  For
> example, if we could produce a 10-fold reduction in the size of the
> compressed data, we would have a game-changing improvement, which  
> would
> probably make us much faster than BDB or any relational system.  If  
> that
> is true, then almost any amount of CPU time spent in the codec is
> warranted.
>
> A 10-fold improvement may seem unlikely until one stops to consider  
> the
> applicability of Huffman coding of English strings, prefix-forward- 
> index
> encoding of ordered strings, Huffman encoding on our little
> type-designators, the possibility of keeping a dictionary in a cached
> page for LZW-style compression, the possibility of detecting a Poisson
> distribution of integer values by declaration or statistics, etc.
> However, I think this is a very "researchy" line of thought.  I would
> like to have the ability to explore these ideas by writing them up in
> LISP code, even though they definitely would represent premature
> optimization until we had a basic B-Tree working.
>
> It's not 100% clear to me that C would offer a more performant  
> solution
> in any case.  It is probably true that a high-performance serializer
> would have to be tuned to various LISP implementations.
>
>
>
> On Tue, 2008-05-13 at 15:46 -0400, Ian Eslick wrote:
>> I think they key decision was what serialization format we're going  
>> to
>> use for btree nodes, log entries, etc and how that relates to caching
>> data during transactions, maintaining empty lists, etc.
>>
>> The current serializer produces a byte sequence.  If we continue with
>> that model, how do we write/read this stuff from disk?  How and where
>> do we store it prior to committing a transaction?
>>
>> When we create a new key or value as a binary stream within a
>> transaction, how is it stored in memory?  If we want a multi-process,
>> but non-socket based approach, we need to figure out how to store  
>> data
>> in shared memory, etc.
>>
>> For example, in BDB, the primitive is the page.  BTree nodes are  
>> layed
>> out in one or more pages, each page has some binary metadata
>> explaining it's type and organization (free list, etc).  A short key
>> value is written directly into the page, a long one is written into  
>> an
>> overflow page, etc.  Lots of details to deal with in managing  
>> variable
>> sized data on disk.  Pages that are dirty are kept in memory (which  
>> is
>> why BDB can run out of transaction space; the pages overflow the max
>> cache size when you are writing lots of data).
>>
>>
>> However, to get started, the easiest thing is to reuse the existing
>> memutils serializer, not worry about multi-process operation and not
>> worry about fragmentation, sharing space and maintaining free lists
>> (except perhaps for btree nodes).
>>
>> Something like:
>> - btree nodes only keep pointers to variable sized keys stored
>> elsewhere in the file
>> - new keys and values of differing or less length are written in
>> place, otherwise new
>>   space is allocated at the end of the file.
>> - btree nodes are a fixed size page on-disk and keep some free-list
>> information so we can reuse them.
>> - transactions simply keep track of the primitive operations on the
>> database and the associated data in a memory queue and write those  
>> ops
>> to disk as part of the txn commit process.  The pages and key/value
>> pairs that will be touched in that operation are also stored in that
>> txn log.
>> - when a transaction commits, it replays the log to write everything
>> to disk appropriately.  The list of touched data is then passed up  
>> the
>> commit chain to invalidate any pending transactions that have a
>> conflict.  Everything is speculative in this case, but we don't have
>> to deal with locking.
>>
>> This is a nice balance between some lisp-sexp serialization format
>> that performs poorly, and a highly-optimized low-level implementation
>> which is blindingly fast.
>>
>> A big decision is:
>> - Use cffi/uffi and do much of the serialization & btree
>> implementation in C/static memory
>>   or do all of this in pools of static arrays and write a new
>> serializer to operate on lisp data.
>>
>> I lean towards using cffi to manipulate static data, just because  
>> it's
>> going to be easier to get performance via that method and it's also
>> going to be much easier to do a multi-process implementation (by
>> operating on static memory and primitive locks in a shared memory
>> region).
>>
>> Predicated on that decision, getting started on the simplest possible
>> btree/dup-btree implementation is the next, most valuable and
>> educational step.
>>
>> The key pieces for a single-process lisp backend:
>> - btrees and dup-btrees (indices can be built from these two easily
>> enough)
>>   - the binary pages could be stored in static data and the
>> primitives btree ops
>>     could directly manipulate data within the page?  We pass a C
>> function that
>>     directly sorts binary sequences rather than having to deserialize
>> to sort.  We'd
>>     need to write that in lisp to operate on static data or on lisp
>> arrays.  Deserializing
>>     on each key comparison is too expensive.
>> - a set of transaction records (lisp structs and consts)
>>   - simply keeps tuples of (op {rd | wr} {btree+page-offset | value-
>> offset}  [values])
>>     in a memory queue.  Could use static memory for this to reduce
>> load on GC
>> - a blocking primitive library that serializes txn commits
>>   (i.e. write log to disk, write data to disk, write 'commit done' to
>> log,
>>    invalidate pending/conflicting txns)
>>
>> A nice auxiliary hack would be:
>> - rewrite memutils to entirely use uffi/cffi to manipulate static  
>> data
>> rather
>>   than calling out to C to do it.  Maintains efficiency but removes
>> the compilation
>>   build step except for users of BDB
>>
>>
>> So what do people think about the cffi->static-data vs. lisp->array-
>> pool decision?
>>
>>
>> Ian
>>
>> On May 13, 2008, at 2:03 PM, Leslie P. Polzer wrote:
>>
>>>
>>> I suppose the "binary paging" approach mentioned in the design
>>> considerations
>>> document means the problem of organizing the data efficiently on  
>>> disk
>>> right from the start. Is this correct?
>>>
>>> Do you think it would make good sense to start working on the btree
>>> library
>>> without thinking much about on-disk efficiency, leaving this part
>>> for later?
>>>
>>> I'm not sure a btree where on-disk storage organization is separeted
>>> from the
>>> rest like that can achieve enough efficiency...
>>>
>>> Thanks,
>>>
>>>   Leslie
>>>
>>> _______________________________________________
>>> elephant-devel site list
>>> elephant-devel at common-lisp.net
>>> http://common-lisp.net/mailman/listinfo/elephant-devel
>>
>> _______________________________________________
>> elephant-devel site list
>> elephant-devel at common-lisp.net
>> http://common-lisp.net/mailman/listinfo/elephant-devel
>
> _______________________________________________
> elephant-devel site list
> elephant-devel at common-lisp.net
> http://common-lisp.net/mailman/listinfo/elephant-devel