[elephant-devel] Lisp Btrees: design considerations
Ian Eslick
eslick at media.mit.edu
Wed May 14 14:13:14 UTC 2008
Here is the reference: http://www.paulgraham.com/carl.html
I was quite wrong-headed on the performance point. The biggest time
cost in database operations is time spent moving data to/from disk, in
comparison CPU time is quite small. I was thinking about the internal
copying overhead, which is still quite small.
Oracle and others have had good success with compression. To get the
juices flowing, this is a very clear (although a bit dated)
articulation of the ideas: http://web.cecs.pdx.edu/~len/compression.pdf
Ian
On May 14, 2008, at 9:53 AM, Robert L. Read wrote:
> I am certainly in favor of using fixed-size pages as a basic model. I
> think this is very common. It has the advantage of allowing page-
> based
> caching more easily. One can even contemplate things like a RAID-
> style
> striping across machines on a LAN or WAN.
>
> I think we should create a really pure-lisp solution, that doesn't
> on C
> in any way. Certainly we should do that first.
>
> However, I have personal reason for preferring a completely LISP-based
> serializer: I think there is a lot of opportunity for major
> performance
> enhancements by making the serializer more efficient---by which I mean
> producing very compressed representations of the data serialized. For
> example, if we could produce a 10-fold reduction in the size of the
> compressed data, we would have a game-changing improvement, which
> would
> probably make us much faster than BDB or any relational system. If
> that
> is true, then almost any amount of CPU time spent in the codec is
> warranted.
>
> A 10-fold improvement may seem unlikely until one stops to consider
> the
> applicability of Huffman coding of English strings, prefix-forward-
> index
> encoding of ordered strings, Huffman encoding on our little
> type-designators, the possibility of keeping a dictionary in a cached
> page for LZW-style compression, the possibility of detecting a Poisson
> distribution of integer values by declaration or statistics, etc.
> However, I think this is a very "researchy" line of thought. I would
> like to have the ability to explore these ideas by writing them up in
> LISP code, even though they definitely would represent premature
> optimization until we had a basic B-Tree working.
>
> It's not 100% clear to me that C would offer a more performant
> solution
> in any case. It is probably true that a high-performance serializer
> would have to be tuned to various LISP implementations.
>
>
>
> On Tue, 2008-05-13 at 15:46 -0400, Ian Eslick wrote:
>> I think they key decision was what serialization format we're going
>> to
>> use for btree nodes, log entries, etc and how that relates to caching
>> data during transactions, maintaining empty lists, etc.
>>
>> The current serializer produces a byte sequence. If we continue with
>> that model, how do we write/read this stuff from disk? How and where
>> do we store it prior to committing a transaction?
>>
>> When we create a new key or value as a binary stream within a
>> transaction, how is it stored in memory? If we want a multi-process,
>> but non-socket based approach, we need to figure out how to store
>> data
>> in shared memory, etc.
>>
>> For example, in BDB, the primitive is the page. BTree nodes are
>> layed
>> out in one or more pages, each page has some binary metadata
>> explaining it's type and organization (free list, etc). A short key
>> value is written directly into the page, a long one is written into
>> an
>> overflow page, etc. Lots of details to deal with in managing
>> variable
>> sized data on disk. Pages that are dirty are kept in memory (which
>> is
>> why BDB can run out of transaction space; the pages overflow the max
>> cache size when you are writing lots of data).
>>
>>
>> However, to get started, the easiest thing is to reuse the existing
>> memutils serializer, not worry about multi-process operation and not
>> worry about fragmentation, sharing space and maintaining free lists
>> (except perhaps for btree nodes).
>>
>> Something like:
>> - btree nodes only keep pointers to variable sized keys stored
>> elsewhere in the file
>> - new keys and values of differing or less length are written in
>> place, otherwise new
>> space is allocated at the end of the file.
>> - btree nodes are a fixed size page on-disk and keep some free-list
>> information so we can reuse them.
>> - transactions simply keep track of the primitive operations on the
>> database and the associated data in a memory queue and write those
>> ops
>> to disk as part of the txn commit process. The pages and key/value
>> pairs that will be touched in that operation are also stored in that
>> txn log.
>> - when a transaction commits, it replays the log to write everything
>> to disk appropriately. The list of touched data is then passed up
>> the
>> commit chain to invalidate any pending transactions that have a
>> conflict. Everything is speculative in this case, but we don't have
>> to deal with locking.
>>
>> This is a nice balance between some lisp-sexp serialization format
>> that performs poorly, and a highly-optimized low-level implementation
>> which is blindingly fast.
>>
>> A big decision is:
>> - Use cffi/uffi and do much of the serialization & btree
>> implementation in C/static memory
>> or do all of this in pools of static arrays and write a new
>> serializer to operate on lisp data.
>>
>> I lean towards using cffi to manipulate static data, just because
>> it's
>> going to be easier to get performance via that method and it's also
>> going to be much easier to do a multi-process implementation (by
>> operating on static memory and primitive locks in a shared memory
>> region).
>>
>> Predicated on that decision, getting started on the simplest possible
>> btree/dup-btree implementation is the next, most valuable and
>> educational step.
>>
>> The key pieces for a single-process lisp backend:
>> - btrees and dup-btrees (indices can be built from these two easily
>> enough)
>> - the binary pages could be stored in static data and the
>> primitives btree ops
>> could directly manipulate data within the page? We pass a C
>> function that
>> directly sorts binary sequences rather than having to deserialize
>> to sort. We'd
>> need to write that in lisp to operate on static data or on lisp
>> arrays. Deserializing
>> on each key comparison is too expensive.
>> - a set of transaction records (lisp structs and consts)
>> - simply keeps tuples of (op {rd | wr} {btree+page-offset | value-
>> offset} [values])
>> in a memory queue. Could use static memory for this to reduce
>> load on GC
>> - a blocking primitive library that serializes txn commits
>> (i.e. write log to disk, write data to disk, write 'commit done' to
>> log,
>> invalidate pending/conflicting txns)
>>
>> A nice auxiliary hack would be:
>> - rewrite memutils to entirely use uffi/cffi to manipulate static
>> data
>> rather
>> than calling out to C to do it. Maintains efficiency but removes
>> the compilation
>> build step except for users of BDB
>>
>>
>> So what do people think about the cffi->static-data vs. lisp->array-
>> pool decision?
>>
>>
>> Ian
>>
>> On May 13, 2008, at 2:03 PM, Leslie P. Polzer wrote:
>>
>>>
>>> I suppose the "binary paging" approach mentioned in the design
>>> considerations
>>> document means the problem of organizing the data efficiently on
>>> disk
>>> right from the start. Is this correct?
>>>
>>> Do you think it would make good sense to start working on the btree
>>> library
>>> without thinking much about on-disk efficiency, leaving this part
>>> for later?
>>>
>>> I'm not sure a btree where on-disk storage organization is separeted
>>> from the
>>> rest like that can achieve enough efficiency...
>>>
>>> Thanks,
>>>
>>> Leslie
>>>
>>> _______________________________________________
>>> elephant-devel site list
>>> elephant-devel at common-lisp.net
>>> http://common-lisp.net/mailman/listinfo/elephant-devel
>>
>> _______________________________________________
>> elephant-devel site list
>> elephant-devel at common-lisp.net
>> http://common-lisp.net/mailman/listinfo/elephant-devel
>
> _______________________________________________
> elephant-devel site list
> elephant-devel at common-lisp.net
> http://common-lisp.net/mailman/listinfo/elephant-devel
More information about the elephant-devel
mailing list