[vivace-graph-devel] elephant

Wed Sep 14 20:01:29 UTC 2011

Comments inline below.

> Even with MMAPing there is there not still some significant overhead
> associated with deserializing the MMAPPed data to something lispy?

Yes, for persistence, this is unavoidable;  the solution is heavy 
caching and interning of strings (VG already does this).  I started work 
on some code a long time ago that worked like this:

One mmap'ed linear hash file per graph, with the triple-id as hash key. 
  The value slot is an offset (integer) into another mmap'ed file (or 
files) which is the actual triple storage area.  Because triples are 
never truly deleted, only a simple memory allocator is needed for the 
triple storage file (append only).  For indices, use some b-tree variant 
(possibly start with cl-btree: http://www.cliki.net/cl-btree or a 
b-trie: http://www.naskitis.com/naskitis-vldbj09.pdf) that maps keys to 
triple-ids.  So a non-cached lookup via triple-id would hit the mmap'ed 
hash table, get an offset into the triple storage area, deserialize the 
triple into the cache and return the struct (or vector if we want to be 
really efficient).  A search via an index would return a triple-id, 
which would hit the cache and return the triple or pass through to the 
hash table and repeat the deserialization process described above. 
AllegroGraph will use up as much memory as is available for caching, and 
with good reason;  the more triples we keep in memory, the less 
slow-down we get.  We might even have two layers of cache:  cache 
triples themselves, as well as queries that map to offsets in the data 
file or to triple-ids.  Also, since the hash table will be fairly small 
(integers -> integers), loading the whole thing into memory should be 
possible.

>> its complexity.  VG does not need the level of complexity or abstraction
>> that you get with elephant's indexes and class redefinition logic.
>
> Naively, I would expect the MOP stuff to be a factor. Is it?

Yes, who needs the MOP in this circumstance?  I would prefer triples to 
be stored in memory as vectors (using the args to defstruct to force the 
use of vector storage) for the sake of efficiency.

>> In our
>> case, we are dealing with one class, the triple, and as such, we can be very
>> specific about how we store and index it as well as how we deal with it in
>> memory.  Standard b-trees simply won't efficiently handle the fanout of a
>> large triple store;  we need something specifically tuned to our purpose.
>
> Why not?

Think about what a triple really is:  each S, P or O is a completely 
unique thing in the database.  For the triple (Kevin likes cats), three 
symbols are created: 'Kevin, 'likes, and 'cats.  Another triple, say 
(Kevin likes dogs) only creates one new symbol: 'dogs.  There are not 
now two 'Kevin's in the db, but two triples that reference that one 
symbol.  When indexing something like this in a btree, each symbol is a 
node in the tree, and each level of the btree would correspond to a slot 
in the triple;  for example, to index in order of subject, predicate, 
object, the tree would be structured as

      S
     / \
    P   P
   / \   \
  O   O   O

Because nodes don't repeat and are atomic symbols, traversing the tree 
would be a linear search at each level.  In a b-trie 
(http://www.naskitis.com/naskitis-vldbj09.pdf), you could effectively 
string the S, P and O together and create more reasonably branching 
search paths.  For example, to index the two triples mentioned above 
with a third, (Kevin loves pizza), you would have a tree like:

                             K
                            /
                           E
                          /
                         V
                        /
                       I
                      /
                     N
                    /
                   L
                  / \
                 I   O
                /     \
               K       V
              /         \
             E           E
            /             \
           S               S
          / \               \
      CATS   DOGS            PIZZA
        /     \               \
       ID     ID               ID

This would also allow for substring matching in a very simple way.  Read 
the paper for more details and a comparison to B+ trees.

>> Another goal of mine is to develop a native Lisp back-end that projects like
>> elephant might be able to take advantage of;
> I would like to interject that while vivace-graph-v2 may not be
> targetted as a full blown Persistent Object Store it does have the
> potential to re-think some of the cool functionality of Statice using
> SPROG instead of an object hierarchy:
>
> http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/statice.html

I don't have time to look at this right now;  will revisit and comment 
later.

>> not relying on external,
>> non-Lisp libraries is a good thing, especially BerkeleyDB, given its
>> terrible licensing terms (thanks, Oracle).
>
> Its not just Oracle that have left BDB license in shambles...
>
> Regardless, I personally have a strong desire to keep integration with
> external (read non-lispy) tools to a minimum.

YES!

-K