[vivace-graph-devel] elephant

MON KEY monkey at sandpframing.com
Wed Sep 14 02:36:03 UTC 2011


Hi Dan & Kevin,

On Tue, Sep 13, 2011 at 12:15 PM, Kevin Raison <raison at chatsubo.net> wrote:
> Dan, I actually already tried using elephant at a very early stage in the
> development of VG.  While elephant is an excellent library (which I have
> used in many projects), it is simply too slow to be of use here. One of the
> goals of VG is to be fast and to be able to handle billions of triples.

What kind of value are you shooting for here?
E.g. What do you think is a reasonable value for X?

 (funcall #'(lambda (x) (format nil "~R triples?" (* 1000 1000000 x))) ???)

FWIW I would suggest that X need not be anything larger than 8 :)

and that realistically this is a more reasonable upper bound:

 (format nil "~R triples" #x7fffffff)

The `cl:sxhash' return value HASH-CODE is specified as a non-negative fixnum:

 ,----
 | The HASH-CODE is intended for hashing.  This places no verifiable
 | constraint on a conforming implementation, but the intent is that
 | an implementation should make a good-faith effort to produce
 | HASH-CODES that are well distributed within the range of
 | non-negative fixnums.
 `----
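For instance:

 (let ((code (sxhash '(:subject :predicate :object))))
   (list (typep code 'fixnum) (minusp code)))
 ;; => (T NIL)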

So, assuming the underlying Lisp does make a good-faith effort to
distribute hash-codes across the range [0, most-positive-fixnum], if
we wanted one gimongous linearly-hashed table of ~8 billion triple
thingies, wouldn't we need _at least_ a theoretical

   (format nil "~R fixnums" #X3FFFFFFFF)

for indexing such a gimongoid?

And if so, would we not run out of fixnums at:

 (format nil "somewhere around fixnum ~R" #x1FFFFFFF)

on an x86-32 SBCL?
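(Easy enough to sanity check on the build in question:)

 ;; on a stock x86-32 SBCL fixnums are 30-bit signed, so:
 (= most-positive-fixnum #x1FFFFFFF)    ; => T on x86-32, NIL on x86-64
 (integer-length most-positive-fixnum)  ; => 29 there, wider on x86-64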

In any event, assuming fixnums are the lightest possible serializable
key in a hash-table, and that the range of those keys has a
practical upper bound of #x1FFFFFFF on x86-32 SBCL and #x3FFFFFFFF
on x86-64, we're likely to require at least ~3 bytes of diskspace per
32bit key and ~7 bytes per 64bit key.

  (format nil
          "Noting that:~% ~R triples~% is a ~A bit number"
          #x3FFFFFFFF (integer-length #x3FFFFFFFF))

So serializing just the fixnum integer keys of
   (format nil "~R triples" #x7fffffff)
is liable to require

(format nil "Somewhere slightly less than ~D GB on a 32bit machine~%~
             and somewhere slightly less than ~D GB on a 64bit machine"
	 (nth-value 0 (round (* #x7fffffff 3) (* 1024 1024 1024)))
	 (nth-value 0 (round (* #x7fffffff 7) (* 1024 1024 1024))))

Please correct me if my math is out of whack?

Also, assuming these are reasonable amounts for serializing just the
hash-table fixnum integer keys of:
  (format nil "~R triples" #x7fffffff)

Is it not reasonable to assume that the above values might serve as a
good guidepost for the expected in-memory footprint of the
data-structures holding those:
 (format nil "~R triples" #x7fffffff)

Even with MMAPing, is there not still some significant overhead
associated with deserializing the MMAPped data to something Lispy?
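By "overhead" I mean that somewhere a loop like the following has to
walk the raw octets, consing a bignum for any value wider than a
fixnum (a minimal sketch, not anybody's actual deserializer):

 (defun octets-to-integer (octets)
   "Decode OCTETS as a little-endian unsigned integer."
   (loop for i from 0
         for byte across octets
         sum (ash byte (* 8 i))))

 (octets-to-integer #(#x39 #x05 #x00 #x00)) ; => 1337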


>  Elephant slows down very quickly because of a number of factors, including
> its use of BerkeleyDB rather than a native Lisp back-end store, as well as

Not to mention there are some licensing issues surrounding BDB...
 ... FSVO "some" ...

> its complexity.  VG does not need the level of complexity or abstraction
> that you get with elephant's indexes and class redefinition logic.

Naively, I would expect the MOP stuff to be a factor. Is it?
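E.g. the kind of quick check I'd reach for (a sketch; the class and
struct here are made up for illustration):

 (defclass clos-triple () ((s :initarg :s :accessor clos-triple-s)))
 (defstruct struct-triple s)

 (let ((c (make-instance 'clos-triple :s :subject))
       (s (make-struct-triple :s :subject)))
   ;; the CLOS accessor goes through generic-function dispatch
   (time (dotimes (i 10000000) (clos-triple-s c)))
   ;; the structure accessor compiles to a simple slot reference
   (time (dotimes (i 10000000) (struct-triple-s s))))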

> In our
> case, we are dealing with one class, the triple, and as such, we can be very
> specific about how we store and index it as well as how we deal with it in
> memory.  Standard b-trees simply won't efficiently handle the fanout of a
> large triple store;  we need something specifically tuned to our purpose.

Why not?

I'm under the impression that many of the big Linux distros will soon
release with Btrfs as the default file-system... and either Ted Ts'o
is sandbagging for Google with his recent endorsement of Btrfs over
ext4, or there must be at least some utility for B+trees :)

Also, what is a "standard" b-tree?

FTR I confuse myself when referencing b-trees :)
It might be helpful to establish some protocol for naming the
data-structures in question -- a wiki-link would suffice.

> am fairly certain that linear hashing for triple storage combined with
> b-tries or fb-trees for indexing would do much better.

While I'm not convinced that b-tries are TRT, I certainly agree that
linear hashing is (at least for in-memory data)!
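For anyone following along at home, by "linear hashing" I mean the
classic Litwin scheme: grow the table one bucket at a time with a
split pointer, so no single insert ever pays for a full rehash.  A
from-scratch toy sketch (emphatically NOT VG's code):

 (defconstant +lh-base+ 4)  ; initial number of buckets
 (defconstant +lh-load+ 3)  ; split when entries/buckets exceeds this

 (defstruct (lhash (:conc-name lh-) (:constructor make-lhash ()))
   (buckets (make-array +lh-base+ :adjustable t :fill-pointer t
                        :initial-element nil))
   (level 0)   ; how many times the address space has doubled
   (split 0)   ; next bucket to be split this round
   (count 0))  ; total entries stored

 (defun lhash-index (lh key)
   "Address KEY at the current level; buckets already split this
 round are re-addressed at the next level."
   (let* ((hash  (sxhash key))
          (index (mod hash (* +lh-base+ (ash 1 (lh-level lh))))))
     (if (< index (lh-split lh))
         (mod hash (* +lh-base+ (ash 1 (1+ (lh-level lh)))))
         index)))

 (defun lhash-get (lh key)
   (cdr (assoc key (aref (lh-buckets lh) (lhash-index lh key))
               :test #'equal)))

 (defun lhash-split (lh)
   "Split the bucket at the split pointer and advance it one step."
   (let* ((buckets (lh-buckets lh))
          (old     (shiftf (aref buckets (lh-split lh)) nil)))
     (vector-push-extend nil buckets)  ; add the new buddy bucket
     (incf (lh-split lh))
     (when (= (lh-split lh)            ; round complete: double up
              (* +lh-base+ (ash 1 (lh-level lh))))
       (incf (lh-level lh))
       (setf (lh-split lh) 0))
     (dolist (pair old)                ; redistribute the split entries
       (push pair (aref buckets (lhash-index lh (car pair)))))))

 (defun lhash-put (lh key value)
   (push (cons key value) (aref (lh-buckets lh) (lhash-index lh key)))
   (incf (lh-count lh))
   (when (> (lh-count lh)              ; over the load factor?
            (* +lh-load+ (fill-pointer (lh-buckets lh))))
     (lhash-split lh))
   value)

 ;; e.g.
 ;; (let ((lh (make-lhash)))
 ;;   (dotimes (i 100) (lhash-put lh i (* i i)))
 ;;   (lhash-get lh 9)) ; => 81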

FWIW, apropos all the recent UUID bit-vector timing junk I've posted
here recently, I figured it might be prudent to take some measurements
using just 128-bit integers for indexing...

I was pleasantly surprised to find that even with the relatively big
128-bit bignums, SBCL hash-table lookup is pretty damn snappy with a
large(ish) number of key/value pairs in the range of 500k-1M.

Indeed, once optimizations around the allocation of the underlying
hash-table are made by massaging the value given to MAKE-HASH-TABLE's
:SIZE keyword, it gets even better!
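For reference, the shape of the measurement (a sketch along the lines
of what I ran, not the verbatim code; RANDOM-128-KEY just stands in
for a real UUID):

 (defun random-128-key ()
   "A random 128-bit integer standing in for a UUID."
   (random (ash 1 128)))

 (defun time-bignum-table (n)
   "Fill an EQL table with N 128-bit bignum keys, pre-sized via
 :SIZE so it never rehashes, then time N lookups."
   (let ((table (make-hash-table :test 'eql :size n))
         (keys  (make-array n)))
     (dotimes (i n)
       (let ((key (random-128-key)))
         (setf (aref keys i) key
               (gethash key table) i)))
     (time (loop for key across keys
                 sum (gethash key table)))))

 ;; (time-bignum-table 1000000)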

> There are other graph dbs out there that use this strategy.  See
> http://blog.directededge.com/2009/02/27/on-building-a-stupidly-fast-graph-database/
> for some good discussion.

One of the bullet-point API implementation details I found interesting:

 "Items are identified by a string unique
  identifier which maps to an integer index."

Is this implying the effective equivalent of:
 (assoc "stringy-id" '(("stringy-id" . 123456789)) :test 'equal)

Or the inverse:
   (assoc 123456789 '((123456789 . "stringy-id")))

??
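In-memory I'd picture it as two tables rather than alists, something
like (a toy sketch, all names mine):

 (defvar *string->index* (make-hash-table :test 'equal))
 (defvar *index->string* (make-hash-table :test 'eql))
 (defvar *next-index* 0)

 (defun intern-string-id (string-id)
   "Map STRING-ID to a stable integer index, minting one if needed."
   (or (gethash string-id *string->index*)
       (let ((index (prog1 *next-index* (incf *next-index*))))
         (setf (gethash string-id *string->index*) index
               (gethash index *index->string*) string-id)
         index)))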

,----
| This is another point that we break from typical database design
| theory. In a typical database you’d look up a record by checking for
| it in a B-tree and then go off to find the data pointer for its
| record, which might have a continuation record at the end that you
| have to look more stuff up in … and so on. Our lookups are constant
| time. We hash the string identifier to get an Index and then use that
| Index to find the appropriate Offsets for its data. These vectors are
| sequential on disk, rather than using continuation tables or something
| similar, which makes constant time lookups possible.
`---- Section "File-based Data Structures: Stack, Vector, Linear Hash"

which doesn't sound entirely unlike the toy example of self-resolving
string-entities using `initialize-context'/`get-entity-in-context'
which I posted here the other day:

 http://lists.common-lisp.net/pipermail/vivace-graph-devel/2011-September/000008.html
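In file-backed form the constant-time Index -> Offset step would
amount to one seek into a fixed-width slot table, something like
(hypothetical sketch; READ-OFFSET is my name, not theirs):

 (defun read-offset (stream index)
   "Read the 8-byte little-endian Offset stored for INDEX.
 STREAM is an (unsigned-byte 8) file stream whose records are laid
 out sequentially, one 8-byte slot per Index."
   (file-position stream (* index 8))
   (loop for i from 0 below 8
         sum (ash (read-byte stream) (* 8 i))))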

>
> Another goal of mine is to develop a native Lisp back-end that projects like
> elephant might be able to take advantage of;

I would like to interject that while vivace-graph-v2 may not be
targeted as a full-blown Persistent Object Store, it does have the
potential to re-think some of the cool functionality of Statice using
SPROG instead of an object hierarchy:

http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/statice.html

> not relying on external,
> non-Lisp libraries is a good thing, especially BerkeleyDB, given its
> terrible licensing terms (thanks, Oracle).

It's not just Oracle that has left the BDB license in shambles...

Regardless, I personally have a strong desire to keep integration with
external (read: non-Lispy) tools to a minimum.



