From monkey at sandpframing.com Sun Sep 4 21:11:15 2011
From: monkey at sandpframing.com (MON KEY)
Date: Sun, 4 Sep 2011 17:11:15 -0400
Subject: [vivace-graph-devel] Recent Babel changes - discussion from github
Message-ID: 

Hello,

I recently commented on github w/r/t a recent commit which creates a new
dependency on Babel:

(URL `https://github.com/mon-key/vivace-graph-v2/commit/8a91ff8c52b87411bb9a816b3094c12d15ea69ed#commitcomment-568271')

As the devel mailing list is the more appropriate forum for this, I figure I
should forward my comments here; any follow-up should occur here.

Following is a transcript of the initial exchange:

,----
| mon-key added a note to 8a91ff8 repo owner
|
| Was there a specific reason for the change to
| babel:string-to-octets?
`----

,----
| kraison added a note to 8a91ff8
|
| portability. some day, when I actually get around to finishing this
| application, I would like it to be portable across Lisps. are you
| using vivace-graph? did this change cause any troubles?
`----

I don't find any immediate trouble :)

I understand that the move to Babel is one step towards removing reliances
on SBCL internals. This said, I would think `flexi-streams:octets-to-string'
might be a better choice, esp. in so much as the long-term goals /
requirements of vivace-graph-v2 are likely to eventually require some use of
a portable streams library...

At the very least, in so much as the callers of BABEL:OCTETS-TO-STRING and
BABEL:STRING-TO-OCTETS are (de)serializers immediately associated with
salza2 and chipz compression, it may be worth considering that both salza2
and chipz can frob octet streams directly and that flexi-streams is more
immediately capable of taking advantage of this than Babel's babel-streams.
A quick glance at the file header of babel/src/streams.lisp may convince you
that Babel is an inferior substitute for Flexi-Streams.
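A quick way to see the practical difference between the two libraries'
defaults (a sketch, not from the thread): Babel defaults string/octet
conversion to UTF-8, while flexi-streams defaults to Latin-1 and must be
told to use UTF-8 explicitly.

```lisp
;; Sketch comparing default encodings: Babel round-trips non-ASCII text
;; with no extra arguments (UTF-8 default), while flexi-streams needs an
;; explicit :EXTERNAL-FORMAT because its default is :LATIN1.
(let ((s "naïve"))
  (list
   ;; Babel: UTF-8 by default
   (string= s (babel:octets-to-string (babel:string-to-octets s)))
   ;; flexi-streams: request UTF-8 on both conversions
   (string= s (flexi-streams:octets-to-string
               (flexi-streams:string-to-octets s :external-format :utf-8)
               :external-format :utf-8))))
;; => (T T)
```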
;-}

FWIW I'm fooling around with a forked v2 - the mon_key branch holds the
current changes. My intent is to explore ways in which vivace-graph might
benefit from incorporation of my uuid system Unicly:

https://github.com/mon-key/unicly

The Unicly system has a dependency on flexi-streams's `string-to-octets'.
Likewise, the uuid system has a dependency on trivial-utf-8's
`string-to-octets'. To that end, my immediate concern is that there is
little need in providing an additional UTF-8 portability package.

Following are the respective signatures of three functions, each of which
satisfies nearly identical requirements: flexi-streams:octets-to-string,
babel:octets-to-string, and trivial-utf-8:utf-8-bytes-to-string:

;; sb-ext:octets-to-string
;;  (vector &key (start 0) end (external-format default))

;; flexi-streams:octets-to-string
;;  (sequence &key (start 0) (end (length sequence)) (external-format :latin1))

;; babel:octets-to-string
;;  (vector &key (start 0) end (encoding *default-character-encoding*)
;;          (errorp (not *suppress-character-coding-errors*)))

;; trivial-utf-8:utf-8-bytes-to-string
;;  (bytes-in &key (start 0) (end (length bytes-in)))

Also, note that where both the Unicly and uuid systems have a dependency on
ironclad, it might be possible to shoehorn an octets-to-string function
using the stuff in octet-streams.lisp, esp. where Ironclad _appears_ to
provide a fairly portable and self-contained implementation of Gray Stream
binary streams already.

From monkey at sandpframing.com Sun Sep 4 21:28:20 2011
From: monkey at sandpframing.com (MON KEY)
Date: Sun, 4 Sep 2011 17:28:20 -0400
Subject: [vivace-graph-devel] triple-equal implemented as v1 uuid comparison? -- issue 1 on github
Message-ID: 

"triple-equal implemented as v1 uuid comparison?"

I'm forwarding a transcript of discussion occurring on github as
vivace-graph-v2 issue "triple-equal implemented as v1 uuid comparison?"
(URL `https://github.com/kraison/vivace-graph-v2/issues/1')

Hopefully this will facilitate continuing the discussion here on the
mailing list instead.

,----
| danlentz opened this issue about 11 hours ago
| triple-equal implemented as v1 uuid comparison?
|
| Does that yield the appropriate semantics? For example if i happen to
| compare it with a triple with the same s p o from a different graph,
| i'd expect them to unify. What prevents duplicate triples from leaking
| into the same graph?
`----

,----
| kraison commented
|
| Dan, would you mind bringing this discussion over to the mailing
| list? You can join it here:
| http://common-lisp.net/project/vivace-graph/
`----

,----
| kraison commented
|
| take a look at add-triple in triples.lisp to see how duplicates are
| avoided. as for unification, there is not an easy answer to how
| triples from different graphs should unify. calling the basic data
| type in VG a triple is perhaps a misnomer; it is really a quint, with
| the graph as one of the 5 slots (the other 4 being s, p, o and
| id). so, i think that if you want to unify across graphs there should
| be some sort of special functor for that purpose, otherwise you
| confound the notion of equality. i see triple-equal as being more akin
| to Lisp's EQL, which compares addresses and not content; the triple's
| uuid is essentially its address. this is of course debatable and i am
| happy to hear dissenting opinions. :) also, see the definitions of
| q-/4 and q-/3 in prolog-functors.lisp for how unification is done.
`----

,----
| danlentz commented
|
| wouldn't that be more equivalent to triple-eq? You're only matching
| on a single instance of the triple. If you delete (really delete i
| mean) and re-add the same spog they wont be eql. wilbur interns
| quads by string comparison of spo constituent node-uri and
| de.setf.resource does sha1 hex-id on the byte array. Well i'll play
| around & also make sure to join the mailing list too.
|
| Enjoy your vacation I look forward to talking again.
`----

,----
| mon-key commented
|
| Hi kraison and danlentz
|
| I think there is room for lots of discussion around VG's
| triple-equality and I would love to contribute and learn more.
|
| Can we take this up on the vivace-graph-devel mailing list?
|
| Tracking communication on github across multiple repos and branches is
| a PITA :)
`----

From raison at chatsubo.net Sun Sep 4 21:27:19 2011
From: raison at chatsubo.net (Kevin Raison)
Date: Sun, 04 Sep 2011 14:27:19 -0700
Subject: [vivace-graph-devel] triple-equal semantics
Message-ID: <4E63ED37.8010605@chatsubo.net>

From the github discussion:

Dan: triple-equal implemented as v1 uuid comparison? Does that yield the
appropriate semantics? For example if i happen to compare it with a triple
with the same s p o from a different graph, i'd expect them to unify. What
prevents duplicate triples from leaking into the same graph?

Kevin: take a look at add-triple in triples.lisp to see how duplicates are
avoided. as for unification, there is not an easy answer to how triples
from different graphs should unify. calling the basic data type in VG a
triple is perhaps a misnomer; it is really a quint, with the graph as one
of the 5 slots (the other 4 being s, p, o and id). so, i think that if you
want to unify across graphs there should be some sort of special functor
for that purpose, otherwise you confound the notion of equality. i see
triple-equal as being more akin to Lisp's EQL, which compares addresses and
not content; the triple's uuid is essentially its address. this is of
course debatable and i am happy to hear dissenting opinions. :) also, see
the definitions of q-/4 and q-/3 in prolog-functors.lisp for how
unification is done.

Dan: wouldn't that be more equivalent to triple-eq? You're only matching on
a single instance of the triple. If you delete (really delete i mean) and
re-add the same spog they wont be eql.
wilbur interns quads by string comparison of spo constituent node-uri and
de.setf.resource does sha1 hex-id on the byte array. Well i'll play around
& also make sure to join the mailing list too. Enjoy your vacation I look
forward to talking again.

-K

From monkey at sandpframing.com Wed Sep 7 08:27:21 2011
From: monkey at sandpframing.com (MON KEY)
Date: Wed, 7 Sep 2011 04:27:21 -0400
Subject: [vivace-graph-devel] every time we UUID 128 bits die down the bit hole
Message-ID: 

While reviewing Franz's documentation of Agraph's UPIs:

http://www.franz.com/agraph/support/documentation/current/lisp-reference.html#function.make-upi

it occurred to me that vivace-graph-v2 should consider using Unicly
https://github.com/mon-key/unicly rather than the current uuid library.

Obviously I'm biased :)

In any event, it's pretty clear that Franz is using some form of UUID
truncated from 16 to 12 bytes for maintaining triple identity.

What isn't clear is whether the top four bytes are needed for type
addressing by the underlying Lisp or if the decision had more to do with a
performance bottleneck with frobbing ~128bit normative UUIDs (e.g. as per
RFC 4122).

Regardless, vivace-graph-v2 should move away from uuid:make-v1-uuid (it's
slow, ugly, and buggy). I would suggest that there may be some significant
gains to be had by:

a) taking advantage of Unicly's fast v3 and v5 UUID generation. I'm
   convinced that vivace-graph-v2 could benefit by caching UUID namespaces
   for its various triple indexes and using these to generate v3/v5 UUIDs
   instead of the current scheme of constantly hashing up disposable UUIDs
   by banging on the system clock!

b) utilizing Unicly's ability to convert UUIDs to/from various
   representations. It might be possible to extend Unicly's bit-vector UUID
   representation out beyond 128 bits in order to allow triples to carry
   type information. Tacking one more octet (#*11111111) onto a Unicly UUID
   bit-vector would buy a lot of space to address types.
Likewise, taking advantage of Unicly's ability to convert UUIDs to integer
values would probably aid certain Btree schemes by branching on numeric
greater-than/less-than as opposed to lexical schemes which frob
string-greater/string-less-than comparisons.

From raison at chatsubo.net Wed Sep 7 19:53:54 2011
From: raison at chatsubo.net (Kevin Raison)
Date: Wed, 07 Sep 2011 12:53:54 -0700
Subject: [vivace-graph-devel] every time we UUID 128 bits die down the bit hole
In-Reply-To: 
References: 
Message-ID: <4E67CBD2.6060908@chatsubo.net>

I am convinced that this is an excellent idea; I also noticed that you have
been working on it in your github branch of VG. Let me know when it is
ready to merge into the mainline so that we can play around.

Also, to justify the original use of make-v1-uuid: it was simply easy and
worked well enough. Now that there are others interested in this project,
it is definitely time to use a better solution.

Cheers,
Kevin

On 9/7/11 1:27 AM, MON KEY wrote:
> While reviewing Franz's documentation of Agraph's UPIs:
>
> http://www.franz.com/agraph/support/documentation/current/lisp-reference.html#function.make-upi
>
> it occured to me that vivace-graph-v2 should consider using Unicly
> https://github.com/mon-key/unicly rather the current uuid library.
>
> Obviously I'm biased :)
>
> In any event, its pretty clear that Franz is using some form of UUID
> truncated from 16 to 12 bytes for maintaining triple identity.
>
> What isn't clear is whether the top four bytes are needed for type
> addressing by the underlying Lisp or if the decision had more to do
> with a performance bottleneck with frobbing ~128bit normative UUIDs
> (e.g. as per RFC 4122).
>
> Regardless, vivace-graph-v2 should move away from uuid:make-v1-uuid
> (its slow, ugly, and buggy) I would suggest that there may be some
> significant gains to be had by:
>
> a) taking advantage of Unicly's fast v3 and v5 UUID generation I'm
>    convinced that vivace-graph-v2 could benefit by caching UUID
>    namespaces for its various triple indexes and using these to
>    generate v3/v5 UUIDs instead of the current scheme of constantly
>    hashing up disposable UUIDs by banging on the system clock!
>
> b) utilizng Unicly's ability to convert UUIDs to/from various
>    representations it might be possible to extend Unicly's bit-vector
>    UUID representation out beyond 128 bits in order to allow triples
>    to carry type information. Tacking one more octet (#*11111111)
>    onto a Unicly UUID bit-vector would buy a lot of space to address
>    types. Likewise, taking advantage of Unicly's ability to convert
>    UUIDs to integer values would prob. aid certain Btree schemes by
>    branching on numeric greater/lessthan as opposed to lexical
>    schemes which frob string-greater/string-lessthan
>
> _______________________________________________
> vivace-graph-devel mailing list
> vivace-graph-devel at common-lisp.net
> http://lists.common-lisp.net/cgi-bin/mailman/listinfo/vivace-graph-devel

From monkey at sandpframing.com Fri Sep 9 07:48:15 2011
From: monkey at sandpframing.com (MON KEY)
Date: Fri, 9 Sep 2011 03:48:15 -0400
Subject: [vivace-graph-devel] every time we UUID 128 bits die down the bit hole
In-Reply-To: <4E67CBD2.6060908@chatsubo.net>
References: <4E67CBD2.6060908@chatsubo.net>
Message-ID: 

On Wed, Sep 7, 2011 at 3:53 PM, Kevin Raison wrote:
> I am convinced that this is an excellent idea; I also noticed that you
> have been working on it in your github branch of VG. Let me know when
> it is ready to merge into the mainline so that we can play around.
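As a minimal sketch of the namespace-caching idea from point (a) above (the
namespace name "vivace-graph.example" is made up for illustration):

```lisp
;; Sketch: mint a namespace UUID once, then derive deterministic v5
;; UUIDs from it.  Unlike clock-driven v1 UUIDs, the same name in the
;; same namespace always hashes to the same UUID, so no system-clock
;; banging is involved.  The namespace string here is hypothetical.
(defvar *triple-namespace*
  (unicly:make-v5-uuid unicly:*uuid-namespace-dns* "vivace-graph.example"))

(unicly:uuid-eql
 (unicly:make-v5-uuid *triple-namespace* "some-subject")
 (unicly:make-v5-uuid *triple-namespace* "some-subject"))
;; => T -- v5 generation is a pure function of namespace and name
```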
Following illustrates some possible utility of Unicly w/r/t indexes
(requires Unicly from Git):

(defvar *global-entity-index* (make-hash-table :test 'unicly:uuid-eql))

(defconstant +context-Z-namespace-as-ub128+ 317192554773903544674993329975922389959)
(defconstant +context-Y-namespace-as-ub128+ 003012593477302450121124084036000723448)

(defvar *context-Z* '())
(defvar *context-Y* '())

(defclass context ()
  ((namespace       :reader namespace)
   (namespace-uuid  :reader namespace-uuid)
   (namespace-table :reader namespace-table)
   (namespace-index :reader namespace-index)))

(defun initialize-context (integer global-idx)
  (let ((instance (make-instance 'context)))
    (setf (slot-value instance 'namespace)
          (unicly:uuid-from-bit-vector
           (unicly::uuid-integer-128-to-bit-vector integer)))
    (setf (slot-value instance 'namespace-uuid)
          (unicly:make-v5-uuid (namespace instance)
                               (unicly:uuid-princ-to-string (namespace instance))))
    (setf (slot-value instance 'namespace-table)
          (make-hash-table :test 'unicly:uuid-eql))
    (setf (slot-value instance 'namespace-index) global-idx)
    (setf (gethash (namespace instance) (namespace-index instance))
          (namespace-uuid instance))
    (setf (gethash (namespace-uuid instance) (namespace-index instance))
          (namespace-table instance))
    instance))

(defun get-entity-in-context (string-entity context-instance &key (set-if-not nil))
  (declare (string string-entity)
           (boolean set-if-not)
           (context context-instance))
  (let ((entity-uuid (unicly:make-v5-uuid (namespace context-instance) string-entity))
        (index (namespace-index context-instance))
        (did-set '()))
    (labels ((get-global-entity-uuid ()
               (gethash entity-uuid index))
             (set-global-entity-uuid ()
               (setf (gethash entity-uuid index) string-entity
                     did-set t)
               entity-uuid)
             (unset-whatset-global-entity ()
               (remhash entity-uuid index)
               (setf did-set nil)
               (return-from get-entity-in-context (values nil nil)))
             (global-entity-chk ()
               (let ((entity-if (get-global-entity-uuid)))
                 (etypecase entity-if
                   (null (if set-if-not
                             (set-global-entity-uuid)
                             (return-from get-entity-in-context nil)))
                   (string (if (string= entity-if string-entity)
                               entity-uuid
                               (return-from get-entity-in-context
                                 (when set-if-not (values nil did-set))))))))
             (deref-context-table ()
               (let* ((entity-chk (global-entity-chk))
                      (context-chk (gethash (namespace context-instance) index))
                      (context-deref
                       (if context-chk
                           ;; we have the uuid of context-namespace
                           (gethash context-chk index)
                           (cond (did-set (unset-whatset-global-entity))
                                 (t (return-from get-entity-in-context nil)))))
                      (table-deref
                       (if context-deref
                           ;; we have the associated context hash-table
                           context-deref
                           (cond (did-set (unset-whatset-global-entity))
                                 (t (return-from get-entity-in-context nil)))))
                      (table-get-if (gethash entity-chk table-deref)))
                 (if table-get-if
                     table-get-if
                     (when set-if-not
                       (setf (gethash entity-chk table-deref) string-entity
                             did-set t))))))
      (if set-if-not
          (values (deref-context-table) did-set)
          (deref-context-table)))))

(setf *context-Z* (initialize-context +context-Z-namespace-as-ub128+ *global-entity-index*))
(setf *context-Y* (initialize-context +context-Y-namespace-as-ub128+ *global-entity-index*))

(setf (gethash (unicly:make-v5-uuid (namespace *context-Z*) "ENTITY-W")
               (namespace-index *context-Z*))
      "ENTITY-W")
(setf (gethash (unicly:make-v5-uuid (namespace *context-Z*) "ENTITY-W")
               (namespace-table *context-Z*))
      "ENTITY-W")
(setf (gethash (unicly:make-v5-uuid (namespace *context-Y*) "ENTITY-W")
               (namespace-index *context-Y*))
      "ENTITY-W")
(setf (gethash (unicly:make-v5-uuid (namespace *context-Y*) "ENTITY-W")
               (namespace-table *context-Y*))
      "ENTITY-W")

(get-entity-in-context "ENTITY-W" *context-Z*)
(get-entity-in-context "ENTITY-W" *context-Y*)
(get-entity-in-context "ENTITY-O" *context-Z*)
(get-entity-in-context "ENTITY-O" *context-Z* :set-if-not t)
(get-entity-in-context "ENTITY-O" *context-Y*)
(get-entity-in-context "ENTITY-O" *context-Y* :set-if-not t)

> Also, to justify the original use of make-v1-uuid: it was simply easy
> and worked well enough.
> Now that there are others interested in this
> project, it is definitely time to use a better solution.

Great! I'm glad you agree.

FTR, following are the timings I get using the uuid library from Quicklisp
comparing make-v1-uuid with make-v4-uuid:

(sb-ext:gc :full t)
(time (dotimes (i 10000) (uuid:make-v1-uuid)))
;=> Evaluation took:
;     8.396 seconds of real time
;     0.500924 seconds of total run time (0.326950 user, 0.173974 system)
;     5.97% CPU
;     25,125,295,748 processor cycles
;     8,235,056 bytes consed

(sb-ext:gc :full t)
(time (dotimes (i 10000) (uuid:make-v4-uuid)))
;=> Evaluation took:
;     0.020 seconds of real time
;     0.018998 seconds of total run time (0.017998 user, 0.001000 system)
;     95.00% CPU
;     58,466,168 processor cycles
;     2,311,272 bytes consed

For posterity here's an illustration of _why_ make-v1-uuid is such a dog:

(in-package #:uuid)

;; Redefine uuid::get-timestamp to show us when it calls sleep:
(let ((uuids-this-tick 0)
      (last-time 0)
      ;; add a counter to track how many times sleep has been called
      (sleep-count 0))
  (defun get-timestamp ()
    "Get timestamp, compensate nanoseconds intervals"
    (unwind-protect
         (tagbody
          restart
            (let ((time-now (+ (* (get-universal-time) 10000000)
                               100103040000000000)))
              ;; 10010304000 is time between 1582-10-15 and 1900-01-01 in seconds
              (cond ((not (= last-time time-now))
                     (setf uuids-this-tick 0
                           last-time time-now)
                     (return-from get-timestamp time-now))
                    (T
                     (cond ((< uuids-this-tick *ticks-per-count*)
                            (incf uuids-this-tick)
                            (return-from get-timestamp (+ time-now uuids-this-tick)))
                           (T
                            ;; add a logging form to show us how many times
                            ;; we sleep per invocation:
                            (format t "slept count: ~D~%" (incf sleep-count))
                            (sleep 0.0001)
                            (go restart)))))))
      (setf sleep-count 0))))

(dotimes (i 10000) (uuid:make-v1-uuid))

`uuid:make-v1-uuid' relies on `uuid::get-timestamp', which evaluates
(sleep 0.0001) quite a bit (at least on my machine, 32bit x86 running SBCL
1.50).

From monkey at sandpframing.com Sun Sep 11 20:00:54 2011
From: monkey at sandpframing.com (MON KEY)
Date: Sun, 11 Sep 2011 16:00:54 -0400
Subject: [vivace-graph-devel] Recent Babel changes - discussion from github
In-Reply-To: 
References: 
Message-ID: 

vivace-graph-v2 + UTF-8

Per kraison's recent inability to build Unicly with SBCL on MacOS and
Lispworks, it may be worth considering now how vivace-graph-v2 will handle
similar issues.

At issue is whether vivace-graph-v2 should constrain all string data to be
composed of characters encoded as UTF-8. I would think that this is a
reasonable constraint to apply given that vivace-graph-v2 is intent on
targeting RDF-aware and/or RDF-like applications where UTF-8 has guaranteed
ubiquity. Indeed, for my part I have a direct and standing need to
represent triple subjects and objects as string data in character sets with
encodings that extend beyond ASCII and LATIN-1, and will consider it a deal
breaker if vivace-graph-v2 is unable to reliably handle UTF-8.

Regardless, to the extent that it is deemed desirable for vivace-graph-v2
to enforce UTF-8 constraints around the string data it manipulates, it is
worth considering how the current system might reliably and reasonably
enforce such a constraint should the underlying system prove incapable of
internally handling UTF-8 character encodings.

As it stands now, a method `serialize' in vivace-graph/serialize.lisp
relies on `babel:string-to-octets' and two `deserialize' methods in
vivace-graph-v2/deserialize.lisp rely on `babel:octets-to-string'.
Currently vivace-graph-v2 has a dependency on the Babel system for
converting strings to/from octets via the Babel functions
`babel:octets-to-string' and `babel:string-to-octets', which each default
their :ENCODING keyword argument to the value of
`babel-encodings:*default-character-encoding*', which itself defaults to
:UTF-8.
IOW, unless explicitly specified otherwise, both `babel:octets-to-string'
and `babel:string-to-octets' will default all string/octet conversions to
:UTF-8 and error in the event that the defaulting behaviour is not
supported by the underlying lisp implementation, as per their defaulting
keyword forms:

(errorp (not *suppress-character-coding-errors*))

In any event, there may be some potential for vivace-graph-v2's
serialization/deserialization routines to fail at inopportune moments given
the following from the file header of babel/src/strings.lisp:

,----
| The usefulness of this string/octets interface of Babel's is very
| limited on Lisps with 8-bit characters which will in effect only
| support the latin-1 subset of Unicode. That is, all encodings are
| supported but we can only store the first 256 code points in Lisp
| strings. Support for using other 8-bit encodings for strings on
| these Lisps could be added with an extra encoding/decoding step.
| Supporting other encodings with larger code units would be silly
| (it would break expectations about common string operations) and
| better done with something like Closure's runes.
`----

The Closure system has a direct dependency on Babel and an indirect
dependency on Flexi-Streams via Closure-html. Which is to say, there is no
reason why either the Babel or Flexi-Streams system should be preferred
over the other insofar as both are likely to remain dependencies of the
vivace-graph-v2 system.

As mentioned already, my personal preference w/r/t UTF-8 and
character-encoding/character conversion interop is for the Flexi-Streams
system and not the Babel system. This preference (mostly trivial) is by no
means a knock on the Babel system, and mostly amounts to my belief that
Flexi has argument signatures more transparently equivalent to the
corresponding SBCL procedures.
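To make the defaulting behaviour concrete, here is a sketch (not from the
thread) of how Babel reacts to an undecodable octet sequence; the special
variable is Babel's, the surrounding code and helper name are illustrative
only.

```lisp
;; Sketch: #xFF alone is not a valid UTF-8 sequence, so with Babel's
;; default (errorp t) the decode signals a condition; binding
;; BABEL-ENCODINGS:*SUPPRESS-CHARACTER-CODING-ERRORS* to T makes Babel
;; substitute a replacement character instead of signalling.
(defun try-decode (octets)
  (handler-case (babel:octets-to-string octets)
    (error (c) (list :decoding-error (type-of c)))))

(let ((bad (make-array 1 :element-type '(unsigned-byte 8)
                         :initial-element #xFF)))
  (list (try-decode bad)                              ; condition, captured
        (let ((babel-encodings:*suppress-character-coding-errors* t))
          (babel:octets-to-string bad))))             ; replacement char
```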
vivace-graph-v2 already has an indirect dependency on flexi-streams via its
dependency on the hunchentoot system (albeit a currently un-needed one);
Hunchentoot in turn has a dependency on flexi-streams. In the event that
vivace-graph-v2 should ever incorporate a direct mechanism for frobbing RDF
data, such a mechanism is very likely to necessitate a dependency on the
CXML system, which currently has a pre-existing dependency on the
Closure-Html system, which in turn currently has a dependency on the
Flexi-Streams system.

From monkey at sandpframing.com Mon Sep 12 03:53:20 2011
From: monkey at sandpframing.com (MON KEY)
Date: Sun, 11 Sep 2011 23:53:20 -0400
Subject: [vivace-graph-devel] Rucksack Btree number-indexes on UUID integer-128 representations -- a sketch
Message-ID: 

Utilizing the idea sketched below could get us much of the way towards a
functional implementation of Rucksack's Btree number-indexes on UUIDs...

Following is a toy prototype which extends Unicly's UUID class
UNIQUE-UNIVERSAL-IDENTIFIER by subclassing UUID-INDEXABLE-V5 and adding two
new slots which "self host" bit-vector and integer-128 representations.
These are populated in an after method specialized on the class
UUID-INDEXABLE-V5, e.g.
with an index spec like:

(btree :key< uuid-< :value= uuid-eql :value-type persistent-object)

(defclass uuid-indexable-v5 (unicly:unique-universal-identifier)
  ((bit-vector  :reader bit-vector-of-uuid)
   (integer-128 :reader integer-128-of-uuid)))

(defmethod initialize-instance :after ((obj uuid-indexable-v5) &key &allow-other-keys)
  (setf (slot-value obj 'bit-vector)
        (unicly:uuid-to-bit-vector obj))
  (setf (slot-value obj 'integer-128)
        (unicly::uuid-bit-vector-to-integer (slot-value obj 'bit-vector))))

(declaim (inline digested-v5-uuid-indexed))
(defun digested-v5-uuid-indexed (v5-digest-byte-array)
  (declare (type unicly::uuid-byte-array-20 v5-digest-byte-array)
           (inline unicly::%uuid_time-low-request
                   unicly::%uuid_time-mid-request
                   unicly::%uuid_time-high-and-version-request
                   unicly::%uuid_clock-seq-and-reserved-request
                   unicly::%uuid_node-request)
           (optimize (speed 3)))
  (the uuid-indexable-v5
       (make-instance 'uuid-indexable-v5
                      :%uuid_time-low (unicly::%uuid_time-low-request v5-digest-byte-array)
                      :%uuid_time-mid (unicly::%uuid_time-mid-request v5-digest-byte-array)
                      :%uuid_time-high-and-version
                      (unicly::%uuid_time-high-and-version-request v5-digest-byte-array 5)
                      :%uuid_clock-seq-and-reserved
                      (unicly::%uuid_clock-seq-and-reserved-request v5-digest-byte-array)
                      :%uuid_clock-seq-low
                      (the unicly::uuid-ub8
                           (unicly::%uuid_clock-seq-low-request v5-digest-byte-array))
                      :%uuid_node (unicly::%uuid_node-request v5-digest-byte-array))))

(defun make-v5-uuid-indexed (namespace name)
  (declare (type string name)
           (type unicly:unique-universal-identifier namespace)
           (inline unicly::uuid-digest-uuid-instance digested-v5-uuid-indexed)
           (optimize (speed 3)))
  (the (values uuid-indexable-v5 &optional)
       (digested-v5-uuid-indexed
        (the unicly::uuid-byte-array-20
             (unicly::uuid-digest-uuid-instance 5 namespace name)))))

(defparameter *tt--indexed* (make-v5-uuid-indexed unicly:*uuid-namespace-dns* "bubba"))
; => *TT--INDEXED*

(bit-vector-of-uuid *TT--INDEXED*)
;=>
#*1110111010100001000100000101111000...

(integer-128-of-uuid *TT--INDEXED*)
;=> 317192554773903544674993329975922389959

(unicly:unique-universal-identifier-p *tt--indexed*)
;=> T

(unicly:uuid-princ-to-string *tt--indexed*)
;=> "eea1105e-3681-5117-99b6-7b2b5fe1f3c7"

(unicly::uuid-to-byte-array *tt--indexed*)
;=> #(238 161 16 94 54 129 81 23 153 182 123 43 95 225 243 199)

(unicly::uuid-from-bit-vector (bit-vector-of-uuid *tt--indexed*))
;=> eea1105e-3681-5117-99b6-7b2b5fe1f3c7

(describe *TT--INDEXED*)
; => eea1105e-3681-5117-99b6-7b2b5fe1f3c7
;    [standard-object]
;
;    Slots with :INSTANCE allocation:
;    { ... %uuid_ slots elided ... }
;    BIT-VECTOR = #*11101110101000010001000001011110001101101000000101010001000101111001..
;    INTEGER-128 = 317192554773903544674993329975922389959

From monkey at sandpframing.com Mon Sep 12 04:38:51 2011
From: monkey at sandpframing.com (MON KEY)
Date: Mon, 12 Sep 2011 00:38:51 -0400
Subject: [vivace-graph-devel] Rucksack Btree number-indexes on UUID integer-128 representations -- a sketch
In-Reply-To: 
References: 
Message-ID: 

Some timings using the procedures in unicly/unicly-timings.lisp.

First we populate an array of 1mil elts, each a random-length string of 1
to 36 randomly chosen UTF-8 characters. Next we get a baseline timing for
unicly::make-v5-uuid by iterating over that array. Our baseline shows that
unicly::make-v5-uuid can generate approx. 79611.50 v5 UUIDs per second.

Following that we iterate over the same array with `make-v5-uuid-indexed'
as per the definition from the previous message. This measurement indicates
we can generate approx. 28376.04 v5 UUIDs per second, where each of these
UUID objects contains a cache of its 128-bit integer representation as well
as an equivalent bit-vector.

IOW, the speed cost of minting a v5 UUID with the two additional cached
slots is approx 2.8x the un-cached version.
This doesn't seem an insurmountable cost given that we would no longer have
to worry about performing a conversion when walking the Btree for
lookup/insertion.

Timings follow. On SBCL 1.0.51.28-42fbc5e x86-32 on Linux using recent
Unicly from Github.

(loop for x from 0 below 1000000
      do (setf (aref *tt--rnd* x) (make-random-string 36)))

(generic-gc)
(time (loop for x across *tt--rnd*
            do (unicly::make-v5-uuid unicly::*uuid-namespace-dns* x)))
; Evaluation took:
;   12.561 seconds of real time
;   12.549092 seconds of total run time (12.526096 user, 0.022996 system)
;   [ Run times consist of 0.837 seconds GC time, and 11.713 seconds non-GC time. ]
;   99.90% CPU
;   20,882,950,670 processor cycles
;   961,227,776 bytes consed

(format nil "~,2F v5 UUIDs per second" (/ 1000000 12.561))
;=> "79611.50 v5 UUIDs per second"

(generic-gc)
(time (loop for x across *tt--rnd*
            do (make-v5-uuid-indexed unicly::*uuid-namespace-dns* x)))
; Evaluation took:
;   35.241 seconds of real time
;   35.243642 seconds of total run time (35.156655 user, 0.086987 system)
;   [ Run times consist of 3.851 seconds GC time, and 31.393 seconds non-GC time. ]
;   100.01% CPU
;   58,586,949,330 processor cycles
;   4,101,235,976 bytes consed

(format nil "~,2F v5 UUIDs per second" (/ 1000000 35.241))
;=> "28376.04 v5 UUIDs per second"

From monkey at sandpframing.com Mon Sep 12 22:56:48 2011
From: monkey at sandpframing.com (MON KEY)
Date: Mon, 12 Sep 2011 18:56:48 -0400
Subject: [vivace-graph-devel] Rucksack Btree number-indexes on UUID integer-128 representations -- a sketch
In-Reply-To: 
References: 
Message-ID: 

I've made some more timings to initially gauge how long it might take to do
a "key-<" lookup on a v4 UUID bit-vector. Also, I've included some examples
which indicate where fanout in a Btree of v4-uuids should occur.
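Relatedly, a "key-<" over full 128-bit UUID bit-vectors can be sketched as
a plain MSB-first lexicographic comparison (a hypothetical helper, not part
of Unicly or of the code below):

```lisp
;; Sketch: compare two 128-bit UUID bit-vectors MSB-first.  Returns T
;; when A orders strictly before B -- i.e. at the first differing bit
;; position A has 0 where B has 1.  Suitable as a Btree :key< over
;; uuid-to-bit-vector representations.
(defun uuid-bit-vector-< (a b)
  (declare (type simple-bit-vector a b))
  (loop for i from 0 below 128
        for bit-a = (sbit a i)
        for bit-b = (sbit b i)
        when (/= bit-a bit-b)
          do (return (< bit-a bit-b))
        finally (return nil)))

;; e.g. (uuid-bit-vector-< (unicly:uuid-to-bit-vector uuid-1)
;;                         (unicly:uuid-to-bit-vector uuid-2))
```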
Following parameter is used by function `find-first-bit' defined below:

(defparameter *bit-vector-bit-table* (make-array 129))

Add 128 bit-vectors, each with one bit set at an offset in the range 0-127
(plus a zeroed vector at index 128):

(flet ((set-one-bit (idx)
         (let ((bv (make-array 128 :element-type 'bit)))
           (setf (sbit bv idx) 1)
           bv)))
  (loop for idx from 0 below 128
        do (setf (aref *bit-vector-bit-table* idx) (set-one-bit idx))
        finally (setf (aref *bit-vector-bit-table* 128)
                      (unicly::uuid-bit-vector-128-zeroed))))

;; Find the first non-zero bit in UUID-BV. This function is slower
;; than `find-first-bit-by-bit' but is more readily adaptable for use
;; with Btrees where there is a requirement to descend nodes. Walk
;; each bit-vector bv-N in *bit-vector-bit-table* taking the
;; `cl:bit-and' of bv-N and UUID-BV. We put the return value of each
;; `cl:bit-and' on the local loop var maybe-not-zeroed. As soon as
;; maybe-not-zeroed is not equal to the equivalent of
;; (unicly::uuid-bit-vector-128-zeroed) we return the index of the
;; non-zero bit found.
(defun find-first-bit (uuid-bv)
  (loop with always-zeroed = (aref *bit-vector-bit-table* 128)
        with maybe-not-zeroed = (unicly::uuid-bit-vector-128-zeroed)
        for bv across *bit-vector-bit-table*
        for cnt from 0 below 128
        do (bit-and bv uuid-bv maybe-not-zeroed)
        until (not (equal always-zeroed maybe-not-zeroed))
        finally (return cnt)))

;; Find the first non-zero bit in UUID-BV and return its position.
(defun find-first-bit-by-bit (uuid-bv)
  (loop for x across uuid-bv
        for y from 0 below 128
        when (plusp (sbit uuid-bv y))
        do (return y)))

This example shows a toy "bv-key->" in which we test only the MSB bit (the
0 bit) for two v4 UUID bit-vectors. Note, we still have +/- 127 more bits
to traverse before we might find which node the bit-vector belongs to:

(loop repeat 100
      collect (> (find-first-bit-by-bit
                  (unicly::uuid-to-bit-vector (unicly::make-v4-uuid)))
                 (find-first-bit-by-bit
                  (unicly::uuid-to-bit-vector (unicly::make-v4-uuid)))))

;; An array of 100k elts.
;; We use it to find the distribution of
;; first-bit non-zero bits in v4 UUIDs.
(defparameter *tt--sample-bv-array* (make-array 100000))

;; Populate the array of variable *tt--sample-bv-array* with 100k new
;; v4 uuid bit-vectors.
(defun make-new-random-uuid-table ()
  (loop for x from 0 below 100000
        do (setf (aref *tt--sample-bv-array* x)
                 (unicly:uuid-to-bit-vector (unicly:make-v4-uuid)))))

;; Do it now.
(make-new-random-uuid-table)
;; (aref *tt--sample-bv-array* 99999)

Timing for `find-first-bit' for all 100k elts of `*tt--sample-bv-array*':

(sb-ext:gc :full t)
(time (loop for x from 0 below 100000
            do (find-first-bit (aref *tt--sample-bv-array* x))))
;; Evaluation took:
;;   0.049 seconds of real time
;;   0.048993 seconds of total run time (0.041994 user, 0.006999 system)
;;   100.00% CPU
;;   147,111,712 processor cycles
;;   4,798,088 bytes consed

;; Timing for `find-first-bit-by-bit' for all 100k elts of
;; `*tt--sample-bv-array*':
(sb-ext:gc :full t)
(time (loop for x from 0 below 100000
            do (find-first-bit-by-bit (aref *tt--sample-bv-array* x))))
;; Evaluation took:
;;   0.009 seconds of real time
;;   0.007999 seconds of total run time (0.007999 user, 0.000000 system)
;;   88.89% CPU
;;   27,299,580 processor cycles
;;   0 bytes consed

;; Evaluating `get-random-uuid-table-distribution' should give an idea
;; of the initial fanout we might expect.
(defun get-random-uuid-table-distribution ()
  (let ((cnt-table (make-hash-table)))
    (loop for x from 0 below 128
          do (setf (gethash x cnt-table) 0))
    (loop initially (make-new-random-uuid-table)
          for x from 0 below 100000
          for y = (find-first-bit-by-bit (aref *tt--sample-bv-array* x))
          do (incf (gethash y cnt-table))
          finally (return
                    (loop for x from 0 below 128
                          collect (cons x (gethash x cnt-table)) into idx-counts
                          finally (return
                                    (remove-if #'null
                                               (map 'list
                                                    #'(lambda (x) (and (plusp (cdr x)) x))
                                                    idx-counts))))))))

(dotimes (i 10) (terpri) (print (get-random-uuid-table-distribution)))

; ((0 . 50005) (1 . 24972) (2 . 12341) (3 . 6323) (4 .
3123) (5 . 1628)
; (6 . 789) (7 . 420) (8 . 192) (9 . 114) (10 . 47) (11 . 27) (12 . 9)
; (13 . 4) (14 . 2) (15 . 3) (19 . 1))
;
; ((0 . 50063) (1 . 25070) (2 . 12402) (3 . 6229) (4 . 3052) (5 . 1630)
; (6 . 769) (7 . 384) (8 . 197) (9 . 115) (10 . 51) (11 . 18) (12 . 10)
; (13 . 6) (14 . 3) (15 . 1))
;
; ((0 . 49870) (1 . 25112) (2 . 12435) (3 . 6268) (4 . 3187) (5 . 1550)
; (6 . 795) (7 . 393) (8 . 189) (9 . 105) (10 . 46) (11 . 25) (12 . 16)
; (13 . 1) (14 . 5) (16 . 3))
;
; ((0 . 49986) (1 . 24947) (2 . 12571) (3 . 6243) (4 . 3121) (5 . 1538)
; (6 . 783) (7 . 411) (8 . 200) (9 . 101) (10 . 48) (11 . 26) (12 . 12)
; (13 . 6) (14 . 4) (15 . 1) (16 . 1) (17 . 1))
;
; ((0 . 49920) (1 . 25063) (2 . 12631) (3 . 6290) (4 . 3060) (5 . 1472)
; (6 . 780) (7 . 377) (8 . 193) (9 . 108) (10 . 44) (11 . 29) (12 . 16)
; (13 . 4) (14 . 5) (15 . 5) (16 . 3))
;
; ((0 . 50068) (1 . 24900) (2 . 12508) (3 . 6245) (4 . 3150) (5 . 1585)
; (6 . 761) (7 . 395) (8 . 194) (9 . 99) (10 . 41) (11 . 35) (12 . 13)
; (13 . 3) (14 . 3))
;
; ((0 . 50170) (1 . 24891) (2 . 12450) (3 . 6221) (4 . 3210) (5 . 1571)
; (6 . 726) (7 . 383) (8 . 189) (9 . 95) (10 . 51) (11 . 21) (12 . 8)
; (13 . 5) (14 . 6) (15 . 2) (17 . 1))
;
; ((0 . 50115) (1 . 24924) (2 . 12751) (3 . 6090) (4 . 3049) (5 . 1541)
; (6 . 769) (7 . 381) (8 . 185) (9 . 88) (10 . 54) (11 . 25) (12 . 14)
; (13 . 9) (14 . 1) (15 . 2) (17 . 2))
;
; ((0 . 49698) (1 . 25089) (2 . 12637) (3 . 6268) (4 . 3143) (5 . 1595)
; (6 . 753) (7 . 382) (8 . 209) (9 . 114) (10 . 59) (11 . 28) (12 . 9)
; (13 . 9) (14 . 4) (15 . 2) (16 . 1))
;
; ((0 . 49755) (1 . 25134) (2 . 12420) (3 . 6368) (4 . 3104) (5 . 1596)
; (6 . 802) (7 . 421) (8 . 179) (9 . 103) (10 . 51) (11 . 28) (12 . 21)
; (13 . 12) (14 . 4) (16 . 2))

:NOTE One thing to take into consideration is that a Btree scheme that frobs the UUID bit-vector might want to take care to be unicly::uuid-version-bit-vector aware. E.g.
the output from the following example makes it pretty clear that any node branching on the value of bit 49 is gonna always contain every v4 UUID. Note also that this is equally true of the uuid-integer-128 representation... After evaluating the form below you should see a line of 1's at column 52 in your slime-repl:

(dotimes (i 100 (terpri))
  (terpri)
  (unicly:uuid-print-bit-vector t (unicly:make-v4-uuid)))

From danlentz at gmail.com  Tue Sep 13 14:44:31 2011
From: danlentz at gmail.com (Dan Lentz)
Date: Tue, 13 Sep 2011 10:44:31 -0400
Subject: [vivace-graph-devel] elephant
Message-ID: 

I've been thinking about persistent index strategies, and have read through the paper on fpb+trees, and have had a few thoughts.

The first and simplest is to make use of elephant. It's not very exotic of course, but it would allow a model in which triples can be first class objects, yet leverage a reasonably performant back end (bdb). In addition, the set-valued slots and association slots are nice abstractions on top of which to build the rdf semantics (properties, extensions) on top of a real clos mop.

I figured I'd shoot the idea onto the mailing list to get a feel for the degree and nature of agreement/disagreement.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From raison at chatsubo.net  Tue Sep 13 16:15:59 2011
From: raison at chatsubo.net (Kevin Raison)
Date: Tue, 13 Sep 2011 09:15:59 -0700
Subject: [vivace-graph-devel] elephant
In-Reply-To: 
References: 
Message-ID: <4E6F81BF.5060001@chatsubo.net>

Dan, I actually already tried using elephant at a very early stage in the development of VG. While elephant is an excellent library (which I have used in many projects), it is simply too slow to be of use here. One of the goals of VG is to be fast and to be able to handle billions of triples. Elephant slows down very quickly because of a number of factors, including its use of BerkeleyDB rather than a native Lisp back-end store, as well as its complexity.
VG does not need the level of complexity or abstraction that you get with elephant's indexes and class redefinition logic. In our case, we are dealing with one class, the triple, and as such, we can be very specific about how we store and index it as well as how we deal with it in memory. Standard b-trees simply won't efficiently handle the fanout of a large triple store; we need something specifically tuned to our purpose. I am fairly certain that linear hashing for triple storage combined with b-tries or fb-trees for indexing would do much better. There are other graph dbs out there that use this strategy. See http://blog.directededge.com/2009/02/27/on-building-a-stupidly-fast-graph-database/ for some good discussion. Another goal of mine is to develop a native Lisp back-end that projects like elephant might be able to take advantage of; not relying on external, non-Lisp libraries is a good thing, especially BerkeleyDB, given its terrible licensing terms (thanks, Oracle). You mention that you had some further thoughts after reading the fractal pre-fetching b-trees paper; care to share? -Kevin On 09/13/2011 07:44 AM, Dan Lentz wrote: > I've been thinking about persistent index strategies, and have read > through the paper on fpb+trees, and have had a few thoughts. > > The first and simplest is to make use of elephant. Its not very exotic > or course but it would allow a model in which triples can be first class > objects, yet leverage a reasonably performant back end (bdb). In > addition, the set-valued slots and association slots are nice > abstractions on top of which to build the rdf semantics (properties, > extensions) on top of a real clos mop. > > I figured I'd shoot the idea onto the mailing list to get a feel for the > degree and nature of agreement/disagreement. 
>
>
> _______________________________________________
> vivace-graph-devel mailing list
> vivace-graph-devel at common-lisp.net
> http://lists.common-lisp.net/cgi-bin/mailman/listinfo/vivace-graph-devel

From monkey at sandpframing.com  Wed Sep 14 02:36:03 2011
From: monkey at sandpframing.com (MON KEY)
Date: Tue, 13 Sep 2011 22:36:03 -0400
Subject: [vivace-graph-devel] elephant
In-Reply-To: <4E6F81BF.5060001@chatsubo.net>
References: <4E6F81BF.5060001@chatsubo.net>
Message-ID: 

Hi Dan & Kevin,

On Tue, Sep 13, 2011 at 12:15 PM, Kevin Raison wrote:
> Dan, I actually already tried using elephant at a very early stage in the
> development of VG. While elephant is an excellent library (which I have
> used in many projects), it is simply too slow to be of use here. One of the
> goals of VG is to be fast and to be able to handle billions of triples.

What kind of value are you shooting for here? E.g. What do you think is a reasonable value for X?

(funcall #'(lambda (x) (format nil "~R triples?" (* 1000 1000000 x))) ???)
FWIW I would suggest that X need not be anything larger than 8 :) and that realistically this is a more reasonable upper bounds:

(format nil "~R triples" #x7fffffff)

`cl:sxhash' return value HASH-CODE is specified as a non-negative fixnum:

,----
| The HASH-CODE is intended for hashing. This places no verifiable
| constraint on a conforming implementation, but the intent is that
| an implementation should make a good-faith effort to produce
| HASH-CODES that are well distributed within the range of
| non-negative fixnums.
`----

So, assuming the underlying Lisp does make a good-faith effort to distribute hash-codes across the range of 0,most-positive-fixnum, if we wanted one gimongous linearly-hashed table of ~8 billion triple thingies wouldn't we need _at least_ a theoretical

(format nil "~R fixnums" #X3FFFFFFFF)

for indexing such a gimongoid? And if so, would we not run out of fixnums at:

(format nil "somewhere around fixnum ~R" #x1FFFFFFF)

on an x86-32 SBCL?

In any event, assuming fixnums are the lightest possible serializable key in a hash-table and that the range of those keys has a performative upper bounds of #x1FFFFFFF on x86-32 SBCL and #x3FFFFFFFF on x86-64, we're likely to require at least ~3 bytes of diskspace per 32bit key and ~7 bytes per 64bit key.
(format nil "Noting that:~% ~R triples~% is a ~A bit number"
        #x3FFFFFFFF
        (integer-length (1- (ash 1 64))))

So serializing just the fixnum integer keys of

(format nil "~R triples" #x7fffffff)

is liable to require

(format nil "somewhere less than ~D GB on a 32bit machine"
        (nth-value 0 (round (* #x7fffffff 3) (* 1024 1024 1024))))

(format nil "somewhere less than ~D GB on a 64bit machine"
        (nth-value 0 (round (* #x7fffffff 7) (* 1024 1024 1024))))

(format nil "Somewhere slightly less than ~D GB on a 32bit machine~%~
             and somewhere slightly less than ~D GB on a 64bit machine"
        (nth-value 0 (round (* #x7fffffff 3) (* 1024 1024 1024)))
        (nth-value 0 (round (* #x7fffffff 7) (* 1024 1024 1024))))

Please correct me if my math is out of whack?

Also, assuming these are reasonable amounts for serializing just the hash-table fixnum integer keys of:

(format nil "~R triples" #x7fffffff)

Is it not reasonable to assume that the above values might serve as a good guidepost for what we might expect of an in-memory footprint of the data-structures holding those:

(format nil "~R triples" #x7fffffff)

Even with MMAPing is there not still some significant overhead associated with deserializing the MMAPPed data to something lispy?

> Elephant slows down very quickly because of a number of factors, including
> its use of BerkeleyDB rather than a native Lisp back-end store, as well as

Not to mention there are some licensing issues surrounding BDB... ... FSVO "some" ...

> its complexity. VG does not need the level of complexity or abstraction
> that you get with elephant's indexes and class redefinition logic.

Naively, I would expect the MOP stuff to be a factor. Is it?

> In our
> case, we are dealing with one class, the triple, and as such, we can be very
> specific about how we store and index it as well as how we deal with it in
> memory. Standard b-trees simply won't efficiently handle the fanout of a
> large triple store; we need something specifically tuned to our purpose.

Why not?
I'm under the impression that many of the big Linux distros will soon release with Btrfs as the default file-system... and either Ted Ts'o is sandbagging for google with his recent endorsement of Btrfs over ext4 or there must be at least some utility for B+trees :)

Also, what is a "standard" b-tree? FTR I confuse myself when referencing b-trees :) It might be helpful to establish some protocol for the datastructures in question -- a wiki-link would suffice.

> am fairly certain that linear hashing for triple storage combined with
> b-tries or fb-trees for indexing would do much better.

While i'm not convinced that b-tries are TRT, I certainly agree that linear hashing is (at least for in memory data)!

FWIW apropos all the recent UUID bit-vector timing junk i've posted here recently I figured it might be prudent to take some measurements using just 128-bit integers for indexing... I was pleasantly surprised to find that even with the relatively big 128bit bignums SBCL hash-table lookup is pretty damn snappy with a large(ish) number of key/value pairs in the range of 500k-1mil. Indeed, once optimizations around the allocation of the underlying hash-table are made by massaging the value given to make-hash-table's :SIZE keyword it gets even better!

> There are other graph dbs out there that use this strategy. See
> http://blog.directededge.com/2009/02/27/on-building-a-stupidly-fast-graph-database/
> for some good discussion.

One of the bullet-point API implementation details I found interesting: "Items are identified by a string unique identifier which maps to an integer index." Is this implying the effective equivalent of:

(assoc "stringy-id" '(("stringy-id" . 123456789)) :test 'equal)

Or the inverse:

(assoc 123456789 '((123456789 . "stringy-id")))

??

,----
| This is another point that we break from typical database design
| theory.
| In a typical database you'd look up a record by checking for
| it in a B-tree and then go off to find the data pointer for its
| record, which might have a continuation record at the end that you
| have to look more stuff up in ... and so on. Our lookups are constant
| time. We hash the string identifier to get an Index and then use that
| Index to find the appropriate Offsets for its data. These vectors are
| sequential on disk, rather than using continuation tables or something
| similar, which makes constant time lookups possible.
`----

The section "File-based Data Structures: Stack, Vector, Linear Hash" doesn't sound entirely unlike the toy example of self-resolving string-entities using `initialize-context'/`get-entity-in-context' which i posted here the other day:

http://lists.common-lisp.net/pipermail/vivace-graph-devel/2011-September/000008.html

>
> Another goal of mine is to develop a native Lisp back-end that projects like
> elephant might be able to take advantage of;

I would like to interject that while vivace-graph-v2 may not be targeted as a full blown Persistent Object Store it does have the potential to re-think some of the cool functionality of Statice using SPROG instead of an object hierarchy:

http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/statice.html

> not relying on external,
> non-Lisp libraries is a good thing, especially BerkeleyDB, given its
> terrible licensing terms (thanks, Oracle).

It's not just Oracle that has left the BDB license in shambles...

Regardless, I personally have a strong desire to keep integration with external (read non-lispy) tools to a minimum.

From raison at chatsubo.net  Wed Sep 14 05:15:56 2011
From: raison at chatsubo.net (Kevin Raison)
Date: Tue, 13 Sep 2011 22:15:56 -0700
Subject: [vivace-graph-devel] elephant
In-Reply-To: <4E6F81BF.5060001@chatsubo.net>
References: <4E6F81BF.5060001@chatsubo.net>
Message-ID: <4E70388C.7060408@chatsubo.net>

A good paper on linear hashing to disk is attached.
I will respond to Mon key's comments after some sleep... On 09/13/2011 09:15 AM, Kevin Raison wrote: > Dan, I actually already tried using elephant at a very early stage in > the development of VG. While elephant is an excellent library (which I > have used in many projects), it is simply too slow to be of use here. > One of the goals of VG is to be fast and to be able to handle billions > of triples. Elephant slows down very quickly because of a number of > factors, including its use of BerkeleyDB rather than a native Lisp > back-end store, as well as its complexity. VG does not need the level of > complexity or abstraction that you get with elephant's indexes and class > redefinition logic. In our case, we are dealing with one class, the > triple, and as such, we can be very specific about how we store and > index it as well as how we deal with it in memory. Standard b-trees > simply won't efficiently handle the fanout of a large triple store; we > need something specifically tuned to our purpose. I am fairly certain > that linear hashing for triple storage combined with b-tries or fb-trees > for indexing would do much better. There are other graph dbs out there > that use this strategy. See > http://blog.directededge.com/2009/02/27/on-building-a-stupidly-fast-graph-database/ > for some good discussion. > > Another goal of mine is to develop a native Lisp back-end that projects > like elephant might be able to take advantage of; not relying on > external, non-Lisp libraries is a good thing, especially BerkeleyDB, > given its terrible licensing terms (thanks, Oracle). > > You mention that you had some further thoughts after reading the fractal > pre-fetching b-trees paper; care to share? > > -Kevin > > > On 09/13/2011 07:44 AM, Dan Lentz wrote: >> I've been thinking about persistent index strategies, and have read >> through the paper on fpb+trees, and have had a few thoughts. >> >> The first and simplest is to make use of elephant. 
Its not very exotic >> or course but it would allow a model in which triples can be first class >> objects, yet leverage a reasonably performant back end (bdb). In >> addition, the set-valued slots and association slots are nice >> abstractions on top of which to build the rdf semantics (properties, >> extensions) on top of a real clos mop. >> >> I figured I'd shoot the idea onto the mailing list to get a feel for the >> degree and nature of agreement/disagreement. >> >> >> _______________________________________________ >> vivace-graph-devel mailing list >> vivace-graph-devel at common-lisp.net >> http://lists.common-lisp.net/cgi-bin/mailman/listinfo/vivace-graph-devel > > _______________________________________________ > vivace-graph-devel mailing list > vivace-graph-devel at common-lisp.net > http://lists.common-lisp.net/cgi-bin/mailman/listinfo/vivace-graph-devel -------------- next part -------------- A non-text attachment was scrubbed... Name: e_ds_linearhashing.pdf Type: application/pdf Size: 108658 bytes Desc: not available URL: From raison at chatsubo.net Wed Sep 14 05:30:53 2011 From: raison at chatsubo.net (Kevin Raison) Date: Tue, 13 Sep 2011 22:30:53 -0700 Subject: [vivace-graph-devel] elephant In-Reply-To: <4E70388C.7060408@chatsubo.net> References: <4E6F81BF.5060001@chatsubo.net> <4E70388C.7060408@chatsubo.net> Message-ID: <4E703C0D.3090306@chatsubo.net> And one more paper on linear hashing. On 09/13/2011 10:15 PM, Kevin Raison wrote: > A good paper on linear hashing to disk is attached. I will respond to > Mon key's comments after some sleep... > > On 09/13/2011 09:15 AM, Kevin Raison wrote: >> Dan, I actually already tried using elephant at a very early stage in >> the development of VG. While elephant is an excellent library (which I >> have used in many projects), it is simply too slow to be of use here. >> One of the goals of VG is to be fast and to be able to handle billions >> of triples. 
Elephant slows down very quickly because of a number of >> factors, including its use of BerkeleyDB rather than a native Lisp >> back-end store, as well as its complexity. VG does not need the level of >> complexity or abstraction that you get with elephant's indexes and class >> redefinition logic. In our case, we are dealing with one class, the >> triple, and as such, we can be very specific about how we store and >> index it as well as how we deal with it in memory. Standard b-trees >> simply won't efficiently handle the fanout of a large triple store; we >> need something specifically tuned to our purpose. I am fairly certain >> that linear hashing for triple storage combined with b-tries or fb-trees >> for indexing would do much better. There are other graph dbs out there >> that use this strategy. See >> http://blog.directededge.com/2009/02/27/on-building-a-stupidly-fast-graph-database/ >> >> for some good discussion. >> >> Another goal of mine is to develop a native Lisp back-end that projects >> like elephant might be able to take advantage of; not relying on >> external, non-Lisp libraries is a good thing, especially BerkeleyDB, >> given its terrible licensing terms (thanks, Oracle). >> >> You mention that you had some further thoughts after reading the fractal >> pre-fetching b-trees paper; care to share? >> >> -Kevin >> >> >> On 09/13/2011 07:44 AM, Dan Lentz wrote: >>> I've been thinking about persistent index strategies, and have read >>> through the paper on fpb+trees, and have had a few thoughts. >>> >>> The first and simplest is to make use of elephant. Its not very exotic >>> or course but it would allow a model in which triples can be first class >>> objects, yet leverage a reasonably performant back end (bdb). In >>> addition, the set-valued slots and association slots are nice >>> abstractions on top of which to build the rdf semantics (properties, >>> extensions) on top of a real clos mop. 
>>>
>>> I figured I'd shoot the idea onto the mailing list to get a feel for the
>>> degree and nature of agreement/disagreement.
>>>
>>>
>>> _______________________________________________
>>> vivace-graph-devel mailing list
>>> vivace-graph-devel at common-lisp.net
>>> http://lists.common-lisp.net/cgi-bin/mailman/listinfo/vivace-graph-devel
>>
>> _______________________________________________
>> vivace-graph-devel mailing list
>> vivace-graph-devel at common-lisp.net
>> http://lists.common-lisp.net/cgi-bin/mailman/listinfo/vivace-graph-devel
>
>
> _______________________________________________
> vivace-graph-devel mailing list
> vivace-graph-devel at common-lisp.net
> http://lists.common-lisp.net/cgi-bin/mailman/listinfo/vivace-graph-devel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: p195-ellis.pdf
Type: application/pdf
Size: 1883801 bytes
Desc: not available
URL: 

From monkey at sandpframing.com  Wed Sep 14 15:09:06 2011
From: monkey at sandpframing.com (MON KEY)
Date: Wed, 14 Sep 2011 11:09:06 -0400
Subject: [vivace-graph-devel] elephant
In-Reply-To: <4E703C0D.3090306@chatsubo.net>
References: <4E6F81BF.5060001@chatsubo.net>
	<4E70388C.7060408@chatsubo.net>
	<4E703C0D.3090306@chatsubo.net>
Message-ID: 

Hi Kevin,

Thanks for the links to the papers. I'm reviewing and finding them quite informative! Looks like I was really abusing the term "linear hashing" :(

--
/s_P\

From danlentz at gmail.com  Wed Sep 14 15:25:41 2011
From: danlentz at gmail.com (Dan Lentz)
Date: Wed, 14 Sep 2011 11:25:41 -0400
Subject: [vivace-graph-devel] nodes, fixnums/upper bounds, and multi-constituent indices
Message-ID: 

I am still reading through all the homework recommended in recent posts :) Really good stuff.
I hope my questions are not a distraction from the important topics at hand but just contribute toward general discussion and (at least my) understanding of the project, its goals, how I can utilize VG and perhaps, in some way, try to contribute to the effort, if possible.

part 1

Another topic I have been looking at related to the indexing and uuid's is the representation (reification?) of nodes, or lack thereof.

One difference in vivace graph versus other tstores I've played with is the ability to reference nodes as first class "things". This is called a "node" in wilbur, and is represented by a simple object composed of the canonical identifier (uri-namestring) and a flag to indicate "resolution", which, for wilbur, indicates identification to a short/long namespace mapping, but I think the concept can be extended to also perhaps refer to hashing or other deferrable operations. In the Directed Edge model, nodes are apparently considered "Items" and have a somewhat richer archetype.

In VG, this is not the case? Triples are (currently) represented by time based uuid as previously discussed, and nodes themselves are not hashed and indexed. Maybe this is going to change naturally in the course of moving to v5 uuid?

part 2

This sort of blends into another indexing-model question, related to the current model which is based on a hierarchical index structure? Couldn't additional speed be achieved through multi-constituent indexing? I.e. an SP index, a PO index, etc., in which multiple nodes of a single triple are hashed in the aggregate to allow for direct lookup. This would of course decrease the upper bounds on the number of triples previously discussed if housed in a single-rooted index structure, as there would be (eventually) collisions between these incongruent indexing schemes. So maybe a multi-rooted index strategy is something that should be considered and incorporated early on.
I think this is already partially implemented as spogi, gsopi, etc, but is still "single-constituent" hierarchical? As a concrete example -- in case my question has been as clear as mud :) -- i'd cite the cassandra-spoc-index-mediator of de.setf.resource, which leverages multi-constituent indexes extensively. Apologies (as usual) if I am missing something obvious or distracting from more useful conversation. Dan -------------- next part -------------- An HTML attachment was scrubbed... URL: From raison at chatsubo.net Wed Sep 14 20:01:29 2011 From: raison at chatsubo.net (Kevin Raison) Date: Wed, 14 Sep 2011 13:01:29 -0700 Subject: [vivace-graph-devel] elephant In-Reply-To: References: <4E6F81BF.5060001@chatsubo.net> Message-ID: <4E710819.6090107@chatsubo.net> Comments inline below. > Even with MMAPing there is there not still some significant overhead > associated with deserializing the MMAPPed data to something lispy? Yes, for persistence, this is unavoidable; the solution is heavy caching and interning of strings (VG already does this). I started work on some code a long time ago that worked like this: One mmap'ed linear hash file per graph, with the triple-id as hash key. The value slot is an offset (integer) into another mmap'ed file (or files) which is the actual triple storage area. Because triples are never truly deleted, only a simple memory allocator is needed for the triple storage file (append only). For indices, use some b-tree variant (possibly start with cl-btree: http://www.cliki.net/cl-btree or a b-trie: http://www.naskitis.com/naskitis-vldbj09.pdf) that maps keys to triple-ids. So a non-cached lookup via triple-id would hit the mmap'ed hash table, get an offset into the triple storage area, deserialize the triple into the cache and return the struct (or vector if we want to be really efficient). 
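In toy form, the non-cached lookup path just described might be sketched as follows. This is only an illustration, not VG code: a plain hash-table and an adjustable vector stand in for the mmap'ed hash file and the append-only triple storage area, and every name here is made up.

```lisp
;; *id->offset* stands in for the mmap'ed linear hash file,
;; *storage* for the append-only triple storage area, and
;; *cache* for the in-memory triple cache.
(defparameter *id->offset* (make-hash-table))
(defparameter *storage* (make-array 0 :adjustable t :fill-pointer 0))
(defparameter *cache* (make-hash-table))

(defun store-triple (id s p o)
  "Append the triple to storage and record its offset under ID."
  (let ((offset (fill-pointer *storage*)))
    (vector-push-extend (vector s p o) *storage*)
    (setf (gethash id *id->offset*) offset)
    id))

(defun lookup-triple (id)
  "Return the triple for ID from the cache, faulting it in on a miss."
  (or (gethash id *cache*)
      (let ((offset (gethash id *id->offset*)))
        (when offset
          ;; a real store would seek to OFFSET on disk and deserialize here
          (setf (gethash id *cache*) (aref *storage* offset))))))
```

A second (lookup-triple id) on the same id is then served from *cache* without touching the storage area, which is the slow-down the caching is meant to avoid.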
A search via an index would return a triple-id, which would hit the cache and return the triple or pass through to the hash table and repeat the deserialization process described above. AllegroGraph will use up as much memory as is available for caching, and with good reason; the more triples we keep in memory, the less slow-down we get. We might even have two layers of cache: cache triples themselves, as well as queries that map to offsets in the data file or to triple-ids. Also, since the hash table will be fairly small (integers -> integers), loading the whole thing into memory should be possible. >> its complexity. VG does not need the level of complexity or abstraction >> that you get with elephant's indexes and class redefinition logic. > > Naively, I would expect the MOP stuff to be a factor. Is it? Yes, who needs the MOP in this circumstance? I would prefer triples to be stored in memory as vectors (using the args to defstruct to force the use of vector storage) for the sake of efficiency. >> In our >> case, we are dealing with one class, the triple, and as such, we can be very >> specific about how we store and index it as well as how we deal with it in >> memory. Standard b-trees simply won't efficiently handle the fanout of a >> large triple store; we need something specifically tuned to our purpose. > > Why not? Think about what a triple really is: each S, P or O is a completely unique thing in the database. For the triple (Kevin likes cats), three symbols are created: 'Kevin, 'likes, and 'cats. Another triple, say (Kevin likes dogs) only creates one new symbol: 'dogs. There are not now two 'Kevin's in the db, but two triples that reference that one symbol. 
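The interning described above might be illustrated like this (a minimal sketch; the table and function names are made up for illustration and are not VG's):

```lisp
;; Each S, P, or O string is interned exactly once; a later triple
;; that reuses a constituent gets back the very same object.
(defparameter *constituents* (make-hash-table :test 'equal))

(defun intern-constituent (name)
  (or (gethash name *constituents*)
      (setf (gethash name *constituents*) (make-symbol name))))

(defun make-triple (s p o)
  (list (intern-constituent s) (intern-constituent p) (intern-constituent o)))

(make-triple "KEVIN" "LIKES" "CATS") ; interns KEVIN, LIKES, and CATS
(make-triple "KEVIN" "LIKES" "DOGS") ; interns only DOGS
;; (hash-table-count *constituents*) => 4
```

Two triples, but only four constituents in the table: the db holds one 'KEVIN referenced by both.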
When indexing something like this in a btree, each symbol is a node in the tree, and each level of the btree would correspond to a slot in the triple; for example, to index in order of subject, predicate, object, the tree would be structured as

       S
      / \
     P   P
    / \   \
   O   O   O

Because nodes don't repeat and are atomic symbols, traversing the tree would be a linear search at each level. In a b-trie (http://www.naskitis.com/naskitis-vldbj09.pdf), you could effectively string the S, P and O together and create more reasonably branching search paths. For example, to index the two triples mentioned above with a third, (Kevin loves pizza), you would have a tree like:

          K
          |
          E
          |
          V
          |
          I
          |
          N
          |
          L
         / \
        I   O
        |   |
        K   V
        |   |
        E   E
        |   |
        S   S
       / \    \
    CATS  DOGS  PIZZA
      |     |      |
     ID    ID     ID

This would also allow for substring matching in a very simple way. Read the paper for more details and a comparison to B+ trees.

>> Another goal of mine is to develop a native Lisp back-end that projects like
>> elephant might be able to take advantage of;

> I would like to interject that while vivace-graph-v2 may not be
> targetted as a full blown Persistent Object Store it does have the
> potential to re-think some of the cool functionality of Statice using
> SPROG instead of an object hierarchy:
>
> http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/statice.html

I don't have time to look at this right now; will revisit and comment later.

>> not relying on external,
>> non-Lisp libraries is a good thing, especially BerkeleyDB, given its
>> terrible licensing terms (thanks, Oracle).
>
> Its not just Oracle that have left BDB license in shambles...
>
> Regardless, I personally have a strong desire to keep integration with
> external (read non-lispy) tools to a minimum.

YES!
-K

From monkey at sandpframing.com  Wed Sep 14 20:35:40 2011
From: monkey at sandpframing.com (MON KEY)
Date: Wed, 14 Sep 2011 16:35:40 -0400
Subject: [vivace-graph-devel] nodes, fixnums/upper bounds, and multi-constituent indices
In-Reply-To: 
References: 
Message-ID: 

Hi Dan,

On Wed, Sep 14, 2011 at 11:25 AM, Dan Lentz wrote:
> I am still reading though all the homework recommended in recent posts :)

Me too :)

> Really good stuff.

I'm learning a lot as well.

> I hope my questions are not a distraction from the important topics
> at hand but just contribute toward general discussion and (at least
> my) understanding of the project, its goals, how I can utilize VG
> and perhaps, in some way, try to contribute to the effort, if
> possible.

I share many of the same questions and appreciate not having to ask them myself. Also, I've found it extremely useful to have recourse to the dialogues, discussions, questions, and answers on other archived common-lisp.net mailing lists esp. for projects with specs/API which were finalized years ago and the designers have since moved on, stopped active development, or are in maintenance only mode (Rucksack comes immediately to mind).

> Another topic I have been looking at related to the indexing and
> uuid's is the representation (reification?) of nodes, or lack
> thereof.
>
> One difference in vivace graph versus other tstores I've played with
> is the ability to reference nodes as first class "things".

Maybe because they resolve to first class Lisp objects and don't resort to mediating the inferior objects spat out by lesser non-lispy sources :) No doubt this will eventually change once the VG2 transaction/persistence/indexing stuff is better established (hopefully sooner than later).

> Another topic I have been looking at related to the indexing and uuid's is
> the representation (reification?) of nodes, or lack thereof.

VG2 is Kevin's baby and he's the boss, so I hope i'm not stepping on toes by interjecting.
My impression is that implementations approaching SPOG triples tend to have some hard-wired implicit assumptions about the operational semantics of SPOG, and that these assumptions are likely to yield a relatively constant subset of basic operations over the triples regardless of implementation. Which is to say, the basic idiom for how one might perform these operations is established (independent of whether VG currently implements them or not). Where VG2 might differ or deviate from other implementations is not w/r/t SPOG but rather SPOGI, e.g. triple-id (and by proxy triple-indexes).

> In VG, this is not the case? Triples are (currently) represented by time
> based uuid as previously discussed, and nodes themselves are not hashed and
> indexed. Maybe this is going to change naturally in the course of moving to
> v5 uuid?

IMHO it is not a given that a change to a namespacing UUID would necessarily change the existing VG2 assumptions. The role of UUIDs (potential and current) in VG2 is multi-faceted: v1 UUIDs are slow, and if you don't require their time-stamping then a v4 UUID is a better solution when all that is really required is an anonymous but reasonably unique ID. There is no immediate gain to be had by using a namespaced (v3 or v5) UUID instead of an anonymous v4 UUID. In fact, there would be a loss in performance b/c there is more overhead associated with the minting of v3/v5 UUIDs. If you assume that any SPOGI implementation must concern itself with "namespacing" then there _may_ be some gain in using v3/v5 UUIDs instead of v4 UUIDs. Whether this is the case depends on how the system implements:

- triple-id
  Whether the base identity is a string, integer, class-instance, etc.
- triple-id-resolution
  How the _base_ representation of triple identity resolves to intermediate and higher-level representations

- triple-indexing
  How base triple identity is indexed and how the indexed identities are resolved with their intermediate and higher-level representations

- triple-persistence
  Whether the system can/should preserve state across sessions. If preserving state is a goal then the degree to which the data-structures employed for triple-indexing are in-memory bound or require disk-i/o becomes a factor. If the system can remain performant with only an in-memory footprint then the majority of persistence issues are moot.

- triple-performance
  What is a reasonable upper bound on the number of triples the system should expect to handle? Should the system handle networked/distributed/concurrent access? How will implementations of networked/distributed/concurrent triple access scale?

Obv. there are interdependencies among the set of considerations outlined above.

> part 2
>
> This sort of blends into another indexing-model question, related to
> the current model which is based on a hierarchical index structure?
> Couldn't additional speed be achieved though multi-constituent
> indexing? IE an SP index, a PO index, etc. in which multiple nodes of
> a single triple are hashed in the aggregate to allow for direct
> lookup.

Having spent some more time looking at the linear-hashing papers Kevin provided, I'm having trouble seeing this as an either/or situation; e.g. underneath it's all gonna eventually wind up as hash-tables, arrays, integers, and de-referenced bucket/node/leaf/offset pointers :)

I'm interested to learn from Kevin how much of the linear-hashing scheme he believes is already "built-in" to the existing VG2 code-base.

In particular, is most of the footwork for the linear-hashing work already in place?

And, if not, whether there is some drop-in data-structure capable of implementing the linear-hashing scheme he envisions.
And, if not, what does he anticipate is required to implement a functional linear-hashing scheme as he envisions?

> As a concrete example -- in case my question has been as clear as
> mud :) -- i'd cite the cassandra-spoc-index-mediator of
> de.setf.resource, which leverages multi-constituent indexes
> extensively.

My impression is that de.setf.resource has taken this approach b/c it is in large part a meta-library for CLOS<->RDF compatibility, and the underlying constraints required to accommodate RDF require it.

My read on the quoted section below is that Anderson's quotation marks around "open-world" are meant as a mild slight on the RDF fanboys at W3C, in so much as (of itself) RDF is not capable of reasoning in either closed- or open-world contexts. Regardless, the following quote also provides some indication of how/why Anderson has made use of UUIDs w/r/t external resources, namely that the need for unique identities is as much a function of preserving transactional context as it is one of maintaining mappings of object-identity equivalence.

,----
| Persistence Mediation
|
| Despite the RDF "open-world" paradigm, which requires a processing
| mechanism accommodate unforseen data, it is imperative that a
| repository mediator afford an application a stable projection of
| unpredicatable content. If a CLOS application is to rely on class
| and generic function definitions to behave as intended, they must be
| bound to data as it appears, `de.setf.resource` serves this goal in
| several ways:
|
| - it implements instance identity within a given mediation interface
|   according to subject URI
|
| - it provides for automatic unique instance URI generation within a
|   transactional context
|
| - it treats symbols, universal names, and URI as equivalent
|
| - it accepts resources descriptions without nominal type indications,
|   reconciles them to the know class structure and admits additional
|   prototypical attributes.
|
| + instance identity, indexing, and caching
|   Each repository mediator adopts the respective repository's interened
|   URI 'nodes' as unifying identifiers to ensure a one-to-one relation
|   between identified objects and external resources. The URI serve as
|   keys in an hash table which is used in query operations to yield
|   identical instances for equivalent URI. The cache is not held weak,
|   as the repository's URI designator-to-node cache is itself static.
`----
:SOURCE de.setf.resource/resource-class.lisp

FWIW, with specific consideration to the future implementation details of VG2, I think approaching the semantics of SPOGI triples with an RDF-centric lens can only hamstring efforts b/c:

a) The RDF model mostly mimics much pre-existing Lisp-based kb/semantic-net/AI work, so layering the RDF model on top of Lisp is not unlike using the C programming language to implement CLISP and then using CLISP to implement the C programming language in Lisp...

b) Working in the RDF model requires constant string wrangling. This places a significant burden on Lisp to map the brain-dead syntaxes/semantics of curly-brace-derived languages over Lisp's. IOW, Lisp-2s haven't directly conflated symbols with strings since MacLisp days...

c) The RDF model generally seems to place more focus on the role of semantics around distribution of knowledge as a resource, and less on the semantics of reasoning and deduction about the knowledge comprising a resource.

This being said, I'm not knocking RDF, its stated goals, or its utility. Nor do I wish to cast aspersions on the real-world concerns that warrant an eventual focus on integrating with RDF as an attractive and laudable goal to promote for VG2 -- if only b/c "thars gold in dem hills..." and in general Lisp bums deserve more gold!
I just personally hope that focusing on "how RDF does it" is not an immediate primary concern :)

> Dan

/s_P\

From monkey at sandpframing.com  Thu Sep 15 01:48:03 2011
From: monkey at sandpframing.com (MON KEY)
Date: Wed, 14 Sep 2011 21:48:03 -0400
Subject: [vivace-graph-devel] elephant
In-Reply-To: <4E710819.6090107@chatsubo.net>
References: <4E6F81BF.5060001@chatsubo.net> <4E710819.6090107@chatsubo.net>
Message-ID:

Hi Kevin,

Thanks for your detailed response.

>> Naively, I would expect the MOP stuff to be a factor. Is it?
> Yes, who needs the MOP in this circumstance?

No clue, I'm not suggesting it is needed :)

> I would prefer triples to be stored in memory as vectors (using the args to
> defstruct to force the use of vector storage) for the sake of efficiency.

OK. Yes, you can mark them read-only too. Also, on SBCL there may be some potential gains to be had with `sb-ext:freeze-type'.

> Think about what a triple really is:
> each S, P or O is a completely unique thing in the database.

I'm not entirely comfortable with that assertion. To the extent with which it is currently so, I'm not convinced it can't be otherwise (e.g. with namespaces).
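A minimal sketch of the vector-storage idea quoted above, using DEFSTRUCT's standard :TYPE option plus read-only slots (the TRIPLE name and slot layout here are hypothetical, not the current VG2 representation):

```lisp
;; :TYPE VECTOR makes instances plain vectors rather than structure
;; objects; :READ-ONLY T suppresses the SETF accessors, as suggested.
(defstruct (triple (:type vector))
  (subject   nil :read-only t)
  (predicate nil :read-only t)
  (object    nil :read-only t))

;; MAKE-TRIPLE now returns a plain vector, and the accessors are
;; effectively AREF/SVREF calls:
(let ((tr (make-triple :subject 'kevin :predicate 'loves :object 'pizza)))
  (list (vectorp tr) (triple-predicate tr)))
;; => (T LOVES)

;; On SBCL, a named structure type that will never be redefined or
;; subclassed can additionally be frozen, speeding up type checks:
#+sbcl
(progn
  (defstruct node name)
  (sb-ext:freeze-type node))
```

Note that freezing only applies to named structure types; a :TYPE VECTOR struct like TRIPLE above is just a vector and carries no type of its own to freeze.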
I would think the S, P, O are only completely unique within some context:

(let ((context 'outer))
  (let ((s "FOO")
        (p "IS-A")
        (o "BAR"))
    (print context)
    (print (list s p o))
    (let ((s "FOO")
          (p "IS-NOT-A")
          (o "BAR")
          (context 'inner)
          (new-foo '())
          (new-bar '()))
      (setf new-foo s
            new-bar o)
      (print context)
      (flet ((test-var-mk3 (test var init)
               (apply test (list var (make-array 3
                                                 :element-type 'character
                                                 :initial-contents init)))))
        (print `((,s ,(test-var-mk3 'eq s new-foo)
                     ,(test-var-mk3 'eql s new-foo)
                     ,(test-var-mk3 'equal s new-foo))
                 (,o ,(test-var-mk3 'eq o new-bar)
                     ,(test-var-mk3 'eql o new-bar)
                     ,(test-var-mk3 'equal o new-bar)))))
      (print (list s p o))
      (setf s nil p nil o nil)
      (print (list s p o)))
    (print context)
    (print (list s p o))
    (values)))

> Because nodes don't repeat and are atomic symbols, traversing the tree
> would be a linear search at each level.

OK. Thank you for this explanation. I think I have been conflating the demands of triple indexing with the demands of dereferencing triples/graphs from the persistent store.

This said, I'm not at all comfortable with the current explanation w/r/t the semiotics around uniqueness and the atomicity of symbols vs strings, and I am assuming that there _must_ be some level of indirection between the objects denoted by the S, P, and O and the objects which identify these denoted objects. IOW, I'm assuming there are some over-simplifications around the whole sign/signifier/signified thang, and that this is all well-trodden territory for you and you're simply sparing us the ugly details :)

> In a b-trie (http://www.naskitis.com/naskitis-vldbj09.pdf), you could
> effectively string the S, P and O together and create more reasonably branching
> search paths. For example, to index the two triples mentioned above with a
> third, (Kevin loves pizza), you would have a tree like:

I'm still reading this paper, although as yet I am completely failing to understand how b-tries might easily accommodate namespace/context.
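On namespaced uniqueness: v3/v5 UUIDs make the context explicit by minting the id from a namespace UUID plus a name, so the same name in two contexts yields two distinct, but reproducible, identities. A sketch assuming the `uuid' library's make-v4-uuid/make-v5-uuid (Unicly offers analogous operators):

```lisp
(ql:quickload :uuid)

(let ((outer (uuid:make-v4-uuid))   ; two anonymous namespace "contexts"
      (inner (uuid:make-v4-uuid)))
  (list
   ;; same name, different contexts => different ids
   (string= (princ-to-string (uuid:make-v5-uuid outer "FOO"))
            (princ-to-string (uuid:make-v5-uuid inner "FOO")))
   ;; same name, same context => the same id on every minting
   (string= (princ-to-string (uuid:make-v5-uuid outer "FOO"))
            (princ-to-string (uuid:make-v5-uuid outer "FOO")))))
;; => (NIL T)
```

The reproducibility in the second case is what an anonymous v4 UUID can never give you, and is the only real argument for paying the v5 hashing overhead.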
>> Regardless, I personally have a strong desire to keep integration with
>> external (read non-lispy) tools to a minimum.
> YES!

Great! This is really my #1 concern and interest, and I should reiterate that next to a functional persistent VG2 all other details are secondary :)

--
/s_P\

From danlentz at gmail.com  Thu Sep 15 12:30:24 2011
From: danlentz at gmail.com (Dan Lentz)
Date: Thu, 15 Sep 2011 08:30:24 -0400
Subject: [vivace-graph-devel] nodes, fixnums/upper bounds, and multi-constituent indices
In-Reply-To:
References:
Message-ID:

> > Another topic I have been looking at related to the indexing and
> > uuid's is the representation (reification?) of nodes, or lack
> > thereof.
> >
> > One difference in vivace graph versus other tstores I've played with
> > is the ability to reference nodes as first class "things".
>
> Maybe because they resolve to first class Lisp objects and don't
> resort to mediating the inferior objects spat out by lesser non-lispy
> sources :)
>
> No doubt this will eventually change once the VG2
> transaction/persistence/indexing stuff is better established
> (hopefully sooner than later).

Ok, actually I worked out a pleasant node/namespace/package/symbol mapping automation last evening based on the "graph-words" which I'm using as the canonical node representation, with lambdas (fdefinitions) and symbol-values to dereference and map back and forth. It's simply housekeeping, but it's convenient, even more so than the "bang reader" macro, so if there is any interest I'd be happy to post a more thorough description or code snippet.

I like the graph-words convention and looked a little bit into combining the above with ContextL, which I believe could be used to nice effect in order to provide a conveniently namespace-aware symbol mapping like the above, but with symbols mapped based on dynamic context established with something like a "with-graphs" macro.
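For what it's worth, here is one package-based approximation of that idea, without ContextL. Every name in it (WITH-GRAPH, GRAPH-WORD, *GRAPH-PACKAGE*) is invented for illustration and is not VG2 (or graph-words) API:

```lisp
(defvar *graph-package* (find-package :cl-user)
  "Package whose symbols act as the current graph's node namespace.")

(defmacro with-graph ((name) &body body)
  "Run BODY with *GRAPH-PACKAGE* bound to the package named NAME,
creating the package on first use."
  `(let ((*graph-package* (or (find-package ,name)
                              (make-package ,name :use nil))))
     ,@body))

(defun graph-word (string)
  "Intern STRING as a node symbol in the current graph's namespace."
  (intern (string-upcase string) *graph-package*))

;; The same word names distinct node symbols in distinct graphs:
(eq (with-graph ("GRAPH-A") (graph-word "kevin"))
    (with-graph ("GRAPH-B") (graph-word "kevin")))
;; => NIL
```

As noted below for ContextL, the packages here are static; only which one is current is dynamic, via the special variable binding.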
ContextL is pretty fast at switching between these dynamic symbol mappings, although (important note) the packages (namespaces) themselves are static. The "contents" (symbols) are dynamic.

> > part 2
> >
> > This sort of blends into another indexing-model question, related to
> > the current model which is based on a hierarchical index structure?
> > Couldn't additional speed be achieved though multi-constituent
> > indexing? IE and SP index, PO, index etc in which multiple nodes of
> > a single triple are hashed in the aggregate to allow for direct
> > lookup.
>
> Having spent some more time looking at the linear-hashing papers Kevin
> provided I'm have trouble see this as an either/or situation,
> e.g. underneath its all gonna eventually wind-up as hash-tables
> arrays, integers and de-referenced bucket/node/leaf/offset pointers :)
>
> I'm interested to learn from Kevin how much of the linear-hashing
> scheme he believes is already "built-in" to the existing VG-2
> code-base.
>
> In particular, if most of the footwork for the linear-hashing work is
> already in place?
>
> And, if not whether there is some drop-in data-structure capable of
> implementing the linear-hashing scheme he envisions.
>
> And, if not what does he anticipate is required to implement a
> functional linear-hashing scheme as he envisions.

So, in effect, multi-constituent indexing should be easy to tack on later down the road?

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From danlentz at gmail.com  Thu Sep 15 19:48:15 2011
From: danlentz at gmail.com (Dan Lentz)
Date: Thu, 15 Sep 2011 15:48:15 -0400
Subject: [vivace-graph-devel] transactions on top of triples, scalability, and another storage alternative
Message-ID:

On the subject of billions of triples, one thing that comes to mind is that true scalability comes from the ability to operate in a federated model, on a possibly distributed store.
This requires a transaction model that operates in a shared, multi-user scenario. One way to implement such a thing is *on top* of triples. Do we have interest in this? I have in mind de.setf.resource (do I sound like a broken record?), which defines such a methodology and implements it in such a way as to abstract over the differences between single-repo and distributed-repo models.

By the way, as far as distributed stores go, REDIS comes to mind as a far better alternative to cassandra. Now, of course, this does introduce a non-lisp component...

...BUT it provides near-infinite scalability and the capability of both remote and local storage configurations. E.g., press a button, deploy a simple REDIS server to EC2, and have near-infinitely scalable graph storage with all the benefits of hosted EC2. I've done some work with this and have found REDIS to be very good to work with via cl-redis. The downside is some per-query latency and a non-lispy backend. The upsides are many, and also include bonuses such as pubsub queues and sorted sets, upon which it is easy to build many other structures out of triples, which, in turn, makes VG more broadly useful and the graph model an easier foundation to build on.

You may now resume your regularly scheduled booing and hissing :)

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
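As a concrete taste of the cl-redis route: a triple can be stored as a Redis hash keyed by its id, with set-valued secondary indexes per constituent. The key scheme below ("triple:<id>", "idx:s:<subject>") is invented for illustration, and a Redis server is assumed on localhost:

```lisp
(ql:quickload :cl-redis)

(redis:with-connection ()
  ;; one triple, stored as a hash keyed by a (here hard-wired) id
  (red:hset "triple:1" "s" "Kevin")
  (red:hset "triple:1" "p" "loves")
  (red:hset "triple:1" "o" "pizza")
  ;; constituent indexes: sets of triple ids per subject / predicate
  (red:sadd "idx:s:Kevin" "triple:1")
  (red:sadd "idx:p:loves" "triple:1")
  ;; direct lookup of all triple ids with subject "Kevin"
  (red:smembers "idx:s:Kevin"))
```

Multi-constituent indexes of the SP/PO flavor discussed earlier fall out of the same pattern by keying sets on concatenated constituents (e.g. "idx:sp:Kevin:loves"), at the cost of one round-trip per index touched.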