From monkey at sandpframing.com Sun Sep 4 21:11:15 2011
From: monkey at sandpframing.com (MON KEY)
Date: Sun, 4 Sep 2011 17:11:15 -0400
Subject: [vivace-graph-devel] Recent Babel changes - discussion from github
Message-ID: 

Hello,

I recently commented on github w/r/t a recent commit which creates a new
dependency on Babel:

(URL `https://github.com/mon-key/vivace-graph-v2/commit/8a91ff8c52b87411bb9a816b3094c12d15ea69ed#commitcomment-568271')

As the devel mailing list is the more appropriate forum for this, I figure I
should forward my comments here; any follow-up should occur here.

Following is a transcript of the initial exchange:

,----
| mon-key added a note to 8a91ff8 repo owner
|
| Was there a specific reason for the change to
| babel:string-to-octets?
`----

,----
| kraison added a note to 8a91ff8
|
| portability. some day, when I actually get around to finishing this
| application, I would like it to be portable across Lisps. are you
| using vivace-graph? did this change cause any troubles?
`----

I don't find any immediate trouble :)

I understand that the move to Babel is one step towards removing reliances
on SBCL internals. This said, I would think `flexi-streams:octets-to-string'
might be a better choice, esp. in so much as the long-term goals /
requirements of vivace-graph-v2 are likely to eventually require some use of
a portable streams library...

At the very least, in so much as the callers of BABEL:OCTETS-TO-STRING and
BABEL:STRING-TO-OCTETS are (de)serializers immediately associated with
salza2 and chipz compression, it may be worth considering that both salza2
and chipz can frob octet streams directly and that flexi-streams is more
immediately capable of taking advantage of this than Babel's babel-streams.
A quick glance at the file header of babel/src/streams.lisp may convince you
that Babel is an inferior substitute for Flexi-Streams.
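A quick way to see the practical difference between the two libraries'
defaults (a sketch, not from the thread): Babel defaults string/octet
conversion to UTF-8, while flexi-streams defaults to Latin-1 and must be
told to use UTF-8 explicitly.

```lisp
;; Sketch comparing default encodings: Babel round-trips non-ASCII text
;; with no extra arguments (UTF-8 default), while flexi-streams needs an
;; explicit :EXTERNAL-FORMAT because its default is :LATIN1.
(let ((s "naïve"))
  (list
   ;; Babel: UTF-8 by default
   (string= s (babel:octets-to-string (babel:string-to-octets s)))
   ;; flexi-streams: request UTF-8 on both conversions
   (string= s (flexi-streams:octets-to-string
               (flexi-streams:string-to-octets s :external-format :utf-8)
               :external-format :utf-8))))
;; => (T T)
```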
;-}

FWIW I'm fooling around with a forked v2 - the mon_key branch holds the
current changes. My intent is to explore ways in which vivace-graph might
benefit from incorporation of my uuid system Unicly:

https://github.com/mon-key/unicly

The Unicly system has a dependency on flexi-streams's `string-to-octets'.
Likewise, the uuid system has a dependency on trivial-utf-8's
`string-to-octets'. To that end, my immediate concern is that there is
little need in providing an additional UTF-8 portability package.

Following are the respective signatures of three functions, each of which
satisfies nearly identical requirements: flexi-streams:octets-to-string,
babel:octets-to-string, and trivial-utf-8:utf-8-bytes-to-string:

;; sb-ext:octets-to-string
;;  (vector &key (start 0) end (external-format default))

;; flexi-streams:octets-to-string
;;  (sequence &key (start 0) (end (length sequence)) (external-format :latin1))

;; babel:octets-to-string
;;  (vector &key (start 0) end (encoding *default-character-encoding*)
;;          (errorp (not *suppress-character-coding-errors*)))

;; trivial-utf-8:utf-8-bytes-to-string
;;  (bytes-in &key (start 0) (end (length bytes-in)))

Also, note that where both the Unicly and uuid systems have a dependency on
ironclad, it might be possible to shoehorn an octets-to-string function
using the stuff in octet-streams.lisp, esp. where Ironclad _appears_ to
provide a fairly portable and self-contained implementation of Gray Stream
binary streams already.

From monkey at sandpframing.com Sun Sep 4 21:28:20 2011
From: monkey at sandpframing.com (MON KEY)
Date: Sun, 4 Sep 2011 17:28:20 -0400
Subject: [vivace-graph-devel] triple-equal implemented as v1 uuid comparison? -- issue 1 on github
Message-ID: 

"triple-equal implemented as v1 uuid comparison?"

I'm forwarding a transcript of discussion occurring on github as
vivace-graph-v2 issue "triple-equal implemented as v1 uuid comparison?"
(URL `https://github.com/kraison/vivace-graph-v2/issues/1')

Hopefully this will facilitate continuing the discussion here on the
mailing list instead.

,----
| danlentz opened this issue about 11 hours ago
| triple-equal implemented as v1 uuid comparison?
|
| Does that yield the appropriate semantics? For example if i happen to
| compare it with a triple with the same s p o from a different graph,
| i'd expect them to unify. What prevents duplicate triples from leaking
| into the same graph?
`----

,----
| kraison commented
|
| Dan, would you mind bringing this discussion over to the mailing
| list? You can join it here:
| http://common-lisp.net/project/vivace-graph/
`----

,----
| kraison commented
|
| take a look at add-triple in triples.lisp to see how duplicates are
| avoided. as for unification, there is not an easy answer to how
| triples from different graphs should unify. calling the basic data
| type in VG a triple is perhaps a misnomer; it is really a quint, with
| the graph as one of the 5 slots (the other 4 being s, p, o and
| id). so, i think that if you want to unify across graphs there should
| be some sort of special functor for that purpose, otherwise you
| confound the notion of equality. i see triple-equal as being more akin
| to Lisp's EQL, which compares addresses and not content; the triple's
| uuid is essentially its address. this is of course debatable and i am
| happy to hear dissenting opinions. :) also, see the definitions of
| q-/4 and q-/3 in prolog-functors.lisp for how unification is done.
`----

,----
| danlentz commented
|
| wouldn't that be more equivalent to triple-eq? You're only matching
| on a single instance of the triple. If you delete (really delete i
| mean) and re-add the same spog they wont be eql. wilbur interns
| quads by string comparison of spo constituent node-uri and
| de.setf.resource does sha1 hex-id on the byte array. Well i'll play
| around & also make sure to join the mailing list too.
|
| Enjoy your vacation I look forward to talking again.
`----

,----
| mon-key commented
|
| Hi kraison and danlentz
|
| I think there is room for lots of discussion around VG's
| triple-equality and I would love to contribute and learn more.
|
| Can we take this up on the vivace-graph-devel mailing list?
|
| Tracking communication on github across multiple repos and branches is
| a PITA :)
`----

From raison at chatsubo.net Sun Sep 4 21:27:19 2011
From: raison at chatsubo.net (Kevin Raison)
Date: Sun, 04 Sep 2011 14:27:19 -0700
Subject: [vivace-graph-devel] triple-equal semantics
Message-ID: <4E63ED37.8010605@chatsubo.net>

From the github discussion:

Dan: triple-equal implemented as v1 uuid comparison? Does that yield the
appropriate semantics? For example if i happen to compare it with a triple
with the same s p o from a different graph, i'd expect them to unify. What
prevents duplicate triples from leaking into the same graph?

Kevin: take a look at add-triple in triples.lisp to see how duplicates are
avoided. as for unification, there is not an easy answer to how triples
from different graphs should unify. calling the basic data type in VG a
triple is perhaps a misnomer; it is really a quint, with the graph as one
of the 5 slots (the other 4 being s, p, o and id). so, i think that if you
want to unify across graphs there should be some sort of special functor
for that purpose, otherwise you confound the notion of equality. i see
triple-equal as being more akin to Lisp's EQL, which compares addresses and
not content; the triple's uuid is essentially its address. this is of
course debatable and i am happy to hear dissenting opinions. :) also, see
the definitions of q-/4 and q-/3 in prolog-functors.lisp for how
unification is done.

Dan: wouldn't that be more equivalent to triple-eq? You're only matching on
a single instance of the triple. If you delete (really delete i mean) and
re-add the same spog they wont be eql.
wilbur interns quads by string comparison of spo constituent node-uri and
de.setf.resource does sha1 hex-id on the byte array. Well i'll play around
& also make sure to join the mailing list too. Enjoy your vacation I look
forward to talking again.

-K

From monkey at sandpframing.com Wed Sep 7 08:27:21 2011
From: monkey at sandpframing.com (MON KEY)
Date: Wed, 7 Sep 2011 04:27:21 -0400
Subject: [vivace-graph-devel] every time we UUID 128 bits die down the bit hole
Message-ID: 

While reviewing Franz's documentation of Agraph's UPIs:

http://www.franz.com/agraph/support/documentation/current/lisp-reference.html#function.make-upi

it occurred to me that vivace-graph-v2 should consider using Unicly
https://github.com/mon-key/unicly rather than the current uuid library.

Obviously I'm biased :)

In any event, it's pretty clear that Franz is using some form of UUID
truncated from 16 to 12 bytes for maintaining triple identity.

What isn't clear is whether the top four bytes are needed for type
addressing by the underlying Lisp or if the decision had more to do with a
performance bottleneck with frobbing ~128bit normative UUIDs (e.g. as per
RFC 4122).

Regardless, vivace-graph-v2 should move away from uuid:make-v1-uuid (it's
slow, ugly, and buggy). I would suggest that there may be some significant
gains to be had by:

a) taking advantage of Unicly's fast v3 and v5 UUID generation. I'm
   convinced that vivace-graph-v2 could benefit by caching UUID namespaces
   for its various triple indexes and using these to generate v3/v5 UUIDs
   instead of the current scheme of constantly hashing up disposable UUIDs
   by banging on the system clock!

b) utilizing Unicly's ability to convert UUIDs to/from various
   representations. It might be possible to extend Unicly's bit-vector UUID
   representation out beyond 128 bits in order to allow triples to carry
   type information. Tacking one more octet (#*11111111) onto a Unicly UUID
   bit-vector would buy a lot of space to address types.
Likewise, taking advantage of Unicly's ability to convert UUIDs to integer
values would probably aid certain Btree schemes by branching on numeric
greater-than/less-than as opposed to lexical schemes which frob
string-greater/string-less-than comparisons.

From raison at chatsubo.net Wed Sep 7 19:53:54 2011
From: raison at chatsubo.net (Kevin Raison)
Date: Wed, 07 Sep 2011 12:53:54 -0700
Subject: [vivace-graph-devel] every time we UUID 128 bits die down the bit hole
In-Reply-To: 
References: 
Message-ID: <4E67CBD2.6060908@chatsubo.net>

I am convinced that this is an excellent idea; I also noticed that you have
been working on it in your github branch of VG. Let me know when it is
ready to merge into the mainline so that we can play around.

Also, to justify the original use of make-v1-uuid: it was simply easy and
worked well enough. Now that there are others interested in this project,
it is definitely time to use a better solution.

Cheers,
Kevin

On 9/7/11 1:27 AM, MON KEY wrote:
> While reviewing Franz's documentation of Agraph's UPIs:
>
> http://www.franz.com/agraph/support/documentation/current/lisp-reference.html#function.make-upi
>
> it occured to me that vivace-graph-v2 should consider using Unicly
> https://github.com/mon-key/unicly rather the current uuid library.
>
> Obviously I'm biased :)
>
> In any event, its pretty clear that Franz is using some form of UUID
> truncated from 16 to 12 bytes for maintaining triple identity.
>
> What isn't clear is whether the top four bytes are needed for type
> addressing by the underlying Lisp or if the decision had more to do
> with a performance bottleneck with frobbing ~128bit normative UUIDs
> (e.g. as per RFC 4122).
>
> Regardless, vivace-graph-v2 should move away from uuid:make-v1-uuid
> (its slow, ugly, and buggy) I would suggest that there may be some
> significant gains to be had by:
>
> a) taking advantage of Unicly's fast v3 and v5 UUID generation I'm
>    convinced that vivace-graph-v2 could benefit by caching UUID
>    namespaces for its various triple indexes and using these to
>    generate v3/v5 UUIDs instead of the current scheme of constantly
>    hashing up disposable UUIDs by banging on the system clock!
>
> b) utilizng Unicly's ability to convert UUIDs to/from various
>    representations it might be possible to extend Unicly's bit-vector
>    UUID representation out beyond 128 bits in order to allow triples
>    to carry type information. Tacking one more octet (#*11111111)
>    onto a Unicly UUID bit-vector would buy a lot of space to address
>    types. Likewise, taking advantage of Unicly's ability to convert
>    UUIDs to integer values would prob. aid certain Btree schemes by
>    branching on numeric greater/lessthan as opposed to lexical
>    schemes which frob string-greater/string-lessthan
>
> _______________________________________________
> vivace-graph-devel mailing list
> vivace-graph-devel at common-lisp.net
> http://lists.common-lisp.net/cgi-bin/mailman/listinfo/vivace-graph-devel

From monkey at sandpframing.com Fri Sep 9 07:48:15 2011
From: monkey at sandpframing.com (MON KEY)
Date: Fri, 9 Sep 2011 03:48:15 -0400
Subject: [vivace-graph-devel] every time we UUID 128 bits die down the bit hole
In-Reply-To: <4E67CBD2.6060908@chatsubo.net>
References: <4E67CBD2.6060908@chatsubo.net>
Message-ID: 

On Wed, Sep 7, 2011 at 3:53 PM, Kevin Raison wrote:
> I am convinced that this is an excellent idea; I also noticed that you
> have been working on it in your github branch of VG. Let me know when
> it is ready to merge into the mainline so that we can play around.
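As a minimal sketch of the namespace-caching idea from point (a) above (the
namespace name "vivace-graph.example" is made up for illustration):

```lisp
;; Sketch: mint a namespace UUID once, then derive deterministic v5
;; UUIDs from it.  Unlike clock-driven v1 UUIDs, the same name in the
;; same namespace always hashes to the same UUID, so no system-clock
;; banging is involved.  The namespace string here is hypothetical.
(defvar *triple-namespace*
  (unicly:make-v5-uuid unicly:*uuid-namespace-dns* "vivace-graph.example"))

(unicly:uuid-eql
 (unicly:make-v5-uuid *triple-namespace* "some-subject")
 (unicly:make-v5-uuid *triple-namespace* "some-subject"))
;; => T -- v5 generation is a pure function of namespace and name
```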
Following illustrates some possible utility of Unicly w/r/t indexes
(requires Unicly from Git):

(defvar *global-entity-index* (make-hash-table :test 'unicly:uuid-eql))

(defconstant +context-Z-namespace-as-ub128+ 317192554773903544674993329975922389959)
(defconstant +context-Y-namespace-as-ub128+ 003012593477302450121124084036000723448)

(defvar *context-Z* '())
(defvar *context-Y* '())

(defclass context ()
  ((namespace       :reader namespace)
   (namespace-uuid  :reader namespace-uuid)
   (namespace-table :reader namespace-table)
   (namespace-index :reader namespace-index)))

(defun initialize-context (integer global-idx)
  (let ((instance (make-instance 'context)))
    (setf (slot-value instance 'namespace)
          (unicly:uuid-from-bit-vector
           (unicly::uuid-integer-128-to-bit-vector integer)))
    (setf (slot-value instance 'namespace-uuid)
          (unicly:make-v5-uuid (namespace instance)
                               (unicly:uuid-princ-to-string (namespace instance))))
    (setf (slot-value instance 'namespace-table)
          (make-hash-table :test 'unicly:uuid-eql))
    (setf (slot-value instance 'namespace-index) global-idx)
    (setf (gethash (namespace instance) (namespace-index instance))
          (namespace-uuid instance))
    (setf (gethash (namespace-uuid instance) (namespace-index instance))
          (namespace-table instance))
    instance))

(defun get-entity-in-context (string-entity context-instance &key (set-if-not nil))
  (declare (string string-entity)
           (boolean set-if-not)
           (context context-instance))
  (let ((entity-uuid (unicly:make-v5-uuid (namespace context-instance) string-entity))
        (index (namespace-index context-instance))
        (did-set '()))
    (labels ((get-global-entity-uuid ()
               (gethash entity-uuid index))
             (set-global-entity-uuid ()
               (setf (gethash entity-uuid index) string-entity
                     did-set t)
               entity-uuid)
             (unset-whatset-global-entity ()
               (remhash entity-uuid index)
               (setf did-set nil)
               (return-from get-entity-in-context (values nil nil)))
             (global-entity-chk ()
               (let ((entity-if (get-global-entity-uuid)))
                 (etypecase entity-if
                   (null (if set-if-not
                             (set-global-entity-uuid)
                             (return-from get-entity-in-context nil)))
                   (string (if (string= entity-if string-entity)
                               entity-uuid
                               (return-from get-entity-in-context
                                 (when set-if-not (values nil did-set))))))))
             (deref-context-table ()
               (let* ((entity-chk (global-entity-chk))
                      (context-chk (gethash (namespace context-instance) index))
                      (context-deref
                       (if context-chk
                           ;; we have the uuid of context-namespace
                           (gethash context-chk index)
                           (cond (did-set (unset-whatset-global-entity))
                                 (t (return-from get-entity-in-context nil)))))
                      (table-deref
                       (if context-deref
                           ;; we have the associated context hash-table
                           context-deref
                           (cond (did-set (unset-whatset-global-entity))
                                 (t (return-from get-entity-in-context nil)))))
                      (table-get-if (gethash entity-chk table-deref)))
                 (if table-get-if
                     table-get-if
                     (when set-if-not
                       (setf (gethash entity-chk table-deref) string-entity
                             did-set t))))))
      (if set-if-not
          (values (deref-context-table) did-set)
          (deref-context-table)))))

(setf *context-Z* (initialize-context +context-Z-namespace-as-ub128+ *global-entity-index*))
(setf *context-Y* (initialize-context +context-Y-namespace-as-ub128+ *global-entity-index*))

(setf (gethash (unicly:make-v5-uuid (namespace *context-Z*) "ENTITY-W")
               (namespace-index *context-Z*))
      "ENTITY-W")
(setf (gethash (unicly:make-v5-uuid (namespace *context-Z*) "ENTITY-W")
               (namespace-table *context-Z*))
      "ENTITY-W")
(setf (gethash (unicly:make-v5-uuid (namespace *context-Y*) "ENTITY-W")
               (namespace-index *context-Y*))
      "ENTITY-W")
(setf (gethash (unicly:make-v5-uuid (namespace *context-Y*) "ENTITY-W")
               (namespace-table *context-Y*))
      "ENTITY-W")

(get-entity-in-context "ENTITY-W" *context-Z*)
(get-entity-in-context "ENTITY-W" *context-Y*)
(get-entity-in-context "ENTITY-O" *context-Z*)
(get-entity-in-context "ENTITY-O" *context-Z* :set-if-not t)
(get-entity-in-context "ENTITY-O" *context-Y*)
(get-entity-in-context "ENTITY-O" *context-Y* :set-if-not t)

> Also, to justify the original use of make-v1-uuid: it was simply easy
> and worked well enough.
> Now that there are others interested in this
> project, it is definitely time to use a better solution.

Great! I'm glad you agree.

FTR, following are the timings I get using the uuid library from Quicklisp
comparing make-v1-uuid with make-v4-uuid:

(sb-ext:gc :full t)
(time (dotimes (i 10000) (uuid:make-v1-uuid)))
;=> Evaluation took:
;     8.396 seconds of real time
;     0.500924 seconds of total run time (0.326950 user, 0.173974 system)
;     5.97% CPU
;     25,125,295,748 processor cycles
;     8,235,056 bytes consed

(sb-ext:gc :full t)
(time (dotimes (i 10000) (uuid:make-v4-uuid)))
;=> Evaluation took:
;     0.020 seconds of real time
;     0.018998 seconds of total run time (0.017998 user, 0.001000 system)
;     95.00% CPU
;     58,466,168 processor cycles
;     2,311,272 bytes consed

For posterity here's an illustration of _why_ make-v1-uuid is such a dog:

(in-package #:uuid)

;; Redefine uuid::get-timestamp to show us when it calls sleep:
(let ((uuids-this-tick 0)
      (last-time 0)
      ;; add a counter to track how many times sleep has been called
      (sleep-count 0))
  (defun get-timestamp ()
    "Get timestamp, compensate nanoseconds intervals"
    (unwind-protect
         (tagbody
          restart
            (let ((time-now (+ (* (get-universal-time) 10000000)
                               100103040000000000)))
              ;; 10010304000 is time between 1582-10-15 and 1900-01-01 in seconds
              (cond ((not (= last-time time-now))
                     (setf uuids-this-tick 0
                           last-time time-now)
                     (return-from get-timestamp time-now))
                    (T
                     (cond ((< uuids-this-tick *ticks-per-count*)
                            (incf uuids-this-tick)
                            (return-from get-timestamp (+ time-now uuids-this-tick)))
                           (T
                            ;; add a logging form to show us how many times
                            ;; we sleep per invocation:
                            (format t "slept count: ~D~%" (incf sleep-count))
                            (sleep 0.0001)
                            (go restart)))))))
      (setf sleep-count 0))))

(dotimes (i 10000) (uuid:make-v1-uuid))

`uuid:make-v1-uuid' relies on `uuid::get-timestamp', which evaluates
(sleep 0.0001) quite a bit (at least on my machine, 32bit x86 running SBCL
1.50).

From monkey at sandpframing.com Sun Sep 11 20:00:54 2011
From: monkey at sandpframing.com (MON KEY)
Date: Sun, 11 Sep 2011 16:00:54 -0400
Subject: [vivace-graph-devel] Recent Babel changes - discussion from github
In-Reply-To: 
References: 
Message-ID: 

vivace-graph-v2 + UTF-8

Per kraison's recent inability to build Unicly with SBCL on MacOS and
Lispworks, it may be worth considering now how vivace-graph-v2 will handle
similar issues.

At issue is whether vivace-graph-v2 should constrain all string data to be
composed of characters encoded as UTF-8. I would think that this is a
reasonable constraint to apply given that vivace-graph-v2 is intent on
targeting RDF-aware and/or RDF-like applications where UTF-8 has guaranteed
ubiquity. Indeed, for my part I have a direct and standing need to
represent triple subjects and objects as string data in character sets with
encodings that extend beyond ASCII and LATIN-1, and will consider it a deal
breaker if vivace-graph-v2 is unable to reliably handle UTF-8.

Regardless, to the extent that it is deemed desirable for vivace-graph-v2
to enforce UTF-8 constraints around the string data it manipulates, it is
worth considering how the current system might reliably and reasonably
enforce such a constraint should the underlying system prove incapable of
internally handling UTF-8 character encodings.

As it stands now, a method `serialize' in vivace-graph/serialize.lisp
relies on `babel:string-to-octets' and two `deserialize' methods in
vivace-graph-v2/deserialize.lisp rely on `babel:octets-to-string'.
Currently vivace-graph-v2 has a dependency on the Babel system for
converting strings to/from octets via the Babel functions
`babel:octets-to-string' and `babel:string-to-octets', which each default
their :ENCODING keyword argument to the value of
`babel-encodings:*default-character-encoding*', which itself defaults to
:UTF-8.
IOW, unless explicitly specified otherwise, both `babel:octets-to-string'
and `babel:string-to-octets' will default all string/octet conversions to
:UTF-8 and error in the event that the defaulting behaviour is not
supported by the underlying lisp implementation, as per their defaulting
keyword forms:

(errorp (not *suppress-character-coding-errors*))

In any event, there may be some potential for vivace-graph-v2's
serialization/deserialization routines to fail at inopportune moments given
the following from the file header of babel/src/strings.lisp:

,----
| The usefulness of this string/octets interface of Babel's is very
| limited on Lisps with 8-bit characters which will in effect only
| support the latin-1 subset of Unicode. That is, all encodings are
| supported but we can only store the first 256 code points in Lisp
| strings. Support for using other 8-bit encodings for strings on
| these Lisps could be added with an extra encoding/decoding step.
| Supporting other encodings with larger code units would be silly
| (it would break expectations about common string operations) and
| better done with something like Closure's runes.
`----

The Closure system has a direct dependency on Babel and an indirect
dependency on Flexi-Streams via Closure-html. Which is to say, there is no
reason why either the Babel or Flexi-Streams system should be preferred
over the other insofar as both are likely to remain dependencies of the
vivace-graph-v2 system.

As mentioned already, my personal preference w/r/t UTF-8 and
character-encoding/character conversion interop is for the Flexi-Streams
system and not the Babel system. This preference (mostly trivial) is by no
means a knock on the Babel system, and mostly amounts to my belief that
Flexi has argument signatures more transparently equivalent to the
corresponding SBCL procedures.
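To make the defaulting behaviour concrete, here is a sketch (not from the
thread) of how Babel reacts to an undecodable octet sequence; the special
variable is Babel's, the surrounding code and helper name are illustrative
only.

```lisp
;; Sketch: #xFF alone is not a valid UTF-8 sequence, so with Babel's
;; default (errorp t) the decode signals a condition; binding
;; BABEL-ENCODINGS:*SUPPRESS-CHARACTER-CODING-ERRORS* to T makes Babel
;; substitute a replacement character instead of signalling.
(defun try-decode (octets)
  (handler-case (babel:octets-to-string octets)
    (error (c) (list :decoding-error (type-of c)))))

(let ((bad (make-array 1 :element-type '(unsigned-byte 8)
                         :initial-element #xFF)))
  (list (try-decode bad)                              ; condition, captured
        (let ((babel-encodings:*suppress-character-coding-errors* t))
          (babel:octets-to-string bad))))             ; replacement char
```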
vivace-graph-v2 already has an indirect dependency on flexi-streams via its
dependency on the hunchentoot system (albeit a currently un-needed one);
Hunchentoot in turn has a dependency on flexi-streams. In the event that
vivace-graph-v2 should ever incorporate a direct mechanism for frobbing RDF
data, such a mechanism is very likely to necessitate a dependency on the
CXML system, which currently has a pre-existing dependency on the
Closure-Html system, which in turn currently has a dependency on the
Flexi-Streams system.

From monkey at sandpframing.com Mon Sep 12 03:53:20 2011
From: monkey at sandpframing.com (MON KEY)
Date: Sun, 11 Sep 2011 23:53:20 -0400
Subject: [vivace-graph-devel] Rucksack Btree number-indexes on UUID integer-128 representations -- a sketch
Message-ID: 

Utilizing the idea sketched below could get us much of the way towards a
functional implementation of Rucksack's Btree number-indexes on UUIDs...

Following is a toy prototype which extends Unicly's UUID class
UNIQUE-UNIVERSAL-IDENTIFIER by subclassing UUID-INDEXABLE-V5 and adding two
new slots which "self host" bit-vector and integer-128 representations.
These are populated in an after method specialized on the class
UUID-INDEXABLE-V5, e.g.
with an index spec like:

(btree :key< uuid-< :value= uuid-eql :value-type persistent-object)

(defclass uuid-indexable-v5 (unicly:unique-universal-identifier)
  ((bit-vector  :reader bit-vector-of-uuid)
   (integer-128 :reader integer-128-of-uuid)))

(defmethod initialize-instance :after ((obj uuid-indexable-v5) &key &allow-other-keys)
  (setf (slot-value obj 'bit-vector)
        (unicly:uuid-to-bit-vector obj))
  (setf (slot-value obj 'integer-128)
        (unicly::uuid-bit-vector-to-integer (slot-value obj 'bit-vector))))

(declaim (inline digested-v5-uuid-indexed))
(defun digested-v5-uuid-indexed (v5-digest-byte-array)
  (declare (type unicly::uuid-byte-array-20 v5-digest-byte-array)
           (inline unicly::%uuid_time-low-request
                   unicly::%uuid_time-mid-request
                   unicly::%uuid_time-high-and-version-request
                   unicly::%uuid_clock-seq-and-reserved-request
                   unicly::%uuid_node-request)
           (optimize (speed 3)))
  (the uuid-indexable-v5
       (make-instance 'uuid-indexable-v5
                      :%uuid_time-low (unicly::%uuid_time-low-request v5-digest-byte-array)
                      :%uuid_time-mid (unicly::%uuid_time-mid-request v5-digest-byte-array)
                      :%uuid_time-high-and-version
                      (unicly::%uuid_time-high-and-version-request v5-digest-byte-array 5)
                      :%uuid_clock-seq-and-reserved
                      (unicly::%uuid_clock-seq-and-reserved-request v5-digest-byte-array)
                      :%uuid_clock-seq-low
                      (the unicly::uuid-ub8
                           (unicly::%uuid_clock-seq-low-request v5-digest-byte-array))
                      :%uuid_node (unicly::%uuid_node-request v5-digest-byte-array))))

(defun make-v5-uuid-indexed (namespace name)
  (declare (type string name)
           (type unicly:unique-universal-identifier namespace)
           (inline unicly::uuid-digest-uuid-instance digested-v5-uuid-indexed)
           (optimize (speed 3)))
  (the (values uuid-indexable-v5 &optional)
       (digested-v5-uuid-indexed
        (the unicly::uuid-byte-array-20
             (unicly::uuid-digest-uuid-instance 5 namespace name)))))

(defparameter *tt--indexed* (make-v5-uuid-indexed unicly:*uuid-namespace-dns* "bubba"))
; => *TT--INDEXED*

(bit-vector-of-uuid *TT--INDEXED*)
;=>
#*1110111010100001000100000101111000...

(integer-128-of-uuid *TT--INDEXED*)
;=> 317192554773903544674993329975922389959

(unicly:unique-universal-identifier-p *tt--indexed*)
;=> T

(unicly:uuid-princ-to-string *tt--indexed*)
;=> "eea1105e-3681-5117-99b6-7b2b5fe1f3c7"

(unicly::uuid-to-byte-array *tt--indexed*)
;=> #(238 161 16 94 54 129 81 23 153 182 123 43 95 225 243 199)

(unicly::uuid-from-bit-vector (bit-vector-of-uuid *tt--indexed*))
;=> eea1105e-3681-5117-99b6-7b2b5fe1f3c7

(describe *TT--INDEXED*)
; => eea1105e-3681-5117-99b6-7b2b5fe1f3c7
;    [standard-object]
;
;    Slots with :INSTANCE allocation:
;    { ... %uuid_ slots elided ... }
;    BIT-VECTOR = #*11101110101000010001000001011110001101101000000101010001000101111001..
;    INTEGER-128 = 317192554773903544674993329975922389959

From monkey at sandpframing.com Mon Sep 12 04:38:51 2011
From: monkey at sandpframing.com (MON KEY)
Date: Mon, 12 Sep 2011 00:38:51 -0400
Subject: [vivace-graph-devel] Rucksack Btree number-indexes on UUID integer-128 representations -- a sketch
In-Reply-To: 
References: 
Message-ID: 

Some timings using the procedures in unicly/unicly-timings.lisp.

First we populate an array of 1mil elts, each a random-length string of 1
to 36 randomly chosen UTF-8 characters. Next we get a baseline timing for
unicly::make-v5-uuid by iterating over that array. Our baseline shows that
unicly::make-v5-uuid can generate approx. 79611.50 v5 UUIDs per second.

Following that we iterate over the same array with `make-v5-uuid-indexed'
as per the definition from the previous message. This measurement indicates
we can generate approx. 28376.04 v5 UUIDs per second, where each of these
UUID objects contains a cache of its 128-bit integer representation as well
as an equivalent bit-vector.

IOW, the speed cost of minting a v5 UUID with the two additional cached
slots is approx 2.8x the un-cached version.
This doesn't seem an insurmountable cost given that we would no longer have
to worry about performing a conversion when walking the Btree for
lookup/insertion.

Timings follow. On SBCL 1.0.51.28-42fbc5e x86-32 on Linux using recent
Unicly from Github.

(loop for x from 0 below 1000000
      do (setf (aref *tt--rnd* x) (make-random-string 36)))

(generic-gc)
(time (loop for x across *tt--rnd*
            do (unicly::make-v5-uuid unicly::*uuid-namespace-dns* x)))
; Evaluation took:
;   12.561 seconds of real time
;   12.549092 seconds of total run time (12.526096 user, 0.022996 system)
;   [ Run times consist of 0.837 seconds GC time, and 11.713 seconds non-GC time. ]
;   99.90% CPU
;   20,882,950,670 processor cycles
;   961,227,776 bytes consed

(format nil "~,2F v5 UUIDs per second" (/ 1000000 12.561))
;=> "79611.50 v5 UUIDs per second"

(generic-gc)
(time (loop for x across *tt--rnd*
            do (make-v5-uuid-indexed unicly::*uuid-namespace-dns* x)))
; Evaluation took:
;   35.241 seconds of real time
;   35.243642 seconds of total run time (35.156655 user, 0.086987 system)
;   [ Run times consist of 3.851 seconds GC time, and 31.393 seconds non-GC time. ]
;   100.01% CPU
;   58,586,949,330 processor cycles
;   4,101,235,976 bytes consed

(format nil "~,2F v5 UUIDs per second" (/ 1000000 35.241))
;=> "28376.04 v5 UUIDs per second"

From monkey at sandpframing.com Mon Sep 12 22:56:48 2011
From: monkey at sandpframing.com (MON KEY)
Date: Mon, 12 Sep 2011 18:56:48 -0400
Subject: [vivace-graph-devel] Rucksack Btree number-indexes on UUID integer-128 representations -- a sketch
In-Reply-To: 
References: 
Message-ID: 

I've made some more timings to initially gauge how long it might take to do
a "key-<" lookup on a v4 UUID bit-vector. Also, I've included some examples
which indicate where fanout in a Btree of v4-uuids should occur.
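Relatedly, a "key-<" over full 128-bit UUID bit-vectors can be sketched as
a plain MSB-first lexicographic comparison (a hypothetical helper, not part
of Unicly or of the code below):

```lisp
;; Sketch: compare two 128-bit UUID bit-vectors MSB-first.  Returns T
;; when A orders strictly before B -- i.e. at the first differing bit
;; position A has 0 where B has 1.  Suitable as a Btree :key< over
;; uuid-to-bit-vector representations.
(defun uuid-bit-vector-< (a b)
  (declare (type simple-bit-vector a b))
  (loop for i from 0 below 128
        for bit-a = (sbit a i)
        for bit-b = (sbit b i)
        when (/= bit-a bit-b)
          do (return (< bit-a bit-b))
        finally (return nil)))

;; e.g. (uuid-bit-vector-< (unicly:uuid-to-bit-vector uuid-1)
;;                         (unicly:uuid-to-bit-vector uuid-2))
```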
Following parameter is used by function `find-first-bit' defined below:

(defparameter *bit-vector-bit-table* (make-array 129))

Add 128 bit-vectors, each with one bit set at an offset in the range 0-127
(plus a zeroed vector at index 128):

(flet ((set-one-bit (idx)
         (let ((bv (make-array 128 :element-type 'bit)))
           (setf (sbit bv idx) 1)
           bv)))
  (loop for idx from 0 below 128
        do (setf (aref *bit-vector-bit-table* idx) (set-one-bit idx))
        finally (setf (aref *bit-vector-bit-table* 128)
                      (unicly::uuid-bit-vector-128-zeroed))))

;; Find the first non-zero bit in UUID-BV. This function is slower
;; than `find-first-bit-by-bit' but is more readily adaptable for use
;; with Btrees where there is a requirement to descend nodes. Walk
;; each bit-vector bv-N in *bit-vector-bit-table* taking the
;; `cl:bit-and' of bv-N and UUID-BV. We put the return value of each
;; `cl:bit-and' on the local loop var maybe-not-zeroed. As soon as
;; maybe-not-zeroed is not equal to the equivalent of
;; (unicly::uuid-bit-vector-128-zeroed) we return the index of the
;; non-zero bit found.
(defun find-first-bit (uuid-bv)
  (loop with always-zeroed = (aref *bit-vector-bit-table* 128)
        with maybe-not-zeroed = (unicly::uuid-bit-vector-128-zeroed)
        for bv across *bit-vector-bit-table*
        for cnt from 0 below 128
        do (bit-and bv uuid-bv maybe-not-zeroed)
        until (not (equal always-zeroed maybe-not-zeroed))
        finally (return cnt)))

;; Find the first non-zero bit in UUID-BV and return its position.
(defun find-first-bit-by-bit (uuid-bv)
  (loop for x across uuid-bv
        for y from 0 below 128
        when (plusp (sbit uuid-bv y))
        do (return y)))

This example shows a toy "bv-key->" in which we test only the MSB bit (the
0 bit) for two v4 UUID bit-vectors. Note, we still have +/- 127 more bits
to traverse before we might find which node the bit-vector belongs to:

(loop repeat 100
      collect (> (find-first-bit-by-bit
                  (unicly::uuid-to-bit-vector (unicly::make-v4-uuid)))
                 (find-first-bit-by-bit
                  (unicly::uuid-to-bit-vector (unicly::make-v4-uuid)))))

;; An array of 100k elts.
;; We use it to find the distribution of
;; first-bit non-zero bits in v4 UUIDs.
(defparameter *tt--sample-bv-array* (make-array 100000))

;; Populate the array of variable *tt--sample-bv-array* with 100k new
;; v4 uuid bit-vectors.
(defun make-new-random-uuid-table ()
  (loop for x from 0 below 100000
        do (setf (aref *tt--sample-bv-array* x)
                 (unicly:uuid-to-bit-vector (unicly:make-v4-uuid)))))

;; Do it now.
(make-new-random-uuid-table)
;; (aref *tt--sample-bv-array* 99999)

Timing for `find-first-bit' for all 100k elts of `*tt--sample-bv-array*':

(sb-ext:gc :full t)
(time (loop for x from 0 below 100000
            do (find-first-bit (aref *tt--sample-bv-array* x))))
;; Evaluation took:
;;   0.049 seconds of real time
;;   0.048993 seconds of total run time (0.041994 user, 0.006999 system)
;;   100.00% CPU
;;   147,111,712 processor cycles
;;   4,798,088 bytes consed

;; Timing for `find-first-bit-by-bit' for all 100k elts of
;; `*tt--sample-bv-array*':
(sb-ext:gc :full t)
(time (loop for x from 0 below 100000
            do (find-first-bit-by-bit (aref *tt--sample-bv-array* x))))
;; Evaluation took:
;;   0.009 seconds of real time
;;   0.007999 seconds of total run time (0.007999 user, 0.000000 system)
;;   88.89% CPU
;;   27,299,580 processor cycles
;;   0 bytes consed

;; Evaluating `get-random-uuid-table-distribution' should give an idea
;; of the initial fanout we might expect.
(defun get-random-uuid-table-distribution ()
  (let ((cnt-table (make-hash-table)))
    (loop for x from 0 below 128
          do (setf (gethash x cnt-table) 0))
    (loop initially (make-new-random-uuid-table)
          for x from 0 below 100000
          for y = (find-first-bit-by-bit (aref *tt--sample-bv-array* x))
          do (incf (gethash y cnt-table))
          finally (return
                    (loop for x from 0 below 128
                          collect (cons x (gethash x cnt-table)) into idx-counts
                          finally (return
                                    (remove-if #'null
                                               (map 'list
                                                    #'(lambda (x) (and (plusp (cdr x)) x))
                                                    idx-counts))))))))

(dotimes (i 10) (terpri) (print (get-random-uuid-table-distribution)))

; ((0 . 50005) (1 . 24972) (2 . 12341) (3 . 6323) (4 .
3123) (5 . 1628)
; (6 . 789) (7 . 420) (8 . 192) (9 . 114) (10 . 47) (11 . 27) (12 . 9)
; (13 . 4) (14 . 2) (15 . 3) (19 . 1))
;
; ((0 . 50063) (1 . 25070) (2 . 12402) (3 . 6229) (4 . 3052) (5 . 1630)
; (6 . 769) (7 . 384) (8 . 197) (9 . 115) (10 . 51) (11 . 18) (12 . 10)
; (13 . 6) (14 . 3) (15 . 1))
;
; ((0 . 49870) (1 . 25112) (2 . 12435) (3 . 6268) (4 . 3187) (5 . 1550)
; (6 . 795) (7 . 393) (8 . 189) (9 . 105) (10 . 46) (11 . 25) (12 . 16)
; (13 . 1) (14 . 5) (16 . 3))
;
; ((0 . 49986) (1 . 24947) (2 . 12571) (3 . 6243) (4 . 3121) (5 . 1538)
; (6 . 783) (7 . 411) (8 . 200) (9 . 101) (10 . 48) (11 . 26) (12 . 12)
; (13 . 6) (14 . 4) (15 . 1) (16 . 1) (17 . 1))
;
; ((0 . 49920) (1 . 25063) (2 . 12631) (3 . 6290) (4 . 3060) (5 . 1472)
; (6 . 780) (7 . 377) (8 . 193) (9 . 108) (10 . 44) (11 . 29) (12 . 16)
; (13 . 4) (14 . 5) (15 . 5) (16 . 3))
;
; ((0 . 50068) (1 . 24900) (2 . 12508) (3 . 6245) (4 . 3150) (5 . 1585)
; (6 . 761) (7 . 395) (8 . 194) (9 . 99) (10 . 41) (11 . 35) (12 . 13)
; (13 . 3) (14 . 3))
;
; ((0 . 50170) (1 . 24891) (2 . 12450) (3 . 6221) (4 . 3210) (5 . 1571)
; (6 . 726) (7 . 383) (8 . 189) (9 . 95) (10 . 51) (11 . 21) (12 . 8)
; (13 . 5) (14 . 6) (15 . 2) (17 . 1))
;
; ((0 . 50115) (1 . 24924) (2 . 12751) (3 . 6090) (4 . 3049) (5 . 1541)
; (6 . 769) (7 . 381) (8 . 185) (9 . 88) (10 . 54) (11 . 25) (12 . 14)
; (13 . 9) (14 . 1) (15 . 2) (17 . 2))
;
; ((0 . 49698) (1 . 25089) (2 . 12637) (3 . 6268) (4 . 3143) (5 . 1595)
; (6 . 753) (7 . 382) (8 . 209) (9 . 114) (10 . 59) (11 . 28) (12 . 9)
; (13 . 9) (14 . 4) (15 . 2) (16 . 1))
;
; ((0 . 49755) (1 . 25134) (2 . 12420) (3 . 6368) (4 . 3104) (5 . 1596)
; (6 . 802) (7 . 421) (8 . 179) (9 . 103) (10 . 51) (11 . 28) (12 . 21)
; (13 . 12) (14 . 4) (16 . 2))

:NOTE One thing to take into consideration is that a Btree scheme that frobs the UUID bit-vector might want to take care to be unicly::uuid-version-bit-vector aware. E.g.
the output from the following example makes it pretty clear that any node branching on the value of bit 49 is gonna always contain every v4 UUID. Note also that this is equally true of the uuid-integer-128 representation... After evaluating the form below you should see a line of 1's at column 52 in your slime-repl:

(dotimes (i 100 (terpri))
  (terpri)
  (unicly:uuid-print-bit-vector t (unicly:make-v4-uuid)))

From danlentz at gmail.com  Tue Sep 13 14:44:31 2011
From: danlentz at gmail.com (Dan Lentz)
Date: Tue, 13 Sep 2011 10:44:31 -0400
Subject: [vivace-graph-devel] elephant
Message-ID: 

I've been thinking about persistent index strategies, and have read through the paper on fpb+trees, and have had a few thoughts.

The first and simplest is to make use of elephant. It's not very exotic of course, but it would allow a model in which triples can be first class objects, yet leverage a reasonably performant back end (bdb). In addition, the set-valued slots and association slots are nice abstractions on top of which to build the rdf semantics (properties, extensions) on top of a real clos mop.

I figured I'd shoot the idea onto the mailing list to get a feel for the degree and nature of agreement/disagreement.

-------------- next part --------------
An HTML attachment was scrubbed...
URL: 

From raison at chatsubo.net  Tue Sep 13 16:15:59 2011
From: raison at chatsubo.net (Kevin Raison)
Date: Tue, 13 Sep 2011 09:15:59 -0700
Subject: [vivace-graph-devel] elephant
In-Reply-To: 
References: 
Message-ID: <4E6F81BF.5060001@chatsubo.net>

Dan, I actually already tried using elephant at a very early stage in the development of VG. While elephant is an excellent library (which I have used in many projects), it is simply too slow to be of use here. One of the goals of VG is to be fast and to be able to handle billions of triples. Elephant slows down very quickly because of a number of factors, including its use of BerkeleyDB rather than a native Lisp back-end store, as well as its complexity.
VG does not need the level of complexity or abstraction that you get with elephant's indexes and class redefinition logic. In our case, we are dealing with one class, the triple, and as such, we can be very specific about how we store and index it as well as how we deal with it in memory. Standard b-trees simply won't efficiently handle the fanout of a large triple store; we need something specifically tuned to our purpose. I am fairly certain that linear hashing for triple storage combined with b-tries or fb-trees for indexing would do much better. There are other graph dbs out there that use this strategy. See http://blog.directededge.com/2009/02/27/on-building-a-stupidly-fast-graph-database/ for some good discussion. Another goal of mine is to develop a native Lisp back-end that projects like elephant might be able to take advantage of; not relying on external, non-Lisp libraries is a good thing, especially BerkeleyDB, given its terrible licensing terms (thanks, Oracle). You mention that you had some further thoughts after reading the fractal pre-fetching b-trees paper; care to share? -Kevin On 09/13/2011 07:44 AM, Dan Lentz wrote: > I've been thinking about persistent index strategies, and have read > through the paper on fpb+trees, and have had a few thoughts. > > The first and simplest is to make use of elephant. Its not very exotic > or course but it would allow a model in which triples can be first class > objects, yet leverage a reasonably performant back end (bdb). In > addition, the set-valued slots and association slots are nice > abstractions on top of which to build the rdf semantics (properties, > extensions) on top of a real clos mop. > > I figured I'd shoot the idea onto the mailing list to get a feel for the > degree and nature of agreement/disagreement. 
>
>
> _______________________________________________
> vivace-graph-devel mailing list
> vivace-graph-devel at common-lisp.net
> http://lists.common-lisp.net/cgi-bin/mailman/listinfo/vivace-graph-devel

From monkey at sandpframing.com  Wed Sep 14 02:36:03 2011
From: monkey at sandpframing.com (MON KEY)
Date: Tue, 13 Sep 2011 22:36:03 -0400
Subject: [vivace-graph-devel] elephant
In-Reply-To: <4E6F81BF.5060001@chatsubo.net>
References: <4E6F81BF.5060001@chatsubo.net>
Message-ID: 

Hi Dan & Kevin,

On Tue, Sep 13, 2011 at 12:15 PM, Kevin Raison wrote:
> Dan, I actually already tried using elephant at a very early stage in the
> development of VG. While elephant is an excellent library (which I have
> used in many projects), it is simply too slow to be of use here. One of the
> goals of VG is to be fast and to be able to handle billions of triples.

What kind of value are you shooting for here? E.g. What do you think is a reasonable value for X?

(funcall #'(lambda (x) (format nil "~R triples?" (* 1000 1000000 x))) ???)
FWIW I would suggest that X need not be anything larger than 8 :) and that realistically this is a more reasonable upper bounds:

(format nil "~R triples" #x7fffffff)

`cl:sxhash' return value HASH-CODE is specified as a non-negative fixnum:

,----
| The HASH-CODE is intended for hashing. This places no verifiable
| constraint on a conforming implementation, but the intent is that
| an implementation should make a good-faith effort to produce
| HASH-CODES that are well distributed within the range of
| non-negative fixnums.
`----

So, assuming the underlying Lisp does make a good-faith effort to distribute hash-codes across the range of 0,most-positive-fixnum, if we wanted one gimongous linearly-hashed table of ~8 billion triple thingies wouldn't we need _at least_ a theoretical

(format nil "~R fixnums" #X3FFFFFFFF)

for indexing such a gimongoid? And if so, would we not run out of fixnums at:

(format nil "somewhere around fixnum ~R" #x1FFFFFFF)

on an x86-32 SBCL?

In any event, assuming fixnums are the lightest possible serializable key in a hash-table and that the range of those keys has a performative upper bounds of #x1FFFFFFF on x86-32 SBCL and #x3FFFFFFFF on x86-64, we're likely to require at least ~3 bytes of diskspace per 32bit key and ~7 bytes per 64bit key.
(format nil "Noting that:~% ~R triples~% is a ~A bit number"
        #x3FFFFFFFF
        (integer-length (1- (ash 1 64))))

So serializing just the fixnum integer keys of

(format nil "~R triples" #x7fffffff)

is liable to require

(format nil "somewhere less than ~D GB on a 32bit machine"
        (nth-value 0 (round (* #x7fffffff 3) (* 1024 1024 1024))))

(format nil "somewhere less than ~D GB on a 64bit machine"
        (nth-value 0 (round (* #x7fffffff 7) (* 1024 1024 1024))))

(format nil "Somewhere slightly less than ~D GB on a 32bit machine~%~
             and somewhere slightly less than ~D GB on a 64bit machine"
        (nth-value 0 (round (* #x7fffffff 3) (* 1024 1024 1024)))
        (nth-value 0 (round (* #x7fffffff 7) (* 1024 1024 1024))))

Please correct me if my math is out of whack?

Also, assuming these are reasonable amounts for serializing just the hash-table fixnum integer keys of:

(format nil "~R triples" #x7fffffff)

Is it not reasonable to assume that the above values might serve as a good guidepost for what we might expect of an in-memory footprint of the data-structures holding those:

(format nil "~R triples" #x7fffffff)

Even with MMAPing is there not still some significant overhead associated with deserializing the MMAPPed data to something lispy?

> Elephant slows down very quickly because of a number of factors, including
> its use of BerkeleyDB rather than a native Lisp back-end store, as well as

Not to mention there are some licensing issues surrounding BDB... ... FSVO "some" ...

> its complexity. VG does not need the level of complexity or abstraction
> that you get with elephant's indexes and class redefinition logic.

Naively, I would expect the MOP stuff to be a factor. Is it?

> In our
> case, we are dealing with one class, the triple, and as such, we can be very
> specific about how we store and index it as well as how we deal with it in
> memory. Standard b-trees simply won't efficiently handle the fanout of a
> large triple store; we need something specifically tuned to our purpose.

Why not?
I'm under the impression that many of the big Linux distros will soon release with Btrfs as the default file-system... and either Ted Ts'o is sandbagging for google with his recent endorsement of Btrfs over ext4 or there must be at least some utility for B+trees :)

Also, what is a "standard" b-tree? FTR I confuse myself when referencing b-trees :) It might be helpful to establish some protocol for the datastructures in question -- a wiki-link would suffice.

> am fairly certain that linear hashing for triple storage combined with
> b-tries or fb-trees for indexing would do much better.

While i'm not convinced that b-tries are TRT, I certainly agree that linear hashing is (at least for in memory data)!

FWIW apropos all the recent UUID bit-vector timing junk i've posted here recently I figured it might be prudent to take some measurements using just 128-bit integers for indexing... I was pleasantly surprised to find that even with the relatively big 128bit bignums SBCL hash-table lookup is pretty damn snappy with a large(ish) number of key/value pairs in the range of 500k-1mil. Indeed, once optimizations around the allocation of the underlying hash-table are made by massaging the value given to make-hash-table's :SIZE keyword it gets even better!

> There are other graph dbs out there that use this strategy. See
> http://blog.directededge.com/2009/02/27/on-building-a-stupidly-fast-graph-database/
> for some good discussion.

One of the bullet-point API implementation details I found interesting: "Items are identified by a string unique identifier which maps to an integer index." Is this implying the effective equivalent of:

(assoc "stringy-id" '(("stringy-id" . 123456789)) :test 'equal)

Or the inverse:

(assoc 123456789 '((123456789 . "stringy-id")))

??

,----
| This is another point that we break from typical database design
| theory.
| In a typical database you'd look up a record by checking for
| it in a B-tree and then go off to find the data pointer for its
| record, which might have a continuation record at the end that you
| have to look more stuff up in ... and so on. Our lookups are constant
| time. We hash the string identifier to get an Index and then use that
| Index to find the appropriate Offsets for its data. These vectors are
| sequential on disk, rather than using continuation tables or something
| similar, which makes constant time lookups possible.
`----

The section "File-based Data Structures: Stack, Vector, Linear Hash" doesn't sound entirely unlike the toy example of self-resolving string-entities using `initialize-context'/`get-entity-in-context' which i posted here the other day:

http://lists.common-lisp.net/pipermail/vivace-graph-devel/2011-September/000008.html

>
> Another goal of mine is to develop a native Lisp back-end that projects like
> elephant might be able to take advantage of;

I would like to interject that while vivace-graph-v2 may not be targeted as a full blown Persistent Object Store it does have the potential to re-think some of the cool functionality of Statice using SPROG instead of an object hierarchy:

http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/statice.html

> not relying on external,
> non-Lisp libraries is a good thing, especially BerkeleyDB, given its
> terrible licensing terms (thanks, Oracle).

It's not just Oracle that has left the BDB license in shambles...

Regardless, I personally have a strong desire to keep integration with external (read non-lispy) tools to a minimum.

From raison at chatsubo.net  Wed Sep 14 05:15:56 2011
From: raison at chatsubo.net (Kevin Raison)
Date: Tue, 13 Sep 2011 22:15:56 -0700
Subject: [vivace-graph-devel] elephant
In-Reply-To: <4E6F81BF.5060001@chatsubo.net>
References: <4E6F81BF.5060001@chatsubo.net>
Message-ID: <4E70388C.7060408@chatsubo.net>

A good paper on linear hashing to disk is attached.
I will respond to Mon key's comments after some sleep... On 09/13/2011 09:15 AM, Kevin Raison wrote: > Dan, I actually already tried using elephant at a very early stage in > the development of VG. While elephant is an excellent library (which I > have used in many projects), it is simply too slow to be of use here. > One of the goals of VG is to be fast and to be able to handle billions > of triples. Elephant slows down very quickly because of a number of > factors, including its use of BerkeleyDB rather than a native Lisp > back-end store, as well as its complexity. VG does not need the level of > complexity or abstraction that you get with elephant's indexes and class > redefinition logic. In our case, we are dealing with one class, the > triple, and as such, we can be very specific about how we store and > index it as well as how we deal with it in memory. Standard b-trees > simply won't efficiently handle the fanout of a large triple store; we > need something specifically tuned to our purpose. I am fairly certain > that linear hashing for triple storage combined with b-tries or fb-trees > for indexing would do much better. There are other graph dbs out there > that use this strategy. See > http://blog.directededge.com/2009/02/27/on-building-a-stupidly-fast-graph-database/ > for some good discussion. > > Another goal of mine is to develop a native Lisp back-end that projects > like elephant might be able to take advantage of; not relying on > external, non-Lisp libraries is a good thing, especially BerkeleyDB, > given its terrible licensing terms (thanks, Oracle). > > You mention that you had some further thoughts after reading the fractal > pre-fetching b-trees paper; care to share? > > -Kevin > > > On 09/13/2011 07:44 AM, Dan Lentz wrote: >> I've been thinking about persistent index strategies, and have read >> through the paper on fpb+trees, and have had a few thoughts. >> >> The first and simplest is to make use of elephant. 
Its not very exotic >> or course but it would allow a model in which triples can be first class >> objects, yet leverage a reasonably performant back end (bdb). In >> addition, the set-valued slots and association slots are nice >> abstractions on top of which to build the rdf semantics (properties, >> extensions) on top of a real clos mop. >> >> I figured I'd shoot the idea onto the mailing list to get a feel for the >> degree and nature of agreement/disagreement. >> >> >> _______________________________________________ >> vivace-graph-devel mailing list >> vivace-graph-devel at common-lisp.net >> http://lists.common-lisp.net/cgi-bin/mailman/listinfo/vivace-graph-devel > > _______________________________________________ > vivace-graph-devel mailing list > vivace-graph-devel at common-lisp.net > http://lists.common-lisp.net/cgi-bin/mailman/listinfo/vivace-graph-devel -------------- next part -------------- A non-text attachment was scrubbed... Name: e_ds_linearhashing.pdf Type: application/pdf Size: 108658 bytes Desc: not available URL: From raison at chatsubo.net Wed Sep 14 05:30:53 2011 From: raison at chatsubo.net (Kevin Raison) Date: Tue, 13 Sep 2011 22:30:53 -0700 Subject: [vivace-graph-devel] elephant In-Reply-To: <4E70388C.7060408@chatsubo.net> References: <4E6F81BF.5060001@chatsubo.net> <4E70388C.7060408@chatsubo.net> Message-ID: <4E703C0D.3090306@chatsubo.net> And one more paper on linear hashing. On 09/13/2011 10:15 PM, Kevin Raison wrote: > A good paper on linear hashing to disk is attached. I will respond to > Mon key's comments after some sleep... > > On 09/13/2011 09:15 AM, Kevin Raison wrote: >> Dan, I actually already tried using elephant at a very early stage in >> the development of VG. While elephant is an excellent library (which I >> have used in many projects), it is simply too slow to be of use here. >> One of the goals of VG is to be fast and to be able to handle billions >> of triples. 
Elephant slows down very quickly because of a number of >> factors, including its use of BerkeleyDB rather than a native Lisp >> back-end store, as well as its complexity. VG does not need the level of >> complexity or abstraction that you get with elephant's indexes and class >> redefinition logic. In our case, we are dealing with one class, the >> triple, and as such, we can be very specific about how we store and >> index it as well as how we deal with it in memory. Standard b-trees >> simply won't efficiently handle the fanout of a large triple store; we >> need something specifically tuned to our purpose. I am fairly certain >> that linear hashing for triple storage combined with b-tries or fb-trees >> for indexing would do much better. There are other graph dbs out there >> that use this strategy. See >> http://blog.directededge.com/2009/02/27/on-building-a-stupidly-fast-graph-database/ >> >> for some good discussion. >> >> Another goal of mine is to develop a native Lisp back-end that projects >> like elephant might be able to take advantage of; not relying on >> external, non-Lisp libraries is a good thing, especially BerkeleyDB, >> given its terrible licensing terms (thanks, Oracle). >> >> You mention that you had some further thoughts after reading the fractal >> pre-fetching b-trees paper; care to share? >> >> -Kevin >> >> >> On 09/13/2011 07:44 AM, Dan Lentz wrote: >>> I've been thinking about persistent index strategies, and have read >>> through the paper on fpb+trees, and have had a few thoughts. >>> >>> The first and simplest is to make use of elephant. Its not very exotic >>> or course but it would allow a model in which triples can be first class >>> objects, yet leverage a reasonably performant back end (bdb). In >>> addition, the set-valued slots and association slots are nice >>> abstractions on top of which to build the rdf semantics (properties, >>> extensions) on top of a real clos mop. 
>>>
>>> I figured I'd shoot the idea onto the mailing list to get a feel for the
>>> degree and nature of agreement/disagreement.
>>>
>>>
>>> _______________________________________________
>>> vivace-graph-devel mailing list
>>> vivace-graph-devel at common-lisp.net
>>> http://lists.common-lisp.net/cgi-bin/mailman/listinfo/vivace-graph-devel
>>
>> _______________________________________________
>> vivace-graph-devel mailing list
>> vivace-graph-devel at common-lisp.net
>> http://lists.common-lisp.net/cgi-bin/mailman/listinfo/vivace-graph-devel
>
>
> _______________________________________________
> vivace-graph-devel mailing list
> vivace-graph-devel at common-lisp.net
> http://lists.common-lisp.net/cgi-bin/mailman/listinfo/vivace-graph-devel

-------------- next part --------------
A non-text attachment was scrubbed...
Name: p195-ellis.pdf
Type: application/pdf
Size: 1883801 bytes
Desc: not available
URL: 

From monkey at sandpframing.com  Wed Sep 14 15:09:06 2011
From: monkey at sandpframing.com (MON KEY)
Date: Wed, 14 Sep 2011 11:09:06 -0400
Subject: [vivace-graph-devel] elephant
In-Reply-To: <4E703C0D.3090306@chatsubo.net>
References: <4E6F81BF.5060001@chatsubo.net>
	<4E70388C.7060408@chatsubo.net>
	<4E703C0D.3090306@chatsubo.net>
Message-ID: 

Hi Kevin,

Thanks for the links to the papers. I'm reviewing and finding them quite informative! Looks like I was really abusing the term "linear hashing" :(

--
/s_P\

From danlentz at gmail.com  Wed Sep 14 15:25:41 2011
From: danlentz at gmail.com (Dan Lentz)
Date: Wed, 14 Sep 2011 11:25:41 -0400
Subject: [vivace-graph-devel] nodes, fixnums/upper bounds, and multi-constituent indices
Message-ID: 

I am still reading through all the homework recommended in recent posts :) Really good stuff.
I hope my questions are not a distraction from the important topics at hand but just contribute toward general discussion and (at least my) understanding of the project, its goals, how I can utilize VG and perhaps, in some way, try to contribute to the effort, if possible.

part 1

Another topic I have been looking at related to the indexing and uuid's is the representation (reification?) of nodes, or lack thereof.

One difference in vivace graph versus other tstores I've played with is the ability to reference nodes as first class "things". This is called a "node" in wilbur, and is represented by a simple object composed of the canonical identifier (uri-namestring) and a flag to indicate "resolution", which, for wilbur, indicates identification to a short/long namespace mapping, but I think the concept can be extended to also perhaps refer to hashing or other deferrable operations. In the Directed Edge model, nodes are apparently considered "Items" and have a somewhat richer archetype.

In VG, this is not the case? Triples are (currently) represented by time based uuid as previously discussed, and nodes themselves are not hashed and indexed. Maybe this is going to change naturally in the course of moving to v5 uuid?

part 2

This sort of blends into another indexing-model question, related to the current model which is based on a hierarchical index structure? Couldn't additional speed be achieved through multi-constituent indexing? I.e. an SP index, a PO index, etc., in which multiple nodes of a single triple are hashed in the aggregate to allow for direct lookup. This would of course decrease the upper bounds on the number of triples previously discussed if housed in a single-rooted index structure, as there would be (eventually) collisions between these incongruent indexing schemes. So maybe a multi-rooted index strategy is something that should be considered and incorporated early on.
I think this is already partially implemented as spogi, gsopi, etc, but is still "single-constituent" hierarchical? As a concrete example -- in case my question has been as clear as mud :) -- i'd cite the cassandra-spoc-index-mediator of de.setf.resource, which leverages multi-constituent indexes extensively. Apologies (as usual) if I am missing something obvious or distracting from more useful conversation. Dan -------------- next part -------------- An HTML attachment was scrubbed... URL: From raison at chatsubo.net Wed Sep 14 20:01:29 2011 From: raison at chatsubo.net (Kevin Raison) Date: Wed, 14 Sep 2011 13:01:29 -0700 Subject: [vivace-graph-devel] elephant In-Reply-To: References: <4E6F81BF.5060001@chatsubo.net> Message-ID: <4E710819.6090107@chatsubo.net> Comments inline below. > Even with MMAPing there is there not still some significant overhead > associated with deserializing the MMAPPed data to something lispy? Yes, for persistence, this is unavoidable; the solution is heavy caching and interning of strings (VG already does this). I started work on some code a long time ago that worked like this: One mmap'ed linear hash file per graph, with the triple-id as hash key. The value slot is an offset (integer) into another mmap'ed file (or files) which is the actual triple storage area. Because triples are never truly deleted, only a simple memory allocator is needed for the triple storage file (append only). For indices, use some b-tree variant (possibly start with cl-btree: http://www.cliki.net/cl-btree or a b-trie: http://www.naskitis.com/naskitis-vldbj09.pdf) that maps keys to triple-ids. So a non-cached lookup via triple-id would hit the mmap'ed hash table, get an offset into the triple storage area, deserialize the triple into the cache and return the struct (or vector if we want to be really efficient). 
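In toy form, the non-cached lookup path just described might be sketched as follows. This is only an illustration, not VG code: a plain hash-table and an adjustable vector stand in for the mmap'ed hash file and the append-only triple storage area, and every name here is made up.

```lisp
;; *id->offset* stands in for the mmap'ed linear hash file,
;; *storage* for the append-only triple storage area, and
;; *cache* for the in-memory triple cache.
(defparameter *id->offset* (make-hash-table))
(defparameter *storage* (make-array 0 :adjustable t :fill-pointer 0))
(defparameter *cache* (make-hash-table))

(defun store-triple (id s p o)
  "Append the triple to storage and record its offset under ID."
  (let ((offset (fill-pointer *storage*)))
    (vector-push-extend (vector s p o) *storage*)
    (setf (gethash id *id->offset*) offset)
    id))

(defun lookup-triple (id)
  "Return the triple for ID from the cache, faulting it in on a miss."
  (or (gethash id *cache*)
      (let ((offset (gethash id *id->offset*)))
        (when offset
          ;; a real store would seek to OFFSET on disk and deserialize here
          (setf (gethash id *cache*) (aref *storage* offset))))))
```

A second (lookup-triple id) on the same id is then served from *cache* without touching the storage area, which is the slow-down the caching is meant to avoid.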
A search via an index would return a triple-id, which would hit the cache and return the triple or pass through to the hash table and repeat the deserialization process described above. AllegroGraph will use up as much memory as is available for caching, and with good reason; the more triples we keep in memory, the less slow-down we get. We might even have two layers of cache: cache triples themselves, as well as queries that map to offsets in the data file or to triple-ids. Also, since the hash table will be fairly small (integers -> integers), loading the whole thing into memory should be possible. >> its complexity. VG does not need the level of complexity or abstraction >> that you get with elephant's indexes and class redefinition logic. > > Naively, I would expect the MOP stuff to be a factor. Is it? Yes, who needs the MOP in this circumstance? I would prefer triples to be stored in memory as vectors (using the args to defstruct to force the use of vector storage) for the sake of efficiency. >> In our >> case, we are dealing with one class, the triple, and as such, we can be very >> specific about how we store and index it as well as how we deal with it in >> memory. Standard b-trees simply won't efficiently handle the fanout of a >> large triple store; we need something specifically tuned to our purpose. > > Why not? Think about what a triple really is: each S, P or O is a completely unique thing in the database. For the triple (Kevin likes cats), three symbols are created: 'Kevin, 'likes, and 'cats. Another triple, say (Kevin likes dogs) only creates one new symbol: 'dogs. There are not now two 'Kevin's in the db, but two triples that reference that one symbol. 
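The interning described above might be illustrated like this (a minimal sketch; the table and function names are made up for illustration and are not VG's):

```lisp
;; Each S, P, or O string is interned exactly once; a later triple
;; that reuses a constituent gets back the very same object.
(defparameter *constituents* (make-hash-table :test 'equal))

(defun intern-constituent (name)
  (or (gethash name *constituents*)
      (setf (gethash name *constituents*) (make-symbol name))))

(defun make-triple (s p o)
  (list (intern-constituent s) (intern-constituent p) (intern-constituent o)))

(make-triple "KEVIN" "LIKES" "CATS") ; interns KEVIN, LIKES, and CATS
(make-triple "KEVIN" "LIKES" "DOGS") ; interns only DOGS
;; (hash-table-count *constituents*) => 4
```

Two triples, but only four constituents in the table: the db holds one 'KEVIN referenced by both.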
When indexing something like this in a btree, each symbol is a node in the tree, and each level of the btree would correspond to a slot in the triple; for example, to index in order of subject, predicate, object, the tree would be structured as

       S
      / \
     P   P
    / \   \
   O   O   O

Because nodes don't repeat and are atomic symbols, traversing the tree would be a linear search at each level. In a b-trie (http://www.naskitis.com/naskitis-vldbj09.pdf), you could effectively string the S, P and O together and create more reasonably branching search paths. For example, to index the two triples mentioned above with a third, (Kevin loves pizza), you would have a tree like:

          K
          |
          E
          |
          V
          |
          I
          |
          N
          |
          L
         / \
        I   O
        |   |
        K   V
        |   |
        E   E
        |   |
        S   S
       / \    \
    CATS  DOGS  PIZZA
      |     |      |
     ID    ID     ID

This would also allow for substring matching in a very simple way. Read the paper for more details and a comparison to B+ trees.

>> Another goal of mine is to develop a native Lisp back-end that projects like
>> elephant might be able to take advantage of;

> I would like to interject that while vivace-graph-v2 may not be
> targetted as a full blown Persistent Object Store it does have the
> potential to re-think some of the cool functionality of Statice using
> SPROG instead of an object hierarchy:
>
> http://www.sts.tu-harburg.de/~r.f.moeller/symbolics-info/statice.html

I don't have time to look at this right now; will revisit and comment later.

>> not relying on external,
>> non-Lisp libraries is a good thing, especially BerkeleyDB, given its
>> terrible licensing terms (thanks, Oracle).
>
> Its not just Oracle that have left BDB license in shambles...
>
> Regardless, I personally have a strong desire to keep integration with
> external (read non-lispy) tools to a minimum.

YES!
-K

From monkey at sandpframing.com  Wed Sep 14 20:35:40 2011
From: monkey at sandpframing.com (MON KEY)
Date: Wed, 14 Sep 2011 16:35:40 -0400
Subject: [vivace-graph-devel] nodes, fixnums/upper bounds, and multi-constituent indices
In-Reply-To: 
References: 
Message-ID: 

Hi Dan,

On Wed, Sep 14, 2011 at 11:25 AM, Dan Lentz wrote:
> I am still reading though all the homework recommended in recent posts :)

Me too :)

> Really good stuff.

I'm learning a lot as well.

> I hope my questions are not a distraction from the important topics
> at hand but just contribute toward general discussion and (at least
> my) understanding of the project, its goals, how I can utilize VG
> and perhaps, in some way, try to contribute to the effort, if
> possible.

I share many of the same questions and appreciate not having to ask them myself. Also, I've found it extremely useful to have recourse to the dialogues, discussions, questions, and answers on other archived common-lisp.net mailing lists esp. for projects with specs/API which were finalized years ago and the designers have since moved on, stopped active development, or are in maintenance only mode (Rucksack comes immediately to mind).

> Another topic I have been looking at related to the indexing and
> uuid's is the representation (reification?) of nodes, or lack
> thereof.
>
> One difference in vivace graph versus other tstores I've played with
> is the ability to reference nodes as first class "things".

Maybe because they resolve to first class Lisp objects and don't resort to mediating the inferior objects spat out by lesser non-lispy sources :) No doubt this will eventually change once the VG2 transaction/persistence/indexing stuff is better established (hopefully sooner than later).

> Another topic I have been looking at related to the indexing and uuid's is
> the representation (reification?) of nodes, or lack thereof.

VG2 is Kevin's baby and he's the boss, so I hope i'm not stepping on toes by interjecting.
My impression is that implementations approaching SPOG triples tend to have some hard-wired implicit assumptions about the operational semantics of SPOG, and that these assumptions are likely to yield a relatively constant subset of basic operations over the triples regardless of implementation. Which is to say, the basic idiom for how one might perform these operations is established (independent of whether VG currently implements them or not). Where VG2 might differ or deviate from other implementations is not w/r/t SPOG but rather SPOGI, e.g. triple-id (and by proxy triple-indexes).

> In VG, this is not the case? Triples are (currently) represented by time
> based uuid as previously discussed, and nodes themselves are not hashed and
> indexed. Maybe this is going to change naturally in the course of moving to
> v5 uuid?

IMHO it is not a given that a change to a namespacing UUID would necessarily change the existing VG2 assumptions. The role of UUIDs (potential and current) in VG2 is multi-faceted: v1 UUIDs are slow, and if you don't require their time-stamping then a v4 UUID is a better solution when all that is really required is an anonymous but reasonably unique ID. There is no immediate gain to be had by using a namespaced (v3 or v5) UUID instead of an anonymous v4 UUID. In fact, there would be a loss in performance b/c there is more overhead associated with the minting of v3/v5 UUIDs. If you assume that any SPOGI implementation must concern itself with "namespacing" then there _may_ be some gain in using v3/v5 UUIDs instead of v4 UUIDs. Whether this is the case depends on how the system implements:

- triple-id
  Whether the base identity is a string, integer, class-instance, etc.
- triple-id-resolution
  How the _base_ representation of triple identity resolves to intermediate and higher-level representations

- triple-indexing
  How base triple identity is indexed and how the indexed identities are resolved with their intermediate and higher-level representations

- triple-persistence
  Whether the system can/should preserve state across sessions. If preserving state is a goal then the degree to which the data-structures employed for triple-indexing are in-memory bound or require disk-i/o becomes a factor. If the system can remain performant with only an in-memory footprint then the majority of persistence issues are moot.

- triple-performance
  What is a reasonable upper bound on the number of triples the system should expect to handle? Should the system handle networked/distributed/concurrent access? How will implementations of networked/distributed/concurrent triple access scale?

Obv. there are interdependencies among the set of considerations outlined above.

> part 2
>
> This sort of blends into another indexing-model question, related to
> the current model which is based on a hierarchical index structure?
> Couldn't additional speed be achieved though multi-constituent
> indexing? IE an SP index, a PO index, etc. in which multiple nodes of
> a single triple are hashed in the aggregate to allow for direct
> lookup.

Having spent some more time looking at the linear-hashing papers Kevin provided, I'm having trouble seeing this as an either/or situation; e.g. underneath it's all gonna eventually wind up as hash-tables, arrays, integers, and de-referenced bucket/node/leaf/offset pointers :)

I'm interested to learn from Kevin how much of the linear-hashing scheme he believes is already "built-in" to the existing VG2 code-base.

In particular, is most of the footwork for the linear-hashing work already in place?

And, if not, whether there is some drop-in data-structure capable of implementing the linear-hashing scheme he envisions.
And, if not, what does he anticipate is required to implement a functional linear-hashing scheme as he envisions?

> As a concrete example -- in case my question has been as clear as
> mud :) -- i'd cite the cassandra-spoc-index-mediator of
> de.setf.resource, which leverages multi-constituent indexes
> extensively.

My impression is that de.setf.resource has taken this approach b/c it is in large part a meta-library for CLOS<->RDF compatibility, and the underlying constraints required to accommodate RDF require it.

My read on the quoted section below is that Anderson's quotation marks around "open-world" are meant as a mild slight on the RDF fanboys at W3C, in so much as (of itself) RDF is not capable of reasoning in either closed- or open-world contexts. Regardless, the following quote also provides some indication of how/why Anderson has made use of UUIDs w/r/t external resources, namely that the need for unique identities is as much a function of preserving transactional context as it is one of maintaining mappings of object-identity equivalence.

,----
| Persistence Mediation
|
| Despite the RDF "open-world" paradigm, which requires a processing
| mechanism accommodate unforseen data, it is imperative that a
| repository mediator afford an application a stable projection of
| unpredicatable content. If a CLOS application is to rely on class
| and generic function definitions to behave as intended, they must be
| bound to data as it appears, `de.setf.resource` serves this goal in
| several ways:
|
| - it implements instance identity within a given mediation interface
|   according to subject URI
|
| - it provides for automatic unique instance URI generation within a
|   transactional context
|
| - it treats symbols, universal names, and URI as equivalent
|
| - it accepts resources descriptions without nominal type indications,
|   reconciles them to the know class structure and admits additional
|   prototypical attributes.
|
| + instance identity, indexing, and caching
|   Each repository mediator adopts the respective repository's interened
|   URI 'nodes' as unifying identifiers to ensure a one-to-one relation
|   between identified objects and external resources. The URI serve as
|   keys in an hash table which is used in query operations to yield
|   identical instances for equivalent URI. The cache is not held weak,
|   as the repository's URI designator-to-node cache is itself static.
`----
:SOURCE de.setf.resource/resource-class.lisp

FWIW, with specific consideration to the future implementation details of VG2, I think approaching the semantics of SPOGI triples with an RDF-centric lens can only hamstring efforts b/c:

a) The RDF model mostly mimics much pre-existing Lisp-based kb/semantic-net/AI work, so layering the RDF model on top of Lisp is not unlike using the C programming language to implement CLISP and then using CLISP to implement the C programming language in Lisp...

b) Working in the RDF model requires constant string wrangling. This places a significant burden on Lisp to map the brain-dead syntaxes/semantics of curly-brace-derived languages over Lisp's. IOW, Lisp-2s haven't directly conflated symbols with strings since MacLisp days...

c) The RDF model generally seems to place more focus on the role of semantics around distribution of knowledge as a resource, and less on the semantics of reasoning and deduction about the knowledge comprising a resource.

This being said, I'm not knocking RDF, its stated goals, or its utility. Nor do I wish to cast aspersions on the real-world concerns that warrant an eventual focus on integrating with RDF as an attractive and laudable goal to promote for VG2 -- if only b/c "thars gold in dem hills..." and in general Lisp bums deserve more gold!
I just personally hope that focusing on "how RDF does it" is not an immediate primary concern :)

> Dan

/s_P\

From monkey at sandpframing.com  Thu Sep 15 01:48:03 2011
From: monkey at sandpframing.com (MON KEY)
Date: Wed, 14 Sep 2011 21:48:03 -0400
Subject: [vivace-graph-devel] elephant
In-Reply-To: <4E710819.6090107@chatsubo.net>
References: <4E6F81BF.5060001@chatsubo.net> <4E710819.6090107@chatsubo.net>
Message-ID:

Hi Kevin,

Thanks for your detailed response.

>> Naively, I would expect the MOP stuff to be a factor. Is it?
> Yes, who needs the MOP in this circumstance?

No clue, I'm not suggesting it is needed :)

> I would prefer triples to be stored in memory as vectors (using the args to
> defstruct to force the use of vector storage) for the sake of efficiency.

OK. Yes, you can mark them read-only too. Also, on SBCL there may be some potential gains to be had with `sb-ext:freeze-type'.

> Think about what a triple really is:
> each S, P or O is a completely unique thing in the database.

I'm not entirely comfortable with that assertion. To the extent with which it is currently so, I'm not convinced it can't be otherwise (e.g. with namespaces).
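A minimal sketch of the vector-storage idea quoted above, using DEFSTRUCT's standard :TYPE option plus read-only slots (the TRIPLE name and slot layout here are hypothetical, not the current VG2 representation):

```lisp
;; :TYPE VECTOR makes instances plain vectors rather than structure
;; objects; :READ-ONLY T suppresses the SETF accessors, as suggested.
(defstruct (triple (:type vector))
  (subject   nil :read-only t)
  (predicate nil :read-only t)
  (object    nil :read-only t))

;; MAKE-TRIPLE now returns a plain vector, and the accessors are
;; effectively AREF/SVREF calls:
(let ((tr (make-triple :subject 'kevin :predicate 'loves :object 'pizza)))
  (list (vectorp tr) (triple-predicate tr)))
;; => (T LOVES)

;; On SBCL, a named structure type that will never be redefined or
;; subclassed can additionally be frozen, speeding up type checks:
#+sbcl
(progn
  (defstruct node name)
  (sb-ext:freeze-type node))
```

Note that freezing only applies to named structure types; a :TYPE VECTOR struct like TRIPLE above is just a vector and carries no type of its own to freeze.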
I would think the S, P, O are only completely unique within some context:

(let ((context 'outer))
  (let ((s "FOO")
        (p "IS-A")
        (o "BAR"))
    (print context)
    (print (list s p o))
    (let ((s "FOO")
          (p "IS-NOT-A")
          (o "BAR")
          (context 'inner)
          (new-foo '())
          (new-bar '()))
      (setf new-foo s
            new-bar o)
      (print context)
      (flet ((test-var-mk3 (test var init)
               (apply test (list var (make-array 3
                                                 :element-type 'character
                                                 :initial-contents init)))))
        (print `((,s ,(test-var-mk3 'eq s new-foo)
                     ,(test-var-mk3 'eql s new-foo)
                     ,(test-var-mk3 'equal s new-foo))
                 (,o ,(test-var-mk3 'eq o new-bar)
                     ,(test-var-mk3 'eql o new-bar)
                     ,(test-var-mk3 'equal o new-bar)))))
      (print (list s p o))
      (setf s nil p nil o nil)
      (print (list s p o)))
    (print context)
    (print (list s p o))
    (values)))

> Because nodes don't repeat and are atomic symbols, traversing the tree
> would be a linear search at each level.

OK. Thank you for this explanation. I think I have been conflating the demands of triple indexing with the demands of dereferencing triples/graphs from the persistent store.

This said, I'm not at all comfortable with the current explanation w/r/t the semiotics around uniqueness and the atomicity of symbols vs strings, and I am assuming that there _must_ be some level of indirection between the objects denoted by the S, P, and O and the objects which identify these denoted objects. IOW, I'm assuming there are some over-simplifications around the whole sign/signifier/signified thang, and that this is all well-trodden territory for you and you're simply sparing us the ugly details :)

> In a b-trie (http://www.naskitis.com/naskitis-vldbj09.pdf), you could
> effectively string the S, P and O together and create more reasonably branching
> search paths. For example, to index the two triples mentioned above with a
> third, (Kevin loves pizza), you would have a tree like:

I'm still reading this paper, although as yet I am completely failing to understand how b-tries might easily accommodate namespace/context.
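On namespaced uniqueness: v3/v5 UUIDs make the context explicit by minting the id from a namespace UUID plus a name, so the same name in two contexts yields two distinct, but reproducible, identities. A sketch assuming the `uuid' library's make-v4-uuid/make-v5-uuid (Unicly offers analogous operators):

```lisp
(ql:quickload :uuid)

(let ((outer (uuid:make-v4-uuid))   ; two anonymous namespace "contexts"
      (inner (uuid:make-v4-uuid)))
  (list
   ;; same name, different contexts => different ids
   (string= (princ-to-string (uuid:make-v5-uuid outer "FOO"))
            (princ-to-string (uuid:make-v5-uuid inner "FOO")))
   ;; same name, same context => the same id on every minting
   (string= (princ-to-string (uuid:make-v5-uuid outer "FOO"))
            (princ-to-string (uuid:make-v5-uuid outer "FOO")))))
;; => (NIL T)
```

The reproducibility in the second case is what an anonymous v4 UUID can never give you, and is the only real argument for paying the v5 hashing overhead.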
>> Regardless, I personally have a strong desire to keep integration with
>> external (read non-lispy) tools to a minimum.
> YES!

Great! This is really my #1 concern and interest, and I should reiterate that next to a functional persistent VG2 all other details are secondary :)

--
/s_P\

From danlentz at gmail.com  Thu Sep 15 12:30:24 2011
From: danlentz at gmail.com (Dan Lentz)
Date: Thu, 15 Sep 2011 08:30:24 -0400
Subject: [vivace-graph-devel] nodes, fixnums/upper bounds, and multi-constituent indices
In-Reply-To:
References:
Message-ID:

> > Another topic I have been looking at related to the indexing and
> > uuid's is the representation (reification?) of nodes, or lack
> > thereof.
> >
> > One difference in vivace graph versus other tstores I've played with
> > is the ability to reference nodes as first class "things".
>
> Maybe because they resolve to first class Lisp objects and don't
> resort to mediating the inferior objects spat out by lesser non-lispy
> sources :)
>
> No doubt this will eventually change once the VG2
> transaction/persistence/indexing stuff is better established
> (hopefully sooner than later).

Ok, actually I worked out a pleasant node/namespace/package/symbol mapping automation last evening based on the "graph-words" which I'm using as the canonical node representation, with lambdas (fdefinitions) and symbol-values to dereference and map back and forth. It's simply housekeeping, but it's convenient, even more so than the "bang reader" macro, so if there is any interest I'd be happy to post a more thorough description or code snippet.

I like the graph-words convention and looked a little bit into combining the above with ContextL, which I believe could be used to nice effect in order to provide a conveniently namespace-aware symbol mapping like the above, but with symbols mapped based on dynamic context established with something like a "with-graphs" macro.
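For what it's worth, here is one package-based approximation of that idea, without ContextL. Every name in it (WITH-GRAPH, GRAPH-WORD, *GRAPH-PACKAGE*) is invented for illustration and is not VG2 (or graph-words) API:

```lisp
(defvar *graph-package* (find-package :cl-user)
  "Package whose symbols act as the current graph's node namespace.")

(defmacro with-graph ((name) &body body)
  "Run BODY with *GRAPH-PACKAGE* bound to the package named NAME,
creating the package on first use."
  `(let ((*graph-package* (or (find-package ,name)
                              (make-package ,name :use nil))))
     ,@body))

(defun graph-word (string)
  "Intern STRING as a node symbol in the current graph's namespace."
  (intern (string-upcase string) *graph-package*))

;; The same word names distinct node symbols in distinct graphs:
(eq (with-graph ("GRAPH-A") (graph-word "kevin"))
    (with-graph ("GRAPH-B") (graph-word "kevin")))
;; => NIL
```

As noted below for ContextL, the packages here are static; only which one is current is dynamic, via the special variable binding.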
ContextL is pretty fast at switching between these dynamic symbol mappings, although (important note) the packages (namespaces) themselves are static. The "contents" (symbols) are dynamic.

> > part 2
> >
> > This sort of blends into another indexing-model question, related to
> > the current model which is based on a hierarchical index structure?
> > Couldn't additional speed be achieved though multi-constituent
> > indexing? IE and SP index, PO, index etc in which multiple nodes of
> > a single triple are hashed in the aggregate to allow for direct
> > lookup.
>
> Having spent some more time looking at the linear-hashing papers Kevin
> provided I'm have trouble see this as an either/or situation,
> e.g. underneath its all gonna eventually wind-up as hash-tables
> arrays, integers and de-referenced bucket/node/leaf/offset pointers :)
>
> I'm interested to learn from Kevin how much of the linear-hashing
> scheme he believes is already "built-in" to the existing VG-2
> code-base.
>
> In particular, if most of the footwork for the linear-hashing work is
> already in place?
>
> And, if not whether there is some drop-in data-structure capable of
> implementing the linear-hashing scheme he envisions.
>
> And, if not what does he anticipate is required to implement a
> functional linear-hashing scheme as he envisions.

So, in effect, multi-constituent indexing should be easy to tack on later down the road?

-------------- next part --------------
An HTML attachment was scrubbed...
URL:

From danlentz at gmail.com  Thu Sep 15 19:48:15 2011
From: danlentz at gmail.com (Dan Lentz)
Date: Thu, 15 Sep 2011 15:48:15 -0400
Subject: [vivace-graph-devel] transactions on top of triples, scalability, and another storage alternative
Message-ID:

On the subject of billions of triples, one thing that comes to mind is that true scalability comes from the ability to operate in a federated model, on a possibly distributed store.
This requires a transaction model that operates in a shared, multi-user scenario. One way to implement such a thing is *on top* of triples. Do we have interest in this? I have in mind de.setf.resource (do I sound like a broken record?), which defines such a methodology and implements it in such a way as to abstract over the differences between single-repo and distributed-repo models.

By the way, as far as distributed stores go, REDIS comes to mind as a far better alternative to cassandra. Now, of course, this does introduce a non-lisp component...

...BUT it provides near-infinite scalability and the capability of both remote and local storage configurations. E.g., press a button, deploy a simple REDIS server to EC2, and have near-infinitely scalable graph storage with all the benefits of hosted EC2. I've done some work with this and have found REDIS to be very good to work with via cl-redis. The downside is some per-query latency and a non-lispy backend. The upsides are many, and also include bonuses such as pubsub queues and sorted sets, upon which it is easy to build many other structures out of triples, which, in turn, makes VG more broadly useful and the graph model an easier foundation to build on.

You may now resume your regularly scheduled booing and hissing :)

-------------- next part --------------
An HTML attachment was scrubbed...
URL:
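As a concrete taste of the cl-redis route: a triple can be stored as a Redis hash keyed by its id, with set-valued secondary indexes per constituent. The key scheme below ("triple:<id>", "idx:s:<subject>") is invented for illustration, and a Redis server is assumed on localhost:

```lisp
(ql:quickload :cl-redis)

(redis:with-connection ()
  ;; one triple, stored as a hash keyed by a (here hard-wired) id
  (red:hset "triple:1" "s" "Kevin")
  (red:hset "triple:1" "p" "loves")
  (red:hset "triple:1" "o" "pizza")
  ;; constituent indexes: sets of triple ids per subject / predicate
  (red:sadd "idx:s:Kevin" "triple:1")
  (red:sadd "idx:p:loves" "triple:1")
  ;; direct lookup of all triple ids with subject "Kevin"
  (red:smembers "idx:s:Kevin"))
```

Multi-constituent indexes of the SP/PO flavor discussed earlier fall out of the same pattern by keying sets on concatenated constituents (e.g. "idx:sp:Kevin:loves"), at the cost of one round-trip per index touched.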