[vivace-graph-devel] nodes, fixnums/upper bounds, and multi-constituent indices

Wed Sep 14 20:35:40 UTC 2011

Hi Dan,

On Wed, Sep 14, 2011 at 11:25 AM, Dan Lentz <danlentz at gmail.com> wrote:
> I am still reading though all the homework recommended in recent posts :)

Me too :)

> Really good stuff.

I'm learning a lot as well.

> I hope my questions are not a distraction from the important topics
> at hand but just contribute toward general discussion and (at least
> my) understanding of the project, its goals, how I can utilize VG
> and perhaps, in some way, try to contribute to the effort, if
> possible.

I share many of the same questions and appreciate not having to ask
them myself. Also, I've found it extremely useful to have recourse to
the dialogues, discussions, questions, and answers on other archived
common-lisp.net mailing lists, esp. for projects whose specs/API were
finalized years ago and whose designers have since moved on, stopped
active development, or are in maintenance-only mode (Rucksack comes
immediately to mind).

> Another topic I have been looking at related to the indexing and
> uuid's is the representation (reification?) of nodes, or lack
> thereof.
>
> One difference in vivace graph versus other tstores I've played with
> is the ability to reference nodes as first class "things".

Maybe because they resolve to first class Lisp objects and don't
resort to mediating the inferior objects spat out by lesser non-lispy
sources :)

No doubt this will eventually change once the VG2
transaction/persistence/indexing stuff is better established
(hopefully sooner than later).

> Another topic I have been looking at related to the indexing and uuid's is
> the representation (reification?) of nodes, or lack thereof.

VG2 is Kevin's baby and he's the boss, so I hope I'm not stepping on
toes by interjecting.

My impression is that implementations approaching SPOG triples tend to
have some hard-wired implicit assumptions about the operational
semantics of SPOG and that these assumptions are likely to yield a
relatively constant subset of basic operations over the triples
regardless of implementation. Which is to say, the basic idiom for how
one might perform these operations is established (independent of
whether VG currently implements them or not).

Where VG2 might differ or deviate from other implementations is not w/r/t
SPOG but rather SPOGI e.g. triple-id (and by proxy triple-indexes).

> In VG, this is not the case?  Triples are (currently) represented by time
> based uuid as previously discussed, and nodes themselves are not hashed and
> indexed.  Maybe this is going to change naturally in the course of moving to
> v5 uuid?

IMHO it is not a given that a change to a namespacing UUID would
necessarily change the existing VG2 assumptions.

The role of UUIDs (potential and current) in VG2 is multi-faceted:

 v1 UUIDs are slow and if you don't require their time-stamping then a
 v4 UUID is a better solution if all that is really required is an
 anonymous but reasonably unique ID.

 There is no immediate gain to be had by using a namespaced (v3 or v5)
 UUID instead of an anonymous v4 UUID. In fact, there would be a loss
 in performance b/c there is more overhead associated with the minting
 of v3/v5 UUIDs.

 If you assume that any SPOGI implementation must concern itself with
 "namespacing" then there _may_ be some gain in using v3/v5 UUIDs
 instead of v4 UUIDs. Whether this is the case depends on how the
 system implements:

  - triple-id
    Whether the base identity is a string, integer, class-instance, etc.

  - triple-id-resolution
    How the _base_ representation of triple identity resolves to
    intermediate and higher-level representations

  - triple-indexing
    How base triple identity is indexed and how the indexed identities
    are resolved with their intermediate and higher-level
    representations

  - triple-persistence
    Whether the system can/should preserve state across sessions.

    If preserving state is a goal then the degree to which the
    data-structures employed for triple-indexing are in-memory bound
    or require disk-i/o becomes a factor.

    If the system can remain performant with only an in-memory
    footprint then the majority of persistence issues are moot.

  - triple-performance

    What is a reasonable upper bound on the number of triples the
    system should expect to handle?

    Should the system handle networked/distributed/concurrent access?

    How will implementations of networked/distributed/concurrent
    triple access scale?

Obv. there are interdependencies among the set of considerations
outlined above.

>part 2
>
> This sort of blends into another indexing-model question, related to
> the current model which is based on a hierarchical index structure?
> Couldn't additional speed be achieved though multi-constituent
> indexing?  IE and SP index, PO, index etc in which multiple nodes of
> a single triple are hashed in the aggregate to allow for direct
> lookup.

Having spent some more time looking at the linear-hashing papers Kevin
provided I'm having trouble seeing this as an either/or situation,
e.g. underneath it's all gonna eventually wind up as hash-tables,
arrays, integers and de-referenced bucket/node/leaf/offset pointers :)
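
For what it's worth, the difference between the two lookup styles can
be sketched in a few lines. This is a toy in-memory model in Python,
not VG code; the TripleIndex class and its method names are invented
for the illustration:

```python
from collections import defaultdict

class TripleIndex:
    """Toy store contrasting nested and composite-key lookups."""

    def __init__(self):
        # Hierarchical: subject -> predicate -> set of objects.
        self.spo = defaultdict(lambda: defaultdict(set))
        # Multi-constituent: the aggregate of two nodes is the key.
        self.sp = defaultdict(set)   # (subject, predicate) -> objects
        self.po = defaultdict(set)   # (predicate, object)  -> subjects

    def add(self, s, p, o):
        self.spo[s][p].add(o)
        self.sp[(s, p)].add(o)
        self.po[(p, o)].add(s)

    def objects_nested(self, s, p):
        # Two hash probes, one per level; .get avoids creating entries.
        return set(self.spo.get(s, {}).get(p, ()))

    def objects_composite(self, s, p):
        # One hash probe on the combined key.
        return set(self.sp.get((s, p), ()))

    def subjects(self, p, o):
        # Direct lookup that the S-rooted hierarchy cannot answer
        # without scanning every subject.
        return set(self.po.get((p, o), ()))

idx = TripleIndex()
idx.add("kevin", "maintains", "vg2")
idx.add("dan", "asks-about", "vg2")
print(idx.objects_nested("kevin", "maintains"))
print(idx.objects_composite("kevin", "maintains"))
print(idx.subjects("asks-about", "vg2"))
```

Both styles bottom out in the same hash tables; the composite keys
mostly trade extra memory and write work for fewer probes and for
lookups that do not start from the subject.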

I'm interested to learn from Kevin how much of the linear-hashing
scheme he believes is already "built-in" to the existing VG-2
code-base.

In particular, is most of the footwork for the linear-hashing work
already in place?

And, if not, whether there is some drop-in data-structure capable of
implementing the linear-hashing scheme he envisions.

And, if not, what does he anticipate is required to implement a
functional linear-hashing scheme as he envisions it?
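
For readers who have not gone through the papers yet, here is a
minimal in-memory sketch of textbook linear hashing in Python. It is
only my understanding of the general technique, not Kevin's design
and not VG2 code; the class name, the list-based buckets, and the
0.75 load threshold are all assumptions made for the example:

```python
class LinearHashTable:
    """Textbook linear hashing: grow by splitting one bucket at a time."""

    def __init__(self, initial_buckets=2, max_load=0.75):
        self.n0 = initial_buckets      # bucket count at level 0
        self.level = 0                 # completed doubling rounds
        self.split = 0                 # next bucket due to be split
        self.max_load = max_load       # assumed threshold, tune to taste
        self.count = 0
        self.buckets = [[] for _ in range(initial_buckets)]

    def _address(self, key):
        h = hash(key)
        idx = h % (self.n0 << self.level)
        if idx < self.split:
            # This bucket was already split in the current round,
            # so use the next level's (doubled) modulus.
            idx = h % (self.n0 << (self.level + 1))
        return idx

    def get(self, key, default=None):
        for k, v in self.buckets[self._address(key)]:
            if k == key:
                return v
        return default

    def put(self, key, value):
        bucket = self.buckets[self._address(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1
        if self.count / len(self.buckets) > self.max_load:
            self._split_one()

    def _split_one(self):
        # Split the bucket at the split pointer, whichever one overflowed.
        old = self.buckets[self.split]
        self.buckets[self.split] = []
        self.buckets.append([])
        self.split += 1
        modulus = self.n0 << (self.level + 1)
        for k, v in old:
            self.buckets[hash(k) % modulus].append((k, v))
        if self.split == self.n0 << self.level:
            # Every bucket of this round has been split: start the next.
            self.level += 1
            self.split = 0

table = LinearHashTable()
for i in range(1000):
    table.put(("s%d" % i, "p"), i)
print(len(table.buckets), table.level, table.split)
```

The point of the scheme is that the table never rehashes everything
at once: each overflow moves the contents of exactly one bucket,
which is what makes it attractive for a disk-backed index.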

> As a concrete example -- in case my question has been as clear as
> mud :) -- i'd cite the cassandra-spoc-index-mediator of
> de.setf.resource, which leverages multi-constituent indexes
> extensively.

My impression is that de.setf.resource has taken this approach b/c it
is in large part a meta-library for CLOS<->RDF compatibility and the
underlying constraints required to accommodate RDF require it.

My read on the quoted section below is that Anderson's quotation marks
around "open-world" are meant as a mild slight on the RDF fanboys at
W3C, insomuch as (of itself) RDF is not capable of reasoning in
either closed or open-world contexts. Regardless, the following quote
also provides some indication of how/why Anderson has made use of UUID
w/r/t external resources, namely that the need for unique identities is
as much a function of preserving transactional context as it is one of
maintaining mappings of object identity equivalence.

,----
|  Persistence Mediation
|
|  Despite the RDF "open-world" paradigm, which requires a processing
|  mechanism accommodate unforseen data, it is imperative that a
|  repository mediator afford an application a stable projection of
|  unpredicatable content.  If a CLOS application is to rely on class
|  and generic function definitions to behave as intended, they must be
|  bound to data as it appears, `de.setf.resource` serves this goal in
|  several ways:
|  - it implements instance identity within a given mediation interface
|    according to subject URI
|
|  - it provides for automatic unique instance URI generation within a
|    transactional context
|
|  - it treats symbols, universal names, and URI as equivalent
|
|  - it accepts resources descriptions without nominal type indications,
|  reconciles them to the know class structure and admits additional
|  prototypical attributes.
|
|  + instance identity, indexing, and caching
|  Each repository mediator adopts the respective repository's interened
|  URI 'nodes' as unifying identifiers to ensure a one-to-one relation
|  between identified objects and external resources. The URI serve as
|  keys in an hash table which is used in query operations to yield
|  identical instances for equivalent URI.  The cache is not held weak,
|  as the repository's URI designator-to-node cache is itself static.
|
`---- :SOURCE de.setf.resource/resource-class.lisp
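
The "instance identity, indexing, and caching" item above reads to me
like a plain identity map keyed on interned URIs. A rough Python
sketch of that idea follows; it is not how de.setf.resource is
actually written, and the Resource and Mediator classes and the
example.org URIs are hypothetical:

```python
import sys

class Resource:
    """Hypothetical stand-in for an instance bound to a subject URI."""

    def __init__(self, uri):
        self.uri = uri
        self.attributes = {}

class Mediator:
    """Hypothetical mediator: at most one instance per URI."""

    def __init__(self):
        # A plain dict holds its entries strongly, i.e. not a weak cache.
        self._instances = {}

    def find(self, uri):
        key = sys.intern(uri)  # equal URI strings collapse to one object
        instance = self._instances.get(key)
        if instance is None:
            instance = Resource(key)
            self._instances[key] = instance
        return instance

m = Mediator()
first = m.find("http://example.org/node/1")
second = m.find("http://example.org/" + "node/1")
print(first is second)  # equivalent URIs yield the identical instance
```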

FWIW with specific consideration to the future implementation details
of VG2 I think approaching the semantics of SPOGI triples with an
RDF-centric lens can only hamstring efforts b/c:

 a) The RDF model is mostly mimicking much pre-existing Lisp based
    kb/semantic-net/AI work so layering the RDF model on top of lisp
    is not unlike using the C programming language to implement Clisp
    and then using Clisp to implement the C programming language in
    Lisp...

 b) Working in the RDF model requires constant string wrangling.
    This places a significant burden on Lisp to map the brain-dead
    syntaxes/semantics of curly brace derived languages over Lisp's

    IOW Lisp-2's haven't directly conflated symbols with strings since
    MacLisp days...

 c) The RDF model generally seems to place more focus on the role of
    semantics around distribution of knowledge as a resource and less
    on the role of semantics of reasoning and deduction about the
    knowledge comprising a resource.

This being said, I'm not knocking RDF, its stated goals, or its utility.
Nor do I wish to cast aspersions on the real-world concerns that
warrant an eventual focus on integrating with RDF as an attractive and
laudable goal to promote for VG2 -- if only b/c "thars gold in dem
hills..." and in general Lisp bums deserve more gold!

I just personally hope that focusing on "how RDF does it" is not an
immediate primary concern :)

> Dan

/s_P\