[elephant-devel] one experiential data point

Andrew Philpot philpot at ISI.EDU
Sat Jan 14 20:09:35 UTC 2006


Richard asked if I was still using Elephant.  I thought I would send a
quick note describing my experience using Elephant to persistently
store a lisp-based ontology with multi-lingual lexical attachments.

The "native" lisp implementation is pretty simple.  Essentially, it is
a semantic network implemented in Lisp using symbols for nodes and
using property lists for (binary) relations.  One additional wrinkle
is that a given string might name more than one ontological element,
most typically a concept and a possibly related, possibly unrelated
English lexical item ("word").  Accordingly, symbols are interned in
different CL packages according to their linguistic/ontological role
(concept, lexical item, sense, and a few others); all packages
containing a given fragment of the ontology are named per a convention
allowing modular mixing and matching.

A small schematic example:

FRAG1-CONCEPT::|bank<side|
  :DEFINITION "side of a road or river"
  :HAS-SENSE (FRAG1-SENSE::|bank<side:EN:bank|
              FRAG1-SENSE::|bank<side:ES:banco|)
  :SUPER FRAG1-CONCEPT::|side<part|
  :SUB FRAG1-INSTANCE::|Left Bank|
  :SOURCE :WORDNET
  :PART-OF-SPEECH :NOUN
  :HAS-SUBJECT-DOMAIN DOM-CONCEPT::|engineering|

FRAG1-SENSE::|bank<side:EN:bank|
  :HAS-CONCEPT FRAG1-CONCEPT::|bank<side|
  :HAS-LEXICAL-ITEM FRAG1-EN::|bank|
  :SOURCE :WORDNET
  :DERIVATION FRAG1-SENSE::|banked<modified:EN:banked|

FRAG1-EN::|bank|
  :HAS-SENSE FRAG1-SENSE::|bank<side:EN:bank|
  :LANGUAGE :EN
  :SOURCE :WORDNET

etc. etc.  All in all there are roughly 600-700K of these named
objects in my core ontology base, with millions more instances linked
in below as well as other kinds of ancillary attachments.
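For readers who have not seen this style, here is a minimal sketch of
the in-core representation described above.  The helpers NODE and
RELATE are hypothetical names chosen for illustration, and the two
packages are stand-ins following the FRAG1-* naming convention; none
of this is the actual ontology code.

```lisp
;; Hypothetical packages following the FRAG1-<role> convention.
(defpackage :frag1-concept (:use))
(defpackage :frag1-sense (:use))

(defun node (name package)
  "Intern NAME in PACKAGE and return the symbol that serves as the node."
  (values (intern name package)))

(defun relate (node relation value)
  "Record a binary relation on NODE's property list."
  (setf (get node relation) value))

(let ((concept (node "bank<side" :frag1-concept))
      (sense   (node "bank<side:EN:bank" :frag1-sense)))
  (relate concept :definition "side of a road or river")
  (relate concept :has-sense (list sense))
  (relate sense :has-concept concept))

;; Following a relation is just a property-list lookup.
(get (find-symbol "bank<side" :frag1-concept) :definition)
```

Because the same string can be interned separately in each package,
the concept and the lexical item named "bank" never collide.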

The above model, using symbols for objects and property lists for
relations, is neither fancy nor "modern", but it's very convenient and
relatively lightweight for an in-core implementation.  However,
interned symbols don't get GC'd even when no one cares about them
anymore, so the size of the Lisp address space limits what I can do.
Hence persistence a la Elephant.

My current model for implementing this in Elephant:

A. Objects are 3-lists which serve as keys.  FRAG1-EN::|bank| becomes
(:frag1 :en "bank").  These are algebraic in the sense that I can
easily construct them and view them as a composition of smaller bits;
there is nothing that can't be GC'd if needed (and the cardinalities
of namespaces and language keywords are both expected to be small).
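The mapping from a symbol to its 3-list key could be sketched as
follows.  SYMBOL->KEY and the KEY-* accessors are hypothetical names,
and the sketch assumes every package name has the form
<fragment>-<role>, split at the last hyphen.

```lisp
;; Hypothetical package following the FRAG1-<role> convention.
(defpackage :frag1-en (:use))

(defun symbol->key (symbol)
  "Turn e.g. FRAG1-EN::|bank| into (:FRAG1 :EN \"bank\").
Assumes the home package is named <fragment>-<role>."
  (let* ((pkg  (package-name (symbol-package symbol)))
         (dash (position #\- pkg :from-end t)))
    (assert dash () "Package name ~A has no hyphen" pkg)
    (list (intern (subseq pkg 0 dash) :keyword)
          (intern (subseq pkg (1+ dash)) :keyword)
          (symbol-name symbol))))

;; The "smaller bits" of a key.
(defun key-fragment (key) (first key))
(defun key-role     (key) (second key))
(defun key-name     (key) (third key))

(symbol->key (intern "bank" :frag1-en))
```

Since the keys are freshly consed lists, two keys for the same object
are EQUAL but not EQL, so any in-memory table keyed on them needs an
EQUAL test.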

B. Relation networks are hash tables (EQL) keyed by the relation name.

So the first one above is 

;; ADD-TO-ROOT is the Elephant call that stores a value under a key.
(add-to-root '(:frag1 :concept "bank<side")
   (let ((ht (make-hash-table :test #'eql)))
     (setf (gethash :DEFINITION ht) "side of a road or river")
     (setf (gethash :HAS-SENSE ht) 
       (list '(:FRAG1 :SENSE "bank<side:EN:bank")
             '(:FRAG1 :SENSE "bank<side:ES:banco")))
     (setf (gethash :SUPER ht) '(:FRAG1 :CONCEPT "side<part"))
     (setf (gethash :SUB ht) '(:FRAG1 :INSTANCE "Left Bank"))
     (setf (gethash :SOURCE ht) :WORDNET)
     (setf (gethash :PART-OF-SPEECH ht) :NOUN)
     (setf (gethash :HAS-SUBJECT-DOMAIN ht) '(:DOM :CONCEPT "engineering"))
     ;; Return the table itself; otherwise LET yields the value of the
     ;; last SETF and that list would be stored instead.
     ht))
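The repeated SETF forms can be folded into a small constructor.
MAKE-RELATIONS is a hypothetical helper, not part of Elephant; it only
builds the EQL hash table, which would then be passed as the value to
store.

```lisp
(defun make-relations (&rest plist)
  "Build an EQL hash table from alternating relation names and values."
  (let ((ht (make-hash-table :test #'eql)))
    (loop for (relation value) on plist by #'cddr
          do (setf (gethash relation ht) value))
    ht))

;; The same relations as above, abbreviated.
(defparameter *bank-side*
  (make-relations
   :definition "side of a road or river"
   :super '(:frag1 :concept "side<part")
   :source :wordnet))

(gethash :source *bank-side*)
```

This keeps the property-list look of the schematic example while
producing the hash table form.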

I'm reasonably happy with this implementation, but it does have some
shortcomings:

1. Much slower than I expected.  To load the above 600K entities into
   Lisp takes about 10 minutes; to load into Elephant (presumably a
   one-time operation) takes about a day!  Maybe transactions, or wiser
   use of cursors or store controllers or what have you, are called
   for, but I didn't expect this.

2. Takes a lot of space.  Roughly 350 MB of Lisp source files turn
   into 3.6 GB of disk space.  Again, maybe I'm doing something
   terribly wrong.
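On point 1, one thing to try is grouping many writes into each
transaction instead of committing one entity at a time.  The sketch
below shows only the batching part in plain Common Lisp;
MAP-IN-BATCHES is a hypothetical helper.  Elephant does provide a
WITH-TRANSACTION macro that the per-batch function could use, but
whether batching accounts for the load time here is an untested
assumption.

```lisp
(defun map-in-batches (function items batch-size)
  "Call FUNCTION on successive sublists of ITEMS, each at most
BATCH-SIZE long.  Returns the number of batches processed."
  (check-type batch-size (integer 1))
  (loop with batches = 0
        while items
        do (let ((batch (loop repeat batch-size
                              while items
                              collect (pop items))))
             (funcall function batch)
             (incf batches))
        finally (return batches)))

;; Stand-in for the real work: record the size of each batch.  With
;; Elephant, the lambda body would store every entry of BATCH inside
;; one transaction.
(defparameter *batch-sizes* '())

(map-in-batches (lambda (batch) (push (length batch) *batch-sizes*))
                '(a b c d e)
                2)
```

The batch size would need tuning by experiment; nothing here says what
a good value is.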

I don't have data at this point about access times.  I hope they don't
reflect the load time.

I'm still going forward on this.  At this point, Elephant is a
long-shot contender against a home-grown RDBMS serialization backed by
MySQL, which works but makes data updates laborious and installation a
chore as well.  To replicate the functionality of my other
implementations, I will need some auxiliary indexing, in particular so
I can ask for all objects whose names match a substring; and I will
need some mechanism to handle relation inverses -- I want to say
(:super A B) and be able to ask (:sub B A).
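A minimal sketch of both requirements, using an in-memory EQUAL hash
table as a stand-in for the persistent store.  All names here (*STORE*,
*INVERSES*, ADD-RELATION, ASK, KEYS-MATCHING) are hypothetical, every
relation value is kept as a list of keys for simplicity, and the
substring search is a linear scan rather than a real index.

```lisp
(defvar *store* (make-hash-table :test #'equal)
  "Stand-in for the persistent root: 3-list key -> relation table.")

(defvar *inverses* '((:super . :sub) (:sub . :super))
  "Relation names paired with their inverses.")

(defun relations-of (key)
  "Return the relation table for KEY, creating it if needed."
  (or (gethash key *store*)
      (setf (gethash key *store*) (make-hash-table :test #'eql))))

(defun add-relation (relation from to)
  "Record (RELATION FROM TO) and, if RELATION has an inverse,
also record (INVERSE TO FROM)."
  (pushnew to (gethash relation (relations-of from)) :test #'equal)
  (let ((inverse (cdr (assoc relation *inverses*))))
    (when inverse
      (pushnew from (gethash inverse (relations-of to)) :test #'equal))))

(defun ask (relation from)
  "Return the list of keys related to FROM by RELATION."
  (let ((table (gethash from *store*)))
    (and table (values (gethash relation table)))))

(defun keys-matching (substring)
  "Return all keys whose name contains SUBSTRING, ignoring case."
  (loop for key being the hash-keys of *store*
        when (search substring (third key) :test #'char-equal)
          collect key))

;; Say (:super A B), then ask (:sub B).
(add-relation :super
              '(:frag1 :concept "bank<side")
              '(:frag1 :concept "side<part"))
(ask :sub '(:frag1 :concept "side<part"))
```

Writing the inverse at insertion time doubles the writes but keeps
lookups in either direction a single table access.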

It seems that a CLOS-based model might address some of the issues I
just mentioned.  I am kind of worried that it would be even slower.
And of course, I've had some troubles with the MOP.  I'm not ruling
this out.

Any suggestions anyone might have would be welcome.  When I have a
web-based demo, I will also circulate it here.

Thanks,
Andrew Philpot




