[elephant-devel] one experiential data point

Sat Jan 14 20:45:54 UTC 2006

Thanks, Andrew, this is very interesting.

I would mention with respect to speed:

1)  BerkeleyDB is much faster than the SQL implementations right now,
2)  My base64-based serialization for the SQL backend could obviously 
be improved tremendously.

Improving my serialization would cut down on disk space and time in the
SQL back-end version a great deal.  Even the BerkeleyDB serializer could be
improved, although it is much better than what I am doing for the SQL
back end.
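
To put a rough number on the base64 part alone: base64 maps every 3 input
bytes to 4 output characters, so the encoding by itself costs about a third
in space before anything else is counted.  A minimal sketch of that
arithmetic (the function names are made up for illustration and are not
part of Elephant):

```lisp
;; Base64 encodes each group of 3 bytes as 4 characters; a final partial
;; group is padded out to 4 characters as well.
(defun base64-encoded-length (n-bytes)
  "Number of characters needed to base64-encode N-BYTES bytes, with padding."
  (* 4 (ceiling n-bytes 3)))

(defun base64-overhead (n-bytes)
  "Ratio of encoded size to raw size for N-BYTES bytes (N-BYTES must be > 0)."
  (/ (base64-encoded-length n-bytes) n-bytes))
```

This covers only the encoding step; it says nothing about how much of the
total space or time in either back end is due to serialization.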

I hope you will keep us informed as to your experience and your choices;
if you choose not to use Elephant, that will be interesting information as
well.

For the substring searching, are you planning to use a "like" query in
MySQL in your homegrown implementation?  I doubt that Elephant can compete
with that; although it can produce functional indexes that technically have
that power, it would be pretty sophisticated programming to do it.
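
To give an idea of what such an index involves, here is a minimal sketch of
a trigram index in plain Common Lisp.  It is an illustration only: the
in-memory hash table stands in for whatever persistent index one would
actually use, and none of these names (NGRAMS, *NGRAM-INDEX*, INDEX-NAME,
FIND-BY-SUBSTRING) are Elephant functions.

```lisp
(defun ngrams (string &optional (n 3))
  "Return the list of distinct length-N substrings of STRING."
  (let ((result '()))
    (loop for i from 0 to (- (length string) n)
          do (pushnew (subseq string i (+ i n)) result :test #'string=))
    (nreverse result)))

(defvar *ngram-index* (make-hash-table :test #'equal)
  "Hypothetical in-memory index: trigram -> list of names containing it.")

(defun index-name (name)
  "Register NAME under every trigram it contains."
  (dolist (gram (ngrams name))
    (pushnew name (gethash gram *ngram-index*) :test #'string=)))

(defun find-by-substring (query)
  "Return the indexed names that contain QUERY.
Queries shorter than 3 characters have no trigram and return NIL."
  (let ((candidates (gethash (first (ngrams query)) *ngram-index*)))
    ;; The index only narrows the candidate set; SEARCH confirms the match.
    (remove-if-not (lambda (name) (search query name)) candidates)))
```

For example, after indexing "bank", "bank<side" and "banco", the query
"ank" would return "bank" and "bank<side" in some order.  Only the first
trigram of the query is used to pick candidates here; intersecting the
candidate lists of all trigrams would narrow things further at the cost of
more code.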

On Sat, 2006-01-14 at 12:09 -0800, Andrew Philpot wrote:

> Richard asked if I was still using Elephant.  I thought I would send a
> quick note describing my experience using Elephant to persistently
> store a lisp-based ontology with multi-lingual lexical attachments.
> 
> The "native" lisp implementation is pretty simple.  Essentially, it is
> a semantic network implemented in Lisp using symbols for nodes and
> using property lists for (binary) relations.  One additional wrinkle
> is that a given string might name more than one ontological element,
> most typically a concept and a possibly related, possibly unrelated
> English lexical item ("word").  Accordingly, symbols are interned in
> different CL packages according to their linguistic/ontological role
> (concept, lexical item, sense, and a few others); all packages
> containing a given fragment of the ontology are named per a convention
> allowing modular mixing and matching.
> 
> A small schematic example:
> 
> FRAG1-CONCEPT::|bank<side|
>   :DEFINITION "side of a road or river"
>   :HAS-SENSE (FRAG1-SENSE::|bank<side:EN:bank| FRAG1-SENSE::|bank<side:ES:banco|)
>   :SUPER FRAG1-CONCEPT::|side<part|
>   :SUB FRAG1-INSTANCE::|Left Bank|
>   :SOURCE :WORDNET
>   :PART-OF-SPEECH :NOUN
>   :HAS-SUBJECT-DOMAIN DOM-CONCEPT::|engineering|
> 
> FRAG1-SENSE::|bank<side:EN:bank|
>   :HAS-CONCEPT FRAG1-CONCEPT::|bank<side|
>   :HAS-LEXICAL-ITEM FRAG1-EN::|bank|
>   :SOURCE :WORDNET
>   :DERIVATION FRAG1-SENSE::|banked<modified:EN:banked|
> 
> FRAG1-EN::|bank|
>   :HAS-SENSE FRAG1-SENSE::|bank<side:EN:bank|
>   :LANGUAGE :EN
>   :SOURCE :WORDNET
> 
> etc. etc.  All in all there are roughly 600-700K of these named
> objects in my core ontology base, with millions more instances linked
> in below as well as other kinds of ancillary attachments.
> 
> The above model using symbols for objects and property lists for
> relations is not very fancy nor "modern", but it's very convenient and
> relatively lightweight for an in-core implementation.  However,
> interned symbols don't get GC'd even when no one cares about them
> anymore, so the Lisp address size limits what I can do.  Hence
> persistence a la Elephant.
> 
> My current model for implementing this in Elephant:
> 
> A. objects are 3-lists which serve as keys.  FRAG1-EN::|bank| becomes
> (:frag1 :en "bank").  Algebraic in the sense that I can easily
> construct them and view them as a composition of smaller bits; nothing
> that can't be GC'd if needed (and the cardinality of namespaces and
> language keywords are both expected to be small).
> 
> B. Relation networks are hash tables (EQL) keyed by the relation name.
> 
> So the first one above is 
> 
> (add-to-root '(:frag1 :concept "bank<side")
>    (let ((ht (make-hash-table :test #'eql)))
>      (setf (gethash :DEFINITION ht) "side of a road or river")
>      (setf (gethash :HAS-SENSE ht) 
>        (list '(:FRAG1 :SENSE "bank<side:EN:bank")
>              '(:FRAG1 :SENSE "bank<side:ES:banco")))
>      (setf (gethash :SUPER ht) '(:FRAG1 :CONCEPT "side<part"))
>      (setf (gethash :SUB ht) '(:FRAG1 :INSTANCE "Left Bank"))
>      (setf (gethash :SOURCE ht) :WORDNET)
>      (setf (gethash :PART-OF-SPEECH ht) :NOUN)
>      (setf (gethash :HAS-SUBJECT-DOMAIN ht) '(:DOM :CONCEPT "engineering"))))
> 
> I'm reasonably happy with this implementation, but it does have some
> shortcomings:
> 
> 1. Much slower than I expected.  To load the above 600K entities into
>    Lisp takes about 10 minutes; to load into elephant (presumably a
>    one-time operation) takes about a day!  Maybe transactions or wiser
>    use of cursors or store controllers or what have you is called for,
>    but I didn't expect this.
> 
> 2. Takes a lot of space.  Roughly 350 MB of Lisp source files turn
>    into 3.6 GB of disk space.  Again, maybe I'm doing something
>    terribly wrong.
> 
> I don't have data at this point about access times.  I hope they don't
> reflect the load time.
> 
> I'm still going forward on this.  At this point, elephant is a long
> shot contender with a home grown RDBMS serialization back-ended in
> MySQL, which works but makes data updates laborious and installation a
> chore as well.  To replicate the functionality of my other
> implementations, I will need some auxiliary indexing, in particular so
> I can ask for all objects whose names match a substring; and I will
> need some mechanism to handle relation inverses -- I want to say
> (:super A B) and be able to ask (:sub B A).
> 
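
On the relation inverses: one simple approach is to write both directions
at insertion time.  A minimal sketch, assuming a small table of inverse
pairs and using an in-memory hash table as a stand-in for the persistent
store (all of the names below are hypothetical, not Elephant API):

```lisp
(defvar *inverses* '((:super . :sub) (:sub . :super))
  "Hypothetical table pairing each relation with its inverse.")

(defvar *network* (make-hash-table :test #'equal)
  "In-memory stand-in for the store: object key -> property list.")

(defun inverse-of (relation)
  "Return the inverse of RELATION, or NIL if none is declared."
  (cdr (assoc relation *inverses*)))

(defun add-link (relation from to)
  "Record that FROM has RELATION to TO and, when RELATION has a declared
inverse, record the reverse link on TO as well."
  (pushnew to (getf (gethash from *network*) relation) :test #'equal)
  (let ((inverse (inverse-of relation)))
    (when inverse
      (pushnew from (getf (gethash to *network*) inverse) :test #'equal))))

(defun related (relation from)
  "Return the list of objects that FROM has RELATION to."
  (getf (gethash from *network*) relation))
```

With this, (add-link :super A B) makes both (related :super A) and
(related :sub B) answer correctly.  The trade-off is that each link is
stored twice, so deletions and updates have to touch both ends.
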
> It seems that a CLOS-based model might address some of the issues I
> just mentioned.  I am kind of worried that it would be even slower.
> And of course, I've had some troubles with MOP.  I'm not ruling this
> out.
> 
> Any suggestions anyone might have would be welcome.  When I have a
> web-based demo, I will also circulate it here.
> 
> Thanks,
> Andrew Philpot