[elephant-devel] one experiential data point
Robert L. Read
read at robertlread.net
Sat Jan 14 20:45:54 UTC 2006
Thanks, Andrew, this is very interesting.
I would mention with respect to speed:
1) BerkeleyDB is much faster than the SQL implementations right now,
2) My base64-based serialization for the SQL backend could obviously
be improved tremendously.
Improving my serialization would cut down on disk space and time in the
SQL back-end version a great deal. Even the BerkeleyDB serializer could
be improved, although it is much better than what I am doing for the
SQL back end.
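As a rough illustration of the encoding overhead alone: standard padded
base64 emits four characters for every three bytes, so the text form is
about a third larger than the binary it encodes, before any storage
overhead in the database itself. A minimal sketch of that arithmetic
follows; the function names are hypothetical and not part of Elephant.

```lisp
;; Hypothetical helpers, not part of Elephant.
;; Standard padded base64 maps each group of up to 3 octets to 4 characters.
(defun base64-encoded-length (n-bytes)
  "Return the number of characters padded base64 needs for N-BYTES octets."
  (* 4 (ceiling n-bytes 3)))

(defun base64-overhead (n-bytes)
  "Return encoded size divided by raw size, as a rational."
  (/ (base64-encoded-length n-bytes) n-bytes))
```

For example, (base64-encoded-length 10) is 16, and the ratio tends to
4/3 as the input grows.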
I hope you will keep us informed as to your experience and your choices;
if you choose not to use Elephant, that will be interesting information
as well.
For the substring searching, are you planning to use a "like" query in
MySQL in your homegrown implementation? I doubt that Elephant can
compete with that; although it can produce functional indexes that
technically have that power, it would take pretty sophisticated
programming to do it.
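For comparison, the behavior of a substring match over object names can
be sketched in plain Common Lisp as a linear scan. This is only an
illustration under stated assumptions: *STORE* is a hypothetical
in-memory hash table standing in for a persistent store, the keys are
3-lists of the kind described in the quoted message below, and none of
these names are Elephant API. An index built for the purpose would avoid
visiting every key.

```lisp
;; Plain Common Lisp only; *STORE* is a hypothetical in-memory stand-in
;; for a persistent store, and none of these names are Elephant API.
(defvar *store* (make-hash-table :test #'equal))

(defun add-object (key value)
  "Store VALUE under KEY, a 3-list such as (:frag1 :en \"bank\")."
  (setf (gethash key *store*) value))

(defun keys-matching (substring)
  "Return all keys whose name (third element) contains SUBSTRING.
This is a case-sensitive linear scan over every key."
  (loop for key being the hash-keys of *store*
        when (search substring (third key))
          collect key))
```

Calling (keys-matching "bank") after adding a few objects returns every
key whose name contains "bank", in no particular order.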
On Sat, 2006-01-14 at 12:09 -0800, Andrew Philpot wrote:
> Richard asked if I was still using Elephant. I thought I would send a
> quick note describing my experience using Elephant to persistently
> store a lisp-based ontology with multi-lingual lexical attachments.
>
> The "native" lisp implementation is pretty simple. Essentially, it is
> a semantic network implemented in Lisp using symbols for nodes and
> using property lists for (binary) relations. One additional wrinkle
> is that a given string might name more than one ontological element,
> most typically a concept and a possibly related, possibly unrelated
> English lexical item ("word"). Accordingly, symbols are interned in
> different CL packages according to their linguistic/ontological role
> (concept, lexical item, sense, and a few others); all packages
> containing a given fragment of the ontology are named per a convention
> allowing modular mixing and matching.
>
> A small schematic example:
>
> FRAG1-CONCEPT::|bank<side|
> :DEFINITION "side of a road or river"
> :HAS-SENSE (FRAG1-SENSE::|bank<side:EN:bank| FRAG1-SENSE::|bank<side:ES:banco|)
> :SUPER FRAG1-CONCEPT::|side<part|
> :SUB FRAG1-INSTANCE::|Left Bank|
> :SOURCE :WORDNET
> :PART-OF-SPEECH :NOUN
> :HAS-SUBJECT-DOMAIN DOM-CONCEPT::|engineering|
>
> FRAG1-SENSE::|bank<side:EN:bank|
> :HAS-CONCEPT FRAG1-CONCEPT::|bank<side|
> :HAS-LEXICAL-ITEM FRAG1-EN::|bank|
> :SOURCE :WORDNET
> :DERIVATION FRAG1-SENSE::|banked<modified::EN:banked|
>
> FRAG1-EN::|bank|
> :HAS-SENSE FRAG1-SENSE::|bank<side:EN:bank|
> :LANGUAGE :EN
> :SOURCE :WORDNET
>
> etc. etc. All in all there are roughly 600-700K of these named
> objects in my core ontology base, with millions more instances linked
> in below as well as other kinds of ancillary attachments.
>
> The above model using symbols for objects and property lists for
> relations is not very fancy nor "modern", but it's very convenient and
> relatively lightweight for an in-core implementation. However,
> interned symbols don't get GC'd even when no one cares about them
> anymore, so the Lisp address size limits what I can do. Hence
> persistence a la Elephant.
>
> My current model for implementing this in Elephant:
>
> A. Objects are 3-lists which serve as keys. FRAG1-EN::|bank| becomes
> (:frag1 :en "bank"). They are algebraic in the sense that I can easily
> construct them and view them as a composition of smaller bits; there is
> nothing that can't be GC'd if needed (and the cardinalities of the
> namespace and language keyword sets are both expected to be small).
>
> B. Relation networks are hash tables (EQL) keyed by the relation name.
>
> So the first one above is
>
> (add-from-root '(:frag1 :concept "bank<side")
>   (let ((ht (make-hash-table :test #'eql)))
>     (setf (gethash :DEFINITION ht) "side of a road or river")
>     (setf (gethash :HAS-SENSE ht)
>           (list '(:FRAG1 :SENSE "bank<side:EN:bank")
>                 '(:FRAG1 :SENSE "bank<side:ES:banco")))
>     (setf (gethash :SUPER ht) '(:FRAG1 :CONCEPT "side<part"))
>     (setf (gethash :SUB ht) '(:FRAG1 :INSTANCE "Left Bank"))
>     (setf (gethash :SOURCE ht) :WORDNET)
>     (setf (gethash :PART-OF-SPEECH ht) :NOUN)
>     (setf (gethash :HAS-SUBJECT-DOMAIN ht) '(:DOM :CONCEPT "engineering"))
>     ht))  ; return the table itself, not the value of the last SETF
>
> I'm reasonably happy with this implementation, but it does have some
> shortcomings:
>
> 1. Much slower than I expected. To load the above 600K entities into
> Lisp takes about 10 minutes; to load into elephant (presumably a
> one-time operation) takes about a day! Maybe transactions or wiser
> use of cursors or store controllers or what have you is called for,
> but I didn't expect this.
>
> 2. Takes a lot of space. Roughly 350 MB of Lisp source files turn
> into 3.6 GB of disk space. Again, maybe I'm doing something
> terribly wrong.
>
> I don't have data at this point about access times. I hope they don't
> reflect the load time.
>
> I'm still going forward on this. At this point, elephant is a long
> shot contender with a home grown RDBMS serialization back-ended in
> MySQL, which works but makes data updates laborious and installation a
> chore as well. To replicate the functionality of my other
> implementations, I will need some auxiliary indexing, in particular so
> I can ask for all objects whose names match a substring; and I will
> need some mechanism to handle relation inverses -- I want to say
> (:super A B) and be able to ask (:sub B A).
>
> It seems that a CLOS-based model might address some of the issues I
> just mentioned. I am kind of worried that it would be even slower.
> And of course, I've had some troubles with MOP. I'm not ruling this
> out.
>
> Any suggestions anyone might have would be welcome. When I have a
> web-based demo, I will also circulate it here.
>
> Thanks,
> Andrew Philpot