[elephant-devel] Severely poor performance in some obvious cases

Wed Nov 28 23:37:09 UTC 2007

Originally I wrote it get-instance-by-value as a convenience function  
when I knew that there was only one object being returned by an  
index.  If there are multiple objects that match the input term, then  
get-instance-by-value is effectively returning a random one.

There is going to be a huge performance difference on sets of this  
size between the BDB backend and the original CL-SQL backend because  
the SQL backend cannot support the ordering constraints of an index  
without reading all the objects into memory O(n) vs O(s) where s is  
the subset of the index size n that matches the input term where s can  
= 1.

I'm surprised you are seeing this difference on Postmodern, unless  
your input name is matching a large number of objects (i.e. s is  
large).  I had thought that the native PostgreSQL backend, postmodern,  
fixed the linear cost problem of CL-SQL indices and provides the same  
O(s) performance that BDB does, but I'm not certain (Henrik?  
Robert?).  It sounds like you are seeing a high linear cost.

As you point out, when the duplicate set is large, get-value on the  
secondary index does exactly the same thing without having to load in  
the whole duplicate set.  The fix you provided behaves the same and  
handles the case where you want a random member of a set and the set  
is large so there is no reason not to do it.

I'll go ahead and add this patch.

There is not SQL recording built into Elephant that I'm aware of,  
however Henrik or Robert should weigh in on this as it may be easy to  
add.

Cheers,
Ian

On Nov 28, 2007, at 3:53 AM, Alain Picard wrote:

>
> Dear Elephant developers,
>
> I've been considering using Elephant for a project of mine,
> and have been doing some basic performance tests, using the
> new postmodern back end (which seems way cool, btw).
>
> The scenario I'm testing is something like this; you
> have a base class:
>
> (defclass person-mixin ()
>  ((name :accessor person-name :initarg :name :index t))
>  (:metaclass persistent-metaclass))
>
> and a derived one:
>
> (defclass employee (person-mixin)
>  ((job       :accessor  job
>              :initarg  :job))
>  (:metaclass persistent-metaclass))
>
> And you go off an make a million instances of employees.
> [Let's say we're a very big corporation.  :-)]
>
> Then when I did the following:
>
> (time (get-instance-by-value 'employee 'name name))
>
> I was surprised to find that not only is it slow, but it conses
> like a madman.  This led me to inspect what this function actually
> does, and it turns out that it ends up doing a map-index, which
> does a with-btree-cursor to find a get-instances-by-value and
> then throws away all but the first.
>
> ;;; Current definition, in 0.9.1
>
> (defmethod get-instance-by-value ((class symbol) slot-name value)
>  (let ((list (get-instances-by-value (find-class class) slot-name  
> value)))
>    (when (consp list)
>      (car list))))
>
> (defmethod get-instance-by-value ((class persistent-metaclass) slot- 
> name value)
>  (let ((list (get-instances-by-value class slot-name value)))
>    (when (consp list)
>      (car list))))
>
> It seems odd to create a cursor to find something when you have an
> index on that slot.  Also, it seems to me users of
> GET-INSTANCE-BY-VALUE probably imagine there is only 1 instance to
> return; and so would there be a huge problem in using something like
> the following instead:
>
> ;;; Proposed definitions:
>
> (defmethod get-instance-by-value ((class persistent-metaclass) slot- 
> name value)
>  (let ((bt (find-inverted-index class slot-name)))
>    (if bt
> 	(get-value value bt) ; Do it the "simple" way
> 	(first (get-instances-by-value class slot-name value)))))
>
> (defmethod get-instance-by-value ((class symbol) slot-name value)
>  (get-instance-by-value (find-class class) slot-name value))
>
> This is more than a factor of 10 faster under elephant/postmodern
> for a class with 30,000 instances.
>
> Am I missing something really basic here?  Is there a simpler
> way to do what I want without this performance penalty?
> Will this simply not work for some other back ends I'm not
> aware of?  I feel a certain tension in the code trying to
> be "all things to all back-ends", and certain decisions are
> clearly inspired by the Berkeley DB back end, which sadly I
> could not use for the venture I have in mind (for licensing reasons).
>
>
> Lastly, is there a way to trace all the SQL commands going
> back and forth to postgresql in postmodern?  So far I've resorted
> to Postgres statement logging, which is painful to match up
> with what the application does.  I'm looking for the postmodern
> equivalent of CLSQL's START-SQL-RECORDING.
>
>
> Thanks in advance!
>
>                                Alain Picard
>
>
>
> -- 
> Please read about why Top Posting
> is evil at: http://en.wikipedia.org/wiki/Top-posting
> and http://www.dickalba.demon.co.uk/usenet/guide/faq_topp.html
>
> Please read about why HTML in email is evil at: http://www.birdhouse.org/etc/evilmail.html
> _______________________________________________
> elephant-devel site list
> elephant-devel at common-lisp.net
> http://common-lisp.net/mailman/listinfo/elephant-devel