[elephant-devel] Querying Advice

Ian Eslick eslick at csail.mit.edu
Mon Nov 13 05:18:52 UTC 2006


>>
>> First of all, ask yourself, what is the size of your dataset?  Can
>> you fit it all into memory?
>> If so, you have the full power of lisp at your command in dealing
>> with the querying.  You
>> will not have to write any macros to do this.  You might find the DCM
>> package, in the "contrib"
>> directory, a useful package, although it does not address querying;
>> it is more of a cache handling
>> issue.  (DCM has only been tested under SBCL, as far as I know.)
>
> In general, I do think that the dataset fits in memory. However, we
> have not fully loaded all of our data into Elephant. Simply looking at
> the MySQL file storage for the database, it occupies 900MB of disk
> space, including indices. However, when we did loads of some of the
> tables into Elephant, the size of the Elephant data files were, at
> least, 5 times the size. I don't know the reason why. I don't know if
> that's the nature of BDB when it stores "arbitrarily" any type of
> object in the k,v pair. I don't know if we had some circular
> references when we loaded our model and that just simply increased the
> data file by that much (although I wouldn't think so, since if that
> was the case, I would expect that it only stored references to objects
> and not duplicating the objects). Regardless, our dev server currently
> has 4GB of RAM. I think that once properly loaded, all the data should
> be able to fit in memory.
Were you storing persistent-metaclass objects or simply normal objects? 
Normal objects are huge relative to persistent objects as currently all
the slot names are also serialized (potentially in 32-bit unicode if on
SBCL).  I have some improvements planned to reduce the storage and
processing overhead for strings, but it will take some evaluation to
make sure performance doesn't suffer over much.  Some of this will make
it into the forthcoming 0.6.1 release I'm trying to finish a bunch of
work for prior to my twin daughters are born. 

Also SBCL strings in the 0.6.0 release are stored in their internal
32-bit form so we can memcpy them back and forth between elephant and
the internal format.  The new serializer will store them in the minimal
format and then convert back to the 32-bit internal form for unicode
enabled SBCL.
> Now, for the target application, I prefer not to rely in the data
> fitting in memory. Reason being is that the nature of the application
> requires the data to be available for several years. This 900MB is the
> results of only 1 year for one company. As we get more companies to
> use the application and keep the data online for several years, the
> assumption of the data fitting into memory will no longer be applicable.
See my delayed e-mail on a hybrid query approach...
> I have been looking into the DCM package and I think that it certainly
> looks promising. We haven't used it yet, but certainly hope that
> sooner, rather than later, will be made part of Elephant permanently.
> I also hope that it's not targeted mainly at the in-memory database
> type of application, but rather, as an efficient caching mechanism for
> persistent data (regardless of where it's being permanently stored).
DCM is, as I understand it, a form of prevalence with explicit
checkpointing using Elephant as the backend.
> With regards to: "...If so, you have the full power of lisp at your
> command in dealing with the querying...", I agree with you. However,
> where I'm trying to get at is how "easy" would it be to generate these
> type of dynamic queries in a generic way. Of course, we could always
> hard code all the cases for each of our different searchable screens,
> but the thought of that simply just makes me vomit :)
Associate a constraint on the display with the user action for
selections and an ordering function for specific columns.  If you want
to sort by multiple columns, then compose your functions:

Selection constraint fn:
#'(lambda (obj) (= (slot-accessor obj) value)

Sorting fn:
#'(lambda (obj1 obj2) (string< (name obj1) (name obj2)))
#'(lambda (obj1 obj2) (< (age obj1) (age obj2)))

Composed sorting function:
(defun compose-sorts (sort1 sort2)
    (lambda (o1 o2)
        (or (funcall sort1 o1 o2)
             (funcall sort2 o1 o2))))

Same thing with constraints:
(defun compose-constraint (c1 c2)
    (lambda (obj)
        (and (funcall c1 obj)
                (funcall c2 obj))))

Or to be a little more fancy:
(defun compose-constraints (clist)
    (lambda (obj)
        (every #'(lambda (c) (funcall c obj)) clist)))

(defun compose-sorts (slist)
    (lambda (o1 o2)
        (some #'(lambda (sort) (funcall sort o1 o2)))))

This all comes down to:

(defun select-results (objects)
    (select-if (compose-constraints (make-name-constraints "Bob")
(make-age-constraint 18 120))
                  objects))                

(defun order-results (objects)
    (sort (copy-list objects)
            (compose-sorts #'by-occupation #'by-name #'by-date-of-birth)))

This way you can have a set of widgets with static constraints and then
you can compose functions over the constraints based on the dynamic
query.  There may be more elegant ways to do this but I think this is
what Robert was driving at.

To do this on elephant simply requires that you (get-instances-by-class
'person) to get all instances of person and then run the lisp query over
it.  If you access person alot, these persistent references will be
cached in memory.  Or they're all in the cache including slot values in
DCM - in which case these queries will run really fast!

>>
>> Under SBCL, when it comes to sorting you have "sort" and
>> "stable-sort"; I think these are build in.
>> I'll eat a candle if you don't find them to blazingly fast (although
>> the predicates that you pass them
>> might take some time.)
>
> I thought the answer to my sorting question is exactly addressed by
> your comment. I suppose that once I have the resulting dataset, I
> could run it by "sort". They key would be how to make it arbitrarily
> sortable (in a similar way as the dynamic query)
Sorting should be easy to do in this manner once you've found your
target group of objects and when you've cached all the slot values
in-memory DCM style.  My proposal includes some sorting in the retrieval
to avoid accessing and serializing the slot values. 

Every step of a cursor over a secondary index deserializes the key, the
primary BTree key and the value (the persistent object record) from that
primary BTree.  This does not deserialize the slot values so is
reasonably fast.
>>
>> I think really the only good way to answer this question in a deeper
>> way is to provide some
>> example code.  I do exactly what you are talking about in my
>> application (http://konsenti.com),
>> (although I use DCM), so I ought to be able to produce an example
>> program relatively quickly.
>> You'll have to figure out how to map the GUI into those requests
>> yourself, however.
>>
>
> I believe (and hope) that we should have no problem mapping the GUI to
> the requests. Just out of curiosity (and I don't mean to divert from
> the topic of this thread): if you're using DCM for your konsenti (BTW,
> nice concept) site, how do you protect your in-memory data? Do you
> just write an image to disk every once in a while for back ups? How
> resilient is this to hardware failure and you loosing data since the
> last image (if that's your approach)?
>
Every time you finish or 'commit' a user transaction (form submit), just
write the object data back to the Elephant persistent store before
acking to the user.


Ian



More information about the elephant-devel mailing list