[elephant-devel] Re: Derived Indicies

Ian Eslick eslick at media.mit.edu
Thu May 8 16:17:53 UTC 2008


On May 8, 2008, at 11:32 AM, Leslie P. Polzer wrote:

>
>> (defpclass person ()
>>   ((name ...)
>>    (inbox :accessor inbox :initform (make-indexed-btree message)
>> 	:index-on (date sender))))
>>
>> Which would create a indexed-btree that stored messages and auto-
>> created indices on the date and sender slots?
>>
>> That is a bit more OODB/Lispy than the association mechanism I've
>> implemented.  However I don't think you need to overload defpclass to
>> get this functionality.  In fact I think it starts to get too ugly.
>
>
> I was thinking of this, but the other way round; MESSAGE should  
> contain
> the usual index declarations:
>
> (defpclass message ()
>  ((date :index t) ; create an index on DATE
>   (sender :index t))) ; create an index on SENDER
>
> These indices would be created in the correct "indexing namespace"
> when the object is put into a container. This is easy to maintain
> with, say, an INDEXED-SET class. Whatever.

Can you be more explicit about what indexes you imagine being created  
here and where they are stored and how they are accessed?

> This should probably also imply that objects of the MESSAGE class
> are not automatically registered in the store controller's class
> root (as it is now); this should only happen on
>
> (defpclass message ()
>  (...)
> (:index t))


How do you know what to sort the 'inbox' sorted-set by?  Does it sort  
on message date or sender or both?

FYI, the class indexing function is now implicit, since elephant  
maintains a master list of oid->class-schema to support the schema  
mechanism which is itself an indexed btree.  I think I have an option  
to make this connection weak so the class index isn't updated  
automatically.

>
>> inbox should be an object (class instance or btree) that has its own
>> api.
>
> In fact that's my current solution but I don't like it much. It's not
> good for quick development. You need to think too much about the
> storage part.

I think we need to work through this model as an independent extension  
first.  I'm leery at this point of premature optimization given the  
cost in complexity, testing, and API changes that these kinds of  
changes incur.  (plus I'm not going to have time for a major upgrade  
for some time).

>
>> However associations, like psets, are not sorted (dup-btree
>> oid:instance-ref).  The value of (user message) is a persistent  
>> object
>> that is added to a dup-btree maintained by the metaclass protocol.   
>> It
>> maps the oid of a user to the messages that store it.
>
> All too complicated. IMHO a great feature of Elephant is that it let

(setf (user message) charlie) and (inbox charlie) is too complicated?   
Associations implements and maintains your set-of-objects model  
without requiring you to explicitly add objects to the appropriate  
container.  Seems like alot of gain for two lines of code!  Moreover,  
it limits the proliferation of btrees which makes life easier for  
postmodern given the table-per-btree implementation restriction.

> you work with your objects without worrying much about the storage
> backend. As of now, this feature still needs to get better.
> Sure, it works (at least for me), but it's not as nice as it could
> be. Elephant should take care of all the low-level sorting stuff
> (probably creating indices wherever needed or even sorting without
> indices for prototyping).

That would definitely be nice, but I'm not convinced the increased  
complexity is worth the benefit.  These things need to work robustly  
in the presence of migration, schema changes, multiple stores,  
transactions, the existing MOP, disconnected operations and the  
limitations of all the different data stores, all without adversely  
impacting the base performance.

One hard part in implementing something like this is telling the  
system how to hook into the slot-value and (setf slot-value) functions  
on individual class slots without incurring significant overhead.  The  
query system should do the right thing without these connections via  
joins and you can add the association declarations when you need  
better performance.  You could figure this out on the fly, of course,  
but building a large index can tie up the system for quite some time  
and you don't want that to happen randomly.

Of course, all what you propose is doable.

>
>> It would be reasonably cheap to do query caching of these sorted
>> OIDs so that subsequent OFFSET & LIMIT style accesses over the same
>> query set would be fast, just instantiating those messages that are  
>> needed.
>
> While I'm at it: OFFSET and LIMIT (a real limit which lets you specify
> an arbitrary Lisp expression) are things we definitely want to aim
> for in 1.0. They are not difficult to implement at all, but they don't
> work with GET-INSTANCES-BY-* and, worse, MAP-BTREE. This means
> everyone has to write their own version of these functions that
> take appropriate arguments and move the cursor around themselves
> instead of relying on a simple high-level API.

Can't you generalize this today as a higher order function that does  
this as a scan over an index, something like:

(map-inverted-index class index (offset-limit-scanner offset limit- 
fn) :oids t)

(defun offset-limit-scanner (offset limit-fn &optional (sc *store- 
controller*))
   (let ((count 0))
     (lambda (oid)
        (incf count)
        (when (> count offset)
	 (let ((instance (controller-recreate-instance sc oid)))
            (when (limit-fn instance)
               (stop-mapping))))))

In general, I believe these are things to think about for as roadmap  
items for 1.1 and 1.2.  They won't happen soon enough to justify  
delaying 1.0 for months.  As I said above, I think this should be a  
contrib that implements this behind a macro that generates the  
appropriate methods for a set of generic functions.  get-instances-by- 
class was always a convenience, not a catch-all.

> I'd have implemented these extensions myself, but I thought it better
> to wait for the integration of the query language to add it.

Well, don't hold your breath.  :)  Unless someone other than me picks  
up the query system work, it could be months before I get around to it.

>
>> The derived index hack is still more efficient for large sets.
>> Without changes to the data stores to create an efficient way of
>> sorting concatenated values, I don't see a way to improve on it  
>> easily.
>
> I'm not sure you actually need concatenated index values at all
> if you manage your objects correctly. I.e. putting them in appropriate
> containers (the natural OODB way) as opposed to throwing them all
> together in some indexing namespace and then tediously (for programmer
> and machine) selecting the stuff you need.

Hmmm...I'll have to think about that.

Good ideas here, let's keep the ideas coming and even better, see some  
contributions/extensions that implement this without impacting the MOP  
and all the ultimate automation we might desire.

>  Leslie
>
> _______________________________________________
> elephant-devel site list
> elephant-devel at common-lisp.net
> http://common-lisp.net/mailman/listinfo/elephant-devel




More information about the elephant-devel mailing list