[elephant-devel] Complex queries

Mon Nov 13 05:42:02 UTC 2006

Thanks. I'll be great to contribute anything to the project and a  
pleasure working with you guys.

On Nov 12, 2006, at 1:11 PM, Ian Eslick wrote:

> I'm interested in this problem myself, but haven't had the time to do
> the appropriate research.  Many SQL query engines are built on top  
> of a
> BTree infrastructure so building and performing optimized queries  
> should
> be quite doable in Elephant.  However, I do know that efficient
> operations are fairly complicated.
>
> I'd be happy to support your work by including a search interface +
> simple compiler into the main Elephant release.
>
> I'm assuming you want to search for persistent objects and not  
> standard
> objects stored in BTrees.

Correct. Just persistent objects

>
> There are two ways to extract a set of objects associated with a
> particular slot value:
> 1) Get all objects of the desired type and do a linear search
> 2) Create a slot-index for searchable slots.  You can use the accessor
> functions in get-instances-by-value and get-instances-by-range.  This
> returns the persistent object.  Current cursor accesses require full
> deserialization of the target object but fortunately this doesn't
> include the slot, just the oid+classname, etc.

Pardon my lisp ignorance, but is there a way to introspect an object/ 
class in lisp and find out whether or not a slot is "indexable"? That  
way, maybe whatever code is querying for data can take advantage of  
any slot-index before defaulting to linear searching.

>
> If you extract a range of objects using a slot index, they'll  
> already be
> ordered according to the sort order of the index values.  You can do
> joins by filtering each list by whether the object exists in the next
> list or not.
>
> i.e.
>
> age = 18 (oid1, oid2, oid3, oid4, oid5, oid6, oid7 ... )
> name = "Cindy"  (oid4, oid7, oid3 ...)
> state = MA  (oid10, oid4, oid7 ...)
>
> Filtered objects = oid4, oid7 ...
>

I would assume that you might have meant intersection instead of  
join? Right?

> Optimization issues:
>
> 1) Returned objects are in value order, not OID or object order; so  
> each
> merge of lists is naively C*N^2 where C is the number of  
> constraints (3
> above) and N is the average size of a returned list.  This can be
> avoided by sorting each list (N log N) prior to merging which is
> effectively M log M (where M = the sum of all returned objects for all
> the constraints).  In fact if you append the lists (N) and then sort
> removing duplicates M log M you get the same result.  You then need
> another N log N walk of the list to order the elements according to  
> the
> original list of constraints.
>
> I'm sure there's a better way, but this is a decent first order take.
>

It's too late at night for me to compute/validate this Big O notation  
this but will confirm/comment tomorrow :)

> 2) Now this works fine when the # of objects in a given query are
> small.  But let's say you want to pick all adults with the name  
> "Jerry"
> from a 1M entry database of people.  This is a range query
> (get-instances-by-range 'person 'age 18 120) and a value query
> (get-instances-by-value 'person 'first-name "Jerry").  However, the  
> # of
> Jerry instances and in particular the # of age > 18 instances will be
> enormous (at least 500k).  You might not even be able to fit all of
> these in memory!
>

This is exactly my main concern. Specially when, for example, the age  
of a person is a slot of a class called demographic to which the  
class person has a relationship.

> In this case we need to know how large the sets of objects are under
> given constraints.  (I'm not sure if BDB provides a way to get  
> access to
> the size of the index between two values - the total size of the  
> tree +
> the height of the nodes should be telling as the BTrees are
> balanced...).  In this case we can try to use cursors or slot  
> accessors
> to do the filtering based on the total size and accessibility of the
> queries.  At this point it gets somewhat complicated and I think it's
> worthwhile to appeal to the SQL implementation literature.

Agree. I'll see if I can research any SQL implementation on how they  
compute this. After all, SQL is all about boolean math, so they have  
to have a way to easily compute this.

> Also, what happens if a slot is not indexed?  How do we search  
> multiple
> object classes at the same time?  For example, how would we specify a
> search over instances and it's super classes when the superclass also
> contain the slot constraints?  (Currently class indexes are  
> specific to
> a class name)

I suppose we'll find the answer of this somewhere in a SQL  
implementation, unless anyone already knows and can contribute that  
knowledge (I guess those are the pros and cons of going to an object  
database rather than a relational database).

>
>
> Is this helpful?  A first cut approach as described in #1 above will
> probably flesh out a large number of issues and shouldn't take that  
> much
> time.  If you are motivated, or have large queries, then it's
> appropriate to spend some time talking about how to handle the  
> issues of
> large sets in #2.
>

I think #2 is where we (and I don't mean just my team) want to go  
since it will provide the most flexibility.

> Query language:
>
> Does it make sense to have a full query language, or a simpler  
> macros to
> make this simple?  The easy macro approach probably doesn't need
> anything sophisticated:
>
> (retrieve-instances-by-constraints ((person :with-superclasses t)  
> employee)
>     ((age 18 120)
>      (name "George")
>      (state :MA))
>     (:symbols-as-strings t :order t))
>
> Here we search for person objects, parents of person objects and
> employee objects for the three constraints.  The search option
> :symbols-as-strings means to treat :MA as matching :MA or "MA" - that
> means two slot queries.  Not sure that's a good idea, but it's an
> example of modifying how the query operates.  :order means to order  
> the
> resulting objects by their age, in this case.  The order of ranges in
> the constraint list would determine the total order.
>

IMHO, macros are just fine. You can always repackage them in a  
separate domain language.

> What if we want to do join queries?
> How about OR constraints?  (state (or :MA :NY))
>

I have the same questions.

>
> Ian
>

Thanks,
Daniel