[elephant-devel] Complex queries
Ian Eslick
eslick at csail.mit.edu
Sun Nov 12 18:11:40 UTC 2006
I'm interested in this problem myself, but haven't had the time to do
the appropriate research. Many SQL query engines are built on top of a
BTree infrastructure so building and performing optimized queries should
be quite doable in Elephant. However, I do know that efficient
operations are fairly complicated.
I'd be happy to support your work by including a search interface +
simple compiler into the main Elephant release.
I'm assuming you want to search for persistent objects and not standard
objects stored in BTrees.
There are two ways to extract a set of objects associated with a
particular slot value:
1) Get all objects of the desired type and do a linear search
2) Create a slot-index for searchable slots. You can use the accessor
functions in get-instances-by-value and get-instances-by-range. This
returns the persistent object. Current cursor accesses require full
deserialization of the target object but fortunately this doesn't
include the slot, just the oid+classname, etc.
If you extract a range of objects using a slot index, they'll already be
ordered according to the sort order of the index values. You can do
joins by filtering each list by whether the object exists in the next
list or not.
i.e.
age = 18 (oid1, oid2, oid3, oid4, oid5, oid6, oid7 ... )
name = "Cindy" (oid4, oid7, oid3 ...)
state = MA (oid10, oid4, oid7 ...)
Filtered objects = oid4, oid7 ...
Optimization issues:
1) Returned objects are in value order, not OID or object order; so each
merge of lists is naively C*N^2 where C is the number of constraints (3
above) and N is the average size of a returned list. This can be
avoided by sorting each list (N log N) prior to merging which is
effectively M log M (where M = the sum of all returned objects for all
the constraints). In fact if you append the lists (N) and then sort
removing duplicates M log M you get the same result. You then need
another N log N walk of the list to order the elements according to the
original list of constraints.
I'm sure there's a better way, but this is a decent first order take.
2) Now this works fine when the # of objects in a given query are
small. But let's say you want to pick all adults with the name "Jerry"
from a 1M entry database of people. This is a range query
(get-instances-by-range 'person 'age 18 120) and a value query
(get-instances-by-value 'person 'first-name "Jerry"). However, the # of
Jerry instances and in particular the # of age > 18 instances will be
enormous (at least 500k). You might not even be able to fit all of
these in memory!
In this case we need to know how large the sets of objects are under
given constraints. (I'm not sure if BDB provides a way to get access to
the size of the index between two values - the total size of the tree +
the height of the nodes should be telling as the BTrees are
balanced...). In this case we can try to use cursors or slot accessors
to do the filtering based on the total size and accessibility of the
queries. At this point it gets somewhat complicated and I think it's
worthwhile to appeal to the SQL implementation literature.
Also, what happens if a slot is not indexed? How do we search multiple
object classes at the same time? For example, how would we specify a
search over instances and it's super classes when the superclass also
contain the slot constraints? (Currently class indexes are specific to
a class name)
Is this helpful? A first cut approach as described in #1 above will
probably flesh out a large number of issues and shouldn't take that much
time. If you are motivated, or have large queries, then it's
appropriate to spend some time talking about how to handle the issues of
large sets in #2.
Query language:
Does it make sense to have a full query language, or a simpler macros to
make this simple? The easy macro approach probably doesn't need
anything sophisticated:
(retrieve-instances-by-constraints ((person :with-superclasses t) employee)
((age 18 120)
(name "George")
(state :MA))
(:symbols-as-strings t :order t))
Here we search for person objects, parents of person objects and
employee objects for the three constraints. The search option
:symbols-as-strings means to treat :MA as matching :MA or "MA" - that
means two slot queries. Not sure that's a good idea, but it's an
example of modifying how the query operates. :order means to order the
resulting objects by their age, in this case. The order of ranges in
the constraint list would determine the total order.
What if we want to do join queries?
How about OR constraints? (state (or :MA :NY))
Ian
More information about the elephant-devel
mailing list