[elephant-devel] Complex queries
Daniel Salama
lists at infoway.net
Mon Nov 13 05:42:02 UTC 2006
Thanks. I'll be great to contribute anything to the project and a
pleasure working with you guys.
On Nov 12, 2006, at 1:11 PM, Ian Eslick wrote:
> I'm interested in this problem myself, but haven't had the time to do
> the appropriate research. Many SQL query engines are built on top
> of a
> BTree infrastructure so building and performing optimized queries
> should
> be quite doable in Elephant. However, I do know that efficient
> operations are fairly complicated.
>
> I'd be happy to support your work by including a search interface +
> simple compiler into the main Elephant release.
>
> I'm assuming you want to search for persistent objects and not
> standard
> objects stored in BTrees.
Correct. Just persistent objects
>
> There are two ways to extract a set of objects associated with a
> particular slot value:
> 1) Get all objects of the desired type and do a linear search
> 2) Create a slot-index for searchable slots. You can use the accessor
> functions in get-instances-by-value and get-instances-by-range. This
> returns the persistent object. Current cursor accesses require full
> deserialization of the target object but fortunately this doesn't
> include the slot, just the oid+classname, etc.
Pardon my lisp ignorance, but is there a way to introspect an object/
class in lisp and find out whether or not a slot is "indexable"? That
way, maybe whatever code is querying for data can take advantage of
any slot-index before defaulting to linear searching.
>
> If you extract a range of objects using a slot index, they'll
> already be
> ordered according to the sort order of the index values. You can do
> joins by filtering each list by whether the object exists in the next
> list or not.
>
> i.e.
>
> age = 18 (oid1, oid2, oid3, oid4, oid5, oid6, oid7 ... )
> name = "Cindy" (oid4, oid7, oid3 ...)
> state = MA (oid10, oid4, oid7 ...)
>
> Filtered objects = oid4, oid7 ...
>
I would assume that you might have meant intersection instead of
join? Right?
> Optimization issues:
>
> 1) Returned objects are in value order, not OID or object order; so
> each
> merge of lists is naively C*N^2 where C is the number of
> constraints (3
> above) and N is the average size of a returned list. This can be
> avoided by sorting each list (N log N) prior to merging which is
> effectively M log M (where M = the sum of all returned objects for all
> the constraints). In fact if you append the lists (N) and then sort
> removing duplicates M log M you get the same result. You then need
> another N log N walk of the list to order the elements according to
> the
> original list of constraints.
>
> I'm sure there's a better way, but this is a decent first order take.
>
It's too late at night for me to compute/validate this Big O notation
this but will confirm/comment tomorrow :)
> 2) Now this works fine when the # of objects in a given query are
> small. But let's say you want to pick all adults with the name
> "Jerry"
> from a 1M entry database of people. This is a range query
> (get-instances-by-range 'person 'age 18 120) and a value query
> (get-instances-by-value 'person 'first-name "Jerry"). However, the
> # of
> Jerry instances and in particular the # of age > 18 instances will be
> enormous (at least 500k). You might not even be able to fit all of
> these in memory!
>
This is exactly my main concern. Specially when, for example, the age
of a person is a slot of a class called demographic to which the
class person has a relationship.
> In this case we need to know how large the sets of objects are under
> given constraints. (I'm not sure if BDB provides a way to get
> access to
> the size of the index between two values - the total size of the
> tree +
> the height of the nodes should be telling as the BTrees are
> balanced...). In this case we can try to use cursors or slot
> accessors
> to do the filtering based on the total size and accessibility of the
> queries. At this point it gets somewhat complicated and I think it's
> worthwhile to appeal to the SQL implementation literature.
Agree. I'll see if I can research any SQL implementation on how they
compute this. After all, SQL is all about boolean math, so they have
to have a way to easily compute this.
> Also, what happens if a slot is not indexed? How do we search
> multiple
> object classes at the same time? For example, how would we specify a
> search over instances and it's super classes when the superclass also
> contain the slot constraints? (Currently class indexes are
> specific to
> a class name)
I suppose we'll find the answer of this somewhere in a SQL
implementation, unless anyone already knows and can contribute that
knowledge (I guess those are the pros and cons of going to an object
database rather than a relational database).
>
>
> Is this helpful? A first cut approach as described in #1 above will
> probably flesh out a large number of issues and shouldn't take that
> much
> time. If you are motivated, or have large queries, then it's
> appropriate to spend some time talking about how to handle the
> issues of
> large sets in #2.
>
I think #2 is where we (and I don't mean just my team) want to go
since it will provide the most flexibility.
> Query language:
>
> Does it make sense to have a full query language, or a simpler
> macros to
> make this simple? The easy macro approach probably doesn't need
> anything sophisticated:
>
> (retrieve-instances-by-constraints ((person :with-superclasses t)
> employee)
> ((age 18 120)
> (name "George")
> (state :MA))
> (:symbols-as-strings t :order t))
>
> Here we search for person objects, parents of person objects and
> employee objects for the three constraints. The search option
> :symbols-as-strings means to treat :MA as matching :MA or "MA" - that
> means two slot queries. Not sure that's a good idea, but it's an
> example of modifying how the query operates. :order means to order
> the
> resulting objects by their age, in this case. The order of ranges in
> the constraint list would determine the total order.
>
IMHO, macros are just fine. You can always repackage them in a
separate domain language.
> What if we want to do join queries?
> How about OR constraints? (state (or :MA :NY))
>
I have the same questions.
>
> Ian
>
Thanks,
Daniel
More information about the elephant-devel
mailing list