[elephant-devel] Fwd: some patches

Fri Aug 3 15:49:59 UTC 2007

I agree with all of your last email.

However, I think that in your discussion of the complexity of the
existing operations on the CL-SQL side you are implying that there are
lots of O(n) operations, when in fact this occurs only when one indexes
a slot that has a small number of values relative to the number of
objects, or uses a very non-selective functional index.  I wrote that
stuff before you implemented slot indexing.  You are correct, though:
in some cases it will perform very poorly.  But we now have the
Postmodern back-end (which I am already using in production), and it
probably doesn't suffer from the same problems.  Moreover, the
licensing issues with BDB seem to have gone away.

I understand that you and I have slightly different approaches.  I am
thinking of Prevalence-style implementations that don't rely on
disk-based indices (for example, DCM uses hash-tables), and you are
thinking of databases way too large to fit into memory, which rely on
designing an index structure with the powerful Elephant mechanism to
get an efficient system.  I much prefer that to supporting the notion
of using an ORM, as you mention.  In fact, I hope Elephant will be an
ORM-zilla: I hope it will teach people there is no need to deal with
that impedance mismatch.

I think your clarification of the specification (by saying it is
unspecified) is very helpful.

With respect to the "open issue" you raise about
"get-instances-by-range", I wrote the following little article:

I take as a basic assumption that a B-Tree is a data structure, not an
abstract data type.  The abstract data type that it implements is an
Ordered Set, that is, a set equipped with a total order (a total order
being one that is antisymmetric, transitive and complete).

A system like Elephant, or a relational database, should present an
abstract data type, and not expose implementation details.  That's why
we call relational databases "relational" rather than "ISAM-al" or
"BTree-al".  Even though in practice they fall short of it (since
duplicate tuples are typically allowed, so that strictly speaking they
implement multisets rather than sets), what we call RDBMS proper are
based upon the notion of the relation, an abstract data type.  I think
Elephant should be based on the simpler notion of a Set or a
Dictionary, and support the notion of an Ordered Set.

There are two basic Abstract Data Types that Elephant provides.  This
is a kind of legacy of the original API.  We implement the "Map" ADT and
the "Ordered Set" ADT.  Examples of "Map" operations are
"add-to-store(key,value)" and "get-from-store(key,value)".  Persistent
classes are examples of the "Set" or "Ordered Set" ADT.  A fundamental
problem with thinking in terms of these Abstract Data Types as defined
by Wikipedia, for example, is that they are not defined to have an
"enumerate all values" operation.  This is an abstract operation but
is essential in practice.  If you can't do that, you have to have
a separate store to keep track of the objects in the store, and that
is obviously no good.

(We provide nice mechanisms for enumerating all objects in both cases.
We have the "map-btree" function and you have written "map-class".
In practice, there are also cursors to do similar things.)

It seems to me an open question whether we should define persistent
classes to be a "Set" or an "Ordered Set".

As you point out, lisp doesn't provide an ordering for objects.  This
is because, unlike the integers, it is unclear what the ordering should
be.  Even for strings, the ordering should technically be dependent on
the language in which you interpret the strings---not every culture 
orders their letters in the same way.

It seems to me that everything would be a lot cleaner if we think
of fundamentally presenting the "Set" ADT rather than the "Ordered Set"
ADT.  The "Set" ADT does not support "get-instances-by-range", but the 
"Ordered Set" does (however, this must be based on a LISP-defined
order.)
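
To make the distinction concrete, here is a minimal sketch in Python
(chosen only for brevity; none of these names are part of Elephant).
The class "OrderedSet" and its "in_range" method are hypothetical, and a
sorted list stands in for the B-Tree, so insertion here is O(n) where a
real B-Tree would give O(log n).  The point is only that a range query
needs an order the caller can compute, while plain membership does not.

```python
import bisect


class OrderedSet:
    """Hypothetical Ordered Set ADT; a sorted list stands in for a B-Tree."""

    def __init__(self, key=lambda x: x):
        self._key = key      # caller-supplied total order, given as a sort key
        self._keys = []      # sorted keys, kept parallel to self._items
        self._items = []

    def add(self, item):
        k = self._key(item)
        i = bisect.bisect_left(self._keys, k)
        if i < len(self._keys) and self._keys[i] == k:
            return           # items with equal keys count as the same element
        self._keys.insert(i, k)
        self._items.insert(i, item)

    def __contains__(self, item):
        k = self._key(item)
        i = bisect.bisect_left(self._keys, k)
        return i < len(self._keys) and self._keys[i] == k

    def __iter__(self):
        return iter(self._items)

    def in_range(self, start, end):
        """Items whose key lies in [start, end]; only possible with an order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_right(self._keys, end)
        return self._items[lo:hi]


names = OrderedSet(key=str.lower)
for n in ["delta", "Alpha", "charlie", "Bravo"]:
    names.add(n)

in_b_to_d = names.in_range("b", "d")
```

A plain Set would offer only "add", membership and enumeration; the
"in_range" operation is exactly what the order buys you.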

In your previous email you suggest a specification wherein Elephant
makes no guarantee about the order of unorderable types.  This is
effectively the specification of a "Set" type.  The clearest thing
of course would be if our test suite reflected the distinction between
Sets and Ordered Sets by testing the persistence of order on Ordered
Sets but not relying upon it at all for Sets.
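
Such a split in the test suite might look like the following Python
sketch.  The function "enumerate_store" is a hypothetical stand-in for
whatever enumeration a backend offers (it is not an Elephant function);
it deliberately returns the items in an arbitrary order, to mimic a
backend whose order for unorderable types is unspecified.

```python
import random


def enumerate_store(items, seed):
    """Hypothetical backend enumeration: same items, unspecified order."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def check_set_semantics(stored, enumerated):
    # Set: only membership matters, so compare without regard to order.
    return set(stored) == set(enumerated) and len(stored) == len(enumerated)


def check_ordered_set_semantics(stored, enumerated, key):
    # Ordered Set: enumeration must follow the declared order exactly.
    return list(enumerated) == sorted(stored, key=key)


symbols = ["foo", "bar", "baz", "quux"]
set_ok = all(check_set_semantics(symbols, enumerate_store(symbols, s))
             for s in range(5))

numbers = [3, 1, 2]
ordered_ok = check_ordered_set_semantics(numbers, sorted(numbers),
                                         key=lambda n: n)
unordered_fails = not check_ordered_set_semantics(numbers, [3, 1, 2],
                                                  key=lambda n: n)
```

The Set check passes whatever order the backend happens to return,
while the Ordered Set check fails as soon as the order is not the
declared one.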

If one thinks about the asymptotic complexity of both of these kinds
of ADTs, it is reasonable to expect O(log(n)) insertion, deletion, and
lookups in all cases.  

Elephant provides several index mechanisms.  The most elegant is
the indexing of a slot value, which you implemented; but there are
also function indexes.  For example, one can implement a boolean
function "rhymes-with-cat" and build an index on that.  This allows
you to efficiently enumerate only the lisp objects for which
"rhymes-with-cat" is true.  More reasonably, one could build an
index on a class slot that had a boolean value.

This is the place where the current CL-SQL backend has an O(n)
operation: when there are duplicate keys in an index.  Recall that I
coded that before you wrote the class indexing stuff.  I personally
don't think it is very important, and I use DCM anyway.  If anybody
actually encounters this as a performance problem, I'll be happy to
work on it, unless of course you're using Postgres, in which case I
would recommend switching to the new (and not officially released)
Postmodern back-end.  I have done that on my production site already.
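
As an illustration of why low selectivity hurts (this is not the CL-SQL
backend's actual code, just a Python sketch with hypothetical names):
suppose an index is kept sorted by key only, and locating one object's
entry means seeking to the key and then scanning its duplicates.  With
a boolean key, that scan touches a constant fraction of all entries.

```python
import bisect


def steps_to_find(index, key, oid):
    """Seek to `key` by binary search, then scan duplicates for `oid`.

    `index` is a list of (key, oid) pairs sorted by key only.
    Returns the number of duplicate entries examined.
    """
    keys = [k for k, _ in index]          # built here only to feed bisect
    i = bisect.bisect_left(keys, key)
    steps = 0
    while i < len(index) and index[i][0] == key:
        steps += 1
        if index[i][1] == oid:
            return steps
        i += 1
    raise KeyError((key, oid))


n = 1000
# Boolean key (a "rhymes-with-cat" style predicate): two distinct values.
boolean_index = sorted(((oid % 2 == 0, oid) for oid in range(n)),
                       key=lambda pair: pair[0])
# Selective key: every object has its own key value.
unique_index = [(oid, oid) for oid in range(n)]

worst_boolean = steps_to_find(boolean_index, True, n - 2)
worst_unique = steps_to_find(unique_index, n - 1, n - 1)
```

Here the boolean index examines n/2 entries in the worst case while the
selective index examines one.  Sorting on the (key, oid) pair instead
would let the binary search land directly on the entry.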

On Fri, 2007-08-03 at 10:01 -0400, Ian Eslick wrote:

> First of all, BDB does not sort on the serialized values except for  
> values for which no lisp ordering exists.  BDB is given a C function  
> which decodes the serialized format on the fly, without talking to  
> Lisp, but properly orders all of lisp's orderable objects.
> 
> Orderable lisp types:
> 1. Numeric types: all numeric types are sorted identically to lisp (=  
> < > <= >=, etc)
>     Numeric types include fixnum, integer, float, double-float, rational
> 2. String types: all string types are sorted identically to lisp  
> (string< string<= string=, etc)
> 
> Unorderable lisp types:
> 1. Symbols: I do sort symbols by symbol-name as a convenience.
>     Lisp does not provide an ordering fn for symbols so this is additive
> 2. Objects, structs, etc: lisp does not provide an ordering function  
> for objects
> 3. Arrays, hash tables, etc: lisp does not provide an ordering  
> function for these types
> 
> Some of these types can be grouped by eq, eql, equal or equalp.  More  
> on this below.
> 
> Because a BTree requires some order for all objects, we need to  
> expand on the lisp specification of ordering.  First we sort on type  
> so that all unorderable objects of a given type are stored as a  
> group.  Then we choose an arbitrary order (binary rep).  Now most of  
> the hairy issues arise when we mix types in an index and want to  
> traverse a range in the index or do cursor ops on an index.  The  
> specification is this:
> 
> 1) Elephant makes no guarantee about the order of unorderable types.   
> Portable implementations should not depend on an order being  
> consistent.  Unorderable types should be treated as sets.  Anything  
> that is equalp in lisp will be grouped together in the data store  
> (arrays, structs, symbols)
> 
> 2) Elephant makes no guarantee about the relative order of different  
> types in an index.
> 
> *** There is an open issue: how to specify a range by object type.   
> You can't do :start and :end because you don't know what object will  
> be at the start and end due to #1 and #2 above.  I think this should  
> be dealt with as a special case and a less efficient procedure used.
> 
> I can't agree with either of your proposals due to performance issues:
> 
> 1) The performance impact of (position, value) is extreme.  I turn  
> all of the BTree O(log N) operations into O(N).  I now have to read  
> all values out of the BTree until I identify the insertion point,  
> then renumber all the following elements.  This defeats the whole  
> purpose of having a BTree in the first place.
> 
> 2) I hate to say it but the CL-SQL approach to cursor operations is  
> not just terribly inefficient, it also defeats the purpose of having  
> indices on disk by requiring that all values be pulled into memory  
> to support cursor or map operations.  I see the CL-SQL backend as a  
> solution to licensing problems for small databases.  For complex  
> queries over large databases one is better off with an object  
> relational solution like that implemented by CL-SQL's native metaclass.
> 
> The reason that lisp needs to implement a more complex sorting  
> function to support cursor operations over BTrees is that BTrees  
> benefit from ordering types relative to each other (we can't  
> intersperse numeric and non-numeric types without violating lisp  
> ordering) and the cursor fn's need to mirror this.  This doesn't  
> violate lisp ordering, it is additive.  The clarity of the required  
> constraints could be improved, though.
> 
>  > There is a hidden danger in relying upon an order based on the
>  > serialized value---namely, you can now not swap out serializers
>  > without drastic side effects.  Since one of the main ways that we
>  > can improve performance is by writing better serializers (and, in
>  > particular, serializers that are specific to particular data
>  > types), this seems like a bad idea.
> 
> If designers adhere to the two restrictions above, then a new  
> serializer that supports lisp ordering and introduces a new ordering  
> of types or within types that are unordered in lisp will have no  
> impact on application code.
> 
> Sorry if any point is unclear, I'm in a hurry.  Please ask clarifying  
> questions if necessary.
> 
> Ian
> 
> PS - This leaves me with a question.  Is it possible in any  
> relational DB to register a sorting fn that you can sort by when  
> doing a sorted query?  Typically rows are sorted by their unique ID  
> field (i.e. the key used in the underlying BTree).
> 
> 
> 
> On Aug 2, 2007, at 5:55 PM, Robert L. Read wrote:
> 
> > Yes, I think I understand this.  However, a costly alternative does  
> > exist:  just never let
> > BDB use its own order.  Always impose one that we can compute in  
> > lisp.  Then in
> > BDB you store a (position,value) pair instead of a value, and  
> > either ensure that BDB
> > sorts on the first part of the binary representation of the  
> > position the way you want it to,
> > or you add a lot of logic into the "cursor-next" operation.  This  
> > is how it is done on the
> > CLSQL side.
> >
> > It is almost certainly a bit slower, and it is certainly a bit  
> > harder to code.
> >
> > It seems to me that the root of the problem is that BDB does indeed  
> > order based on a
> > serialized value.  That is what we should remove.  Certainly, if  
> > someone were to write
> > a Pure-LISP backend, which I hope will occur eventually, it would  
> > seem silly for them to
> > have to respect an artifact inherited from BDB, when part of the  
> > purpose of such a project
> > is to escape dependence on BDB.
> >
> > Forgive me if I'm confused but I assert that we should reverse your  
> > argument: we should
> > force BDB to be isomorphic to a lisp sorter, not build a lisp  
> > sorter isomorphic to BDB.
> >
> >
> > On Tue, 2007-07-31 at 15:24 -0400, Ian Eslick wrote:
> > >> The practical problem that led to the current design of index
> > >> sorting is that we cannot use lisp code to define the sorting
> > >> function for serialized values inside BDB Btrees (same problem I
> > >> imagine that Henrik had with postmodern).  Instead, there is a
> > >> hairy custom C procedure that is registered with BDB that parses
> > >> the serialized format so that sorting is done first by type
> > >> (symbol, string, object, pobject, etc) followed by ordering within
> > >> numeric types, strings and symbols.  Everything else is ordered
> > >> based on the byte ordering of its serialized representation.
> > >>
> > >> To map across indices correctly, we need to know up front whether
> > >> the start value is less than the end value.  And so we need a
> > >> standard lisp function that is isomorphic to the BDB sorting
> > >> function.  Ideally postmodern would have a similar sorting function
> > >> that properly interprets the serialized format just like the BDB
> > >> function does.  I think it's best to have a single standard
> > >> ordering that is as close to lisp's notion of ordering as possible
> > >> so we don't have to maintain different orderings.
> > >>
> > >> Ian
> > >>
> > >> PS - It might be possible to have a lisp ordering function
> > >> implement BDB's notion of sorting by registering it as a callback,
> > >> however it would have to deserialize the BDB values each time.  So
> > >> the problems with this are both stability concerns for foreign
> > >> callbacks and the performance impact of
> > >> serialization/deserialization for internal BDB operations.  On the
> > >> cleanliness/performance axis, I think the current approach is the
> > >> right tradeoff (it's the original one Ben made, FYI).
> > >>
> > >> On Jul 31, 2007, at 12:50 PM, Robert L. Read wrote:
> > >> > Personally, I think the only sensible way to handle this problem
> > >> > is to require the user to specify an ordering function.  We can
> > >> > of course provide a default, which will be error-prone but tend
> > >> > to work most of the time.
> > >> >
> > >> > The function called "my-generic-less-than" which is in the
> > >> > source tree now could be a starting point for a generic ordering.
> > >> >
> > >> > On Tue, 2007-07-24 at 09:48 -0400, Ian Eslick wrote:
> > >> >> Robert and I have had some extended discussions on ordering in
> > >> >> indices.  I think that all we really need to agree on is _some_
> > >> >> canonical ordering.  If we have mixed types in an index, how
> > >> >> should they be ordered relative to each other?  In BDB we have a
> > >> >> C function which implements the ordering based on the type tag
> > >> >> and then based on the type within it.  Are you relying on a pure
> > >> >> binary sort in postmodern?
> > >> >>
> > >> >> Robert or I will get to submitting that patch shortly.  I have
> > >> >> recently sent in a patch to lisp-compare<= so we'll see if we
> > >> >> had to make parallel changes.
> > >> >>
> > >> >> Thanks,
> > >> >> Ian
> > >> >>
> > >> >> On Jul 24, 2007, at 3:50 AM, Henrik Hjelte wrote:
> > >> >> > I sent this message yesterday but I guess it got stuck in the
> > >> >> > mailing list filter.  Perhaps the attachment was too big.
> > >> >> > Since my common-lisp.net user hhjelte does not have write
> > >> >> > access to elephant I have placed the patches from here instead:
> > >> >> > darcs get http://common-lisp.net/project/grand-prix/darcs/elephant
> > >> >> >
> > >> >> > ---------- Forwarded message ----------
> > >> >> > From: Henrik Hjelte <henrik at evahjelte.com>
> > >> >> > Date: Jul 23, 2007 11:28 PM
> > >> >> > Subject: some patches
> > >> >> > To: elephant-devel at common-lisp.net
> > >> >> >
> > >> >> > Here are some darcs patches that might be of interest.  I had
> > >> >> > some problems with map-index on db-postmodern that made me
> > >> >> > almost rip my hair of, but finally I made it to work again.
> > >> >> > The problem is that map-index for a string value rely on the
> > >> >> > ordering in the btree (continue-p makes use of less than for
> > >> >> > strings).  The postmodern backend relies on how the database
> > >> >> > backend orders things, which is not always the same thing.  Is
> > >> >> > it a necessary feature that b-trees of string and objects are
> > >> >> > required to be ordered by lisp-compare<=?
> > >> >> >
> > >> >> > In the process of solving the bug I have upgraded the test
> > >> >> > framework to use FiveAM instead of RT, It has in my opinion a
> > >> >> > very nice syntax and some useful features to track
> > >> >> > dependencies between tests.  I hope you agree that it improves
> > >> >> > on things.
> > >> >> >
> > >> >> > /Henrik Hjelte
> > >> >> >
> > >> >> > _______________________________________________
> > >> >> > elephant-devel site list
> > >> >> > elephant-devel at common-lisp.net
> > >> >> > http://common-lisp.net/mailman/listinfo/elephant-devel