[elephant-devel] db-postmodern: performance considerations
Ian Eslick
eslick at csail.mit.edu
Sat Dec 8 13:48:54 UTC 2007
Hi Alex & all,
1) In general we seem to have data stores that have different sweet
spots and different optimal uses of the API. We should probably come
up with a matrix of features and suggested uses for each of them,
including which types are kept in lisp-sort-order by each data store
and which are not.
We should also have a spec of the common/safe features.
2) In retrospect it was a mistake to expose the cursor API to users,
but Elephant started as a low-level interface to BDB with some
minimal support for persistent objects and has grown from there, so
at the time it made sense. I recommend we deprecate the cursor API
for user-level use, find the use cases people actually care about,
and implement a higher-level API for them (abstract data structures,
a query language, etc.). Then we drop support for the cursor API
(but document what works, for those who want to shoot themselves in
the foot).
The map API is intended to provide a slightly higher-level interface
to cursors: you can traverse a large set while only the current
element 'must' be in memory, so you can reasonably expect GCs to
occur during a single map operation.
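To make that concrete, here is a minimal sketch of the usage pattern
the map API targets (process-value and my-btree are hypothetical
user-level names):

    ;; Visit every entry without materializing the whole btree;
    ;; only the current key/value pair has to stay live.
    (elephant:map-btree
     (lambda (key value)
       (declare (ignore key))
       (process-value value))  ; hypothetical per-element work
     my-btree)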
What prospects are there for having a reasonable map implementation in
postmodern?
3) get-instances-by-range should be implementable for integers and
strings on postmodern, but should signal a datastore-specific error
if a different type is used. You should then be able to do a SQL
query that returns all instances > some value and < another value.
That shouldn't be hard to implement.
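As a sketch, the whole range scan maps onto one indexed SQL query;
the table and column names below are made up, not the actual
db-postmodern schema:

    ;; START and END are Lisp variables in scope; in s-sql, quoted
    ;; symbols name columns while unquoted ones are evaluated.
    (postmodern:query
     (:select 'oid :from 'slot-index
      :where (:and (:>= 'value start)
                   (:<= 'value end))))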
4) btrees are lightweight in BDB but heavy in SQL - it sounds like
psets need a postmodern-specific implementation. You need a table
that implements a many-to-many relation between pset ids and object
ids.
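Something like the following sketch, with hypothetical names -- one
shared table instead of one table per pset:

    ;; Many-to-many relation between psets and their members.
    (postmodern:execute
     "CREATE TABLE pset_members (
        pset_id  integer NOT NULL,
        oid      integer NOT NULL,
        PRIMARY KEY (pset_id, oid))")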
Thanks for all the hard work, Alex!
Cheers,
Ian
PS - Has anyone validated my new default map-index under postmodern,
or is that moot now that you have a specialized version?
On Dec 8, 2007, at 6:08 AM, Alex Mizrahi wrote:
> hello
>
> Unfortunately the rewrite of the pm-cursor implementation is
> postponed for now; it turned out to be more complex than it
> originally seemed, and I do not have test cases anyway.
> However, I've made a patch that improves the performance of
> get-instances-by-value (I've sent it to Henrik, so it should be
> available when he synchronizes with upstream).
>
> So I'll describe the profiling results here -- what's fast and
> what's not with the postmodern backend.
>
> 1. Individual btree operations -- setting and getting a value --
> are converted directly into SQL queries, so they are as fast as a
> query against an indexed SQL table can be. However, there's
> considerable client/server communication overhead (system calls,
> the TCP/IP stack, ...), so making tons of individual sets/gets is
> not very fast.
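> As an illustration, a single get boils down to roughly this kind
> of query (table and column names are made up, not the actual
> db-postmodern schema):
>
>     ;; One client/server round trip per get; the round trip, not
>     ;; the index lookup, dominates the cost.  KEY is a Lisp
>     ;; variable in scope.
>     (postmodern:query
>      (:select 'value :from 'btree-table :where (:= 'key key))
>      :single)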
>
> However, with the postmodern backend it's possible to cache get
> operations on the client.
> Two caching options are available: per-transaction-cache caches
> only within a single transaction; global-sync-cache caches data
> across transactions and keeps it synchronized with other running
> instances -- it tracks changes and invalidates cache entries
> accordingly. The synchronization brings some overhead, so it's
> only a win for certain types of workload (mostly-read).
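> Selecting a cache mode might look roughly like this -- a sketch
> only, since the variable name and the connection spec here are
> assumptions (see pm-cache for the real configuration knob):
>
>     ;; Assumed knob: a special variable chooses the caching
>     ;; strategy before the store is opened.
>     (setf db-postmodern::*cache-mode* :global-sync-cache)
>     (elephant:open-store
>      '(:postmodern (:postgresql "localhost" "mydb" "user" "pass")))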
>
> It would be possible to implement a global cache for a
> single-instance mode that does no synchronization, but this mode
> is not implemented because db-postmodern focuses on safety with
> multiple instances working with the database simultaneously.
>
> get-instances-by-value, after the patches suggested by Alain, uses
> btree get-value, and thus is cached too.
>
> There's a gotcha when using caches with large data sets -- cache
> entries don't get garbage collected, so it's possible to run out
> of memory. It's possible to use a weak hash table in this case
> (see make-backend-cache in pm-cache; there is no configuration
> option for it), but I don't know how good caching will be with it.
> A patch for some smarter solution is welcome.
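> On implementations that support weak hash tables the change is
> roughly a one-liner (SBCL syntax shown; other Lisps spell it
> differently, e.g. :weak :value on Clozure CL):
>
>     ;; Values become collectible once nothing else references
>     ;; them, which bounds the cache's memory use.
>     (make-hash-table :test 'equal :weakness :value)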
>
> 2. get-instances-by-value, with my specialized map-index
> implementation, uses a SQL query to retrieve the data directly, so
> it should be pretty fast. However, it doesn't use caching.
>
> It doesn't use a cursor, so all instances are always returned;
> that may be a problem if there are a lot of instances for some
> key.
>
> 3. Cursors: in general, they do not work very well for now.
>
> First of all, they do not always return values in the order
> specified by the "lisp sorter". Moreover, if your keys are not all
> of the same type (either all integers or all strings) you get a
> random order. (With integers and strings you get an order
> according to SQL comparison rules.) We are not going to fix this
> issue (besides, maybe, allowing NILs to be safely mixed with
> integers or strings), since we believe it's more important to
> support good performance in the practical case (all keys of the
> same type). We could implement an emulation mode that retrieves
> all the data and sorts it on the Lisp side, but it's probably
> better to use the CLSQL backend if "lisp sort" ordering is a
> requirement of an application.
>
> So probably the only thing cursors are good at right now is
> iterating over the whole sequence from start to end.
>
> Iterating over a limited set of values from the start *almost*
> works fine. To do it efficiently on a large table, PostgreSQL must
> be configured to disable hashjoin and mergejoin; otherwise the
> planner thinks it's faster to process the whole table rather than
> doing it incrementally. Processing a table with 10000 items costs
> 60-70 milliseconds. (Additionally, I suspect some small patch to
> db-postmodern is required to build the correct indices for cursors
> to work incrementally.)
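> The planner settings can be flipped per session, e.g.:
>
>     ;; Standard PostgreSQL planner knobs; turning them off pushes
>     ;; the planner toward incremental (nested-loop) plans instead
>     ;; of whole-table joins.
>     (postmodern:execute "SET enable_hashjoin = off")
>     (postmodern:execute "SET enable_mergejoin = off")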
>
> Iterating from the end doesn't work well -- it scans through the
> whole table (probably even twice) due to a bug in the cursor
> implementation. This should be more or less easily fixable,
> though.
>
> cursor-set performance depends on how far the key you search for
> is from the start of the table (!). The counting happens on the
> PostgreSQL side, though, so for a 10000-item table it should take
> about 70 msec in the worst case. Multiple cursor-set calls will
> all be equally slow.
>
> I have some ideas for fixing the cursor implementation so it will
> not depend on the size of the table (to the extent PostgreSQL
> itself does not, of course), but I don't know when I'll have a
> chance to implement them. As I've mentioned, I currently do not
> even have a practical test case where cursor iteration is used.
>
> 4. get-instances-by-range uses cursors, and so it inherits all
> their problems.
>
> 5. Psets: the default implementation of psets creates a btree
> instance for each of them. In db-postmodern each btree has its own
> SQL table, so it's not a good idea to have thousands of psets.
> They might be quite useful in large numbers, though, so we'll
> probably invent something better for them in the future.
>
> with best regards, Alex 'killer_storm' Mizrahi.