[elephant-devel] Uses of elephant

Thu Feb 23 18:56:34 UTC 2006

With all the excitement about bugs, changes and such related to the
delayed 0.6.0 release, I thought I'd pass along a report on an
application built on top of 0.6.0-rc1.

My development on elephant has been targeted to support a web-mining
research project in which I am analyzing the presence on the web of a
particular type of data.

To do this I load 700k persistent objects into memory that represent my
dataset of queries.  Each query has an inverse-indexed timestamp slot,
an inverse-indexed query-string slot and some stats variables.  I then
go out and scrape various search engines for references to the strings.
Overnight the system performed 20k web searches and loaded in 70k+ urls,
inverse indexing all the urls and timestamps as it went.  This all went
off without a hitch.  BDB/Allegro and indexing have been very stable
over the past few days and I can now do those wonderful instance range
queries in well under a second (all pages looked up between 7 and 8am
this morning; join those pages with the query db and get all queries
processed this morning between 7am and 8am, etc) to analyze my initial
dataset.
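
For readers who want to see the shape of such a range query and join, here
is a minimal Python sketch.  It is not Elephant's API: the SortedIndex
class, the queries_for_pages_between helper and the dictionary layout are
all hypothetical stand-ins for an inverse-indexed timestamp slot, with a
sorted list playing the role of the btree.

```python
import bisect


class SortedIndex:
    """Toy secondary index: (key, object_id) pairs kept sorted by key.

    Hypothetical stand-in for an inverse-indexed slot; a real store
    would keep these entries in a btree on disk.
    """

    def __init__(self):
        self._entries = []

    def insert(self, key, object_id):
        # Sorted insert, so range scans can start from a binary search.
        bisect.insort(self._entries, (key, object_id))

    def range(self, low, high):
        """Return object ids whose key lies in the half-open range [low, high)."""
        # A 1-tuple sorts before every (key, id) pair that shares its key,
        # so these two searches bracket exactly the keys in [low, high).
        start = bisect.bisect_left(self._entries, (low,))
        end = bisect.bisect_left(self._entries, (high,))
        return [object_id for _, object_id in self._entries[start:end]]


def queries_for_pages_between(page_index, pages, queries, low, high):
    """Join: pages fetched in [low, high) -> the distinct queries behind them."""
    query_ids = {pages[page_id]["query_id"]
                 for page_id in page_index.range(low, high)}
    return sorted(queries[query_id] for query_id in query_ids)
```

The range scan costs two binary searches plus the size of the result,
which is why such queries can stay fast even over a large dataset.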

As a separate part of the project, I've also implemented a simple
full-text inverse index on top of the btree abstraction and have done
some stress testing on 100 pages averaging 10kB each, at a throughput
of 5 pages per second.  I think I can get that up a bit, but for now it's
mostly limited by btree seeks, as I'm using a btree as my sequence store.
Sorted btree inserts aren't the cheapest thing in the world and inverse
indexing is pretty random access.  A cheap set operator (like
AllegroCache) might make such collection building go faster. 
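
To illustrate why inverse indexing is random access on a sorted store,
here is a rough Python sketch.  It assumes postings are stored as
(term, doc_id, position) keys in sorted order, which is one plausible
layout and not necessarily the one used here; the class and function
names are hypothetical, and a sorted list stands in for the btree.

```python
import bisect


def tokenize(text):
    """Naive whitespace tokenizer (an assumption; a real one would do more)."""
    return text.lower().split()


class SortedInvertedIndex:
    """Postings stored as sorted (term, doc_id, position) keys."""

    def __init__(self):
        self._keys = []

    def add_document(self, doc_id, text):
        # Consecutive tokens of one document sort far apart from each other,
        # so every insert lands at an unrelated place in the key space.
        for position, term in enumerate(tokenize(text)):
            bisect.insort(self._keys, (term, doc_id, position))

    def postings(self, term):
        """Return [(doc_id, position), ...] for term, in key order."""
        start = bisect.bisect_left(self._keys, (term,))
        result = []
        for key_term, doc_id, position in self._keys[start:]:
            if key_term != term:
                break
            result.append((doc_id, position))
        return result
```

Lookups are a single seek followed by a sequential scan; it is the
per-token inserts during indexing that scatter across the store.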

Queries on the full-text database are currently limited to simply
optimized phrases with wildcards and the NEAR operator between two
phrases.  With just a little more work and thought about performance,
nested OR/AND/NEAR queries should be possible.  Later this week I hope
to scrape the URL collection and inverse index all the resulting pages,
which will be a true stress test of the whole system (hundreds of
thousands of full-text indexed web pages, etc).
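
As a sketch of how phrase and NEAR queries can be answered from a
positional inverse index, consider the Python below.  It is only an
illustration under assumptions: the function names and the in-memory
dictionary layout are hypothetical, wildcards are left out, and NEAR is
taken to mean that the two phrases start within max_distance tokens of
each other.

```python
from collections import defaultdict


def build_index(docs):
    """Map term -> doc_id -> list of token positions (in increasing order)."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term][doc_id].append(position)
    return index


def phrase_starts(index, phrase):
    """Return {doc_id: [start positions]} where the phrase occurs verbatim."""
    terms = phrase.lower().split()
    if not terms:
        return {}
    hits = {}
    for doc_id, starts in index.get(terms[0], {}).items():
        # A start matches if every later term sits at the expected offset.
        matched = [
            start for start in starts
            if all(start + offset in index.get(term, {}).get(doc_id, ())
                   for offset, term in enumerate(terms[1:], start=1))
        ]
        if matched:
            hits[doc_id] = matched
    return hits


def near(index, phrase_a, phrase_b, max_distance):
    """Doc ids where the two phrases start within max_distance tokens."""
    hits_a = phrase_starts(index, phrase_a)
    hits_b = phrase_starts(index, phrase_b)
    return sorted(
        doc_id for doc_id in hits_a.keys() & hits_b.keys()
        if any(abs(a - b) <= max_distance
               for a in hits_a[doc_id] for b in hits_b[doc_id])
    )
```

Nested OR/AND queries would then reduce to set union and intersection
over the document ids each sub-query returns.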

If someone is really interested in the full-text indexing I think it
only depends on my natural language toolkit and document representation
model which I've released separately.  It might make a nice project to
(over time;) add it in as a contrib or module on top of elephant and
speed it up a bit at the same time.

Regards,
Ian