[elephant-devel] Uses of elephant
Ian Eslick
eslick at csail.mit.edu
Thu Feb 23 18:56:34 UTC 2006
With all the excitement regarding bugs, changes and such related to
the delayed 0.6.0 release, I thought I'd pass along a report on an
application built on top of 0.6.0-rc1.
My development on elephant has been targeted to support a web-mining
research project in which I am analyzing the presence on the web of a
particular type of data.
To do this I load 700k persistent objects into memory that represent my
dataset of queries. Each query has an inverse-indexed timestamp slot,
an inverse-indexed query-string slot and some stats variables. I then
go out and scrape various search engines for references to the strings.
Overnight the system performed 20k web searches and loaded 70k+ URLs,
inverse indexing all the URLs and timestamps along the way. This all went
off without a hitch. BDB/Allegro and indexing have been very stable
over the past few days and I can now do those wonderful instance range
queries in well under a second (all pages looked up between 7am and 8am
this morning; join those pages against the query db and get all queries
processed this morning between 7am and 8am, etc.) to analyze my initial
dataset.
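For anyone who wants to picture the range-query-plus-join pattern described above, here is a minimal sketch in Python. It is not elephant's API: SlotIndex, hour, and the sample pages and queries are all hypothetical names, and a sorted in-memory list stands in for the btree that backs an indexed slot.

```python
import bisect

class SlotIndex:
    """Sorted (key, object_id) pairs; a stand-in for a btree-backed slot index."""

    def __init__(self):
        self._entries = []  # kept sorted by (key, object_id)

    def insert(self, key, object_id):
        bisect.insort(self._entries, (key, object_id))

    def range(self, low, high):
        """Return object ids whose key satisfies low <= key < high."""
        # A 1-tuple (k,) sorts before every (k, object_id), so these two
        # lookups find the first entry with key >= low and key >= high.
        start = bisect.bisect_left(self._entries, (low,))
        end = bisect.bisect_left(self._entries, (high,))
        return [object_id for _, object_id in self._entries[start:end]]

def hour(h, m=0):
    """Seconds since midnight; a toy timestamp for the example."""
    return h * 3600 + m * 60

# Hypothetical data: query_id -> query string, page_id -> (fetch time, query_id).
queries = {1: "lisp btree", 2: "persistent objects", 3: "web mining"}
pages = {
    "p1": (hour(6, 30), 1),
    "p2": (hour(7, 5), 2),
    "p3": (hour(7, 45), 2),
    "p4": (hour(7, 59), 3),
    "p5": (hour(8, 0), 1),
}

page_time_index = SlotIndex()
for page_id, (fetched_at, _) in pages.items():
    page_time_index.insert(fetched_at, page_id)

# Range query: all pages looked up between 7am and 8am.
morning_pages = page_time_index.range(hour(7), hour(8))

# Join: follow each page back to the query that produced it.
morning_queries = sorted({queries[pages[p][1]] for p in morning_pages})
```

The range lookup touches only the matching slice of the index, which is why this kind of query stays fast as the collection grows.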
As a separate part of the project, I've also implemented a simple
full-text inverse index on top of the btree abstraction and have done
some stress testing on 100 pages averaging 10kB each, at a throughput of
5 pages per second. I think I can get that up a bit, but for now it's
mostly btree seek limited as I'm using a btree as my sequence store.
Sorted btree inserts aren't the cheapest thing in the world and inverse
indexing is pretty random access. A cheap set operator (like the one in
AllegroCache) might make such collection building go faster.
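To illustrate the trade-off, here is a rough Python sketch of the two ways of collecting postings. It assumes nothing about elephant or AllegroCache internals: build_sorted, build_with_sets, and the sample documents are hypothetical, and plain lists and sets stand in for the persistent collections.

```python
import bisect
from collections import defaultdict

def build_sorted(docs):
    """Posting lists kept in order on every insert (the sequence-store style)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings = index[term]
            # Each insert has to seek to the right slot before writing.
            pos = bisect.bisect_left(postings, doc_id)
            if pos == len(postings) or postings[pos] != doc_id:
                postings.insert(pos, doc_id)
    return index

def build_with_sets(docs):
    """Posting sets: membership only, with ordering deferred to query time."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

# Documents arrive in no particular order, as they would from a crawl.
docs = {3: "persistent lisp objects", 1: "lisp btree index", 2: "btree seek cost"}

sorted_index = build_sorted(docs)
set_index = build_with_sets(docs)
```

Both builds end up with the same postings per term; the set version just skips the per-insert seek and sorts only if a query needs ordered results.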
Queries on the full text database are limited to simply optimized
phrases w/ wildcards and the NEAR operator between two phrases. With
just a little more work & thinking about performance, nested OR/AND/NEAR
queries should be possible. Later this week I hope to scrape the URL
collection and inverse index all the resulting pages, which will be a
true stress test of the whole system (hundreds of thousands of full-text
indexed web pages, etc).
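As a rough picture of how phrase, wildcard and NEAR queries can sit on top of a positional inverse index, here is a small Python sketch. It is an illustration under stated assumptions, not the actual implementation: build_index, term_positions, phrase_starts, and near are hypothetical names, tokenization is plain whitespace splitting, and NEAR is taken to mean "phrase start positions within a given number of words".

```python
from collections import defaultdict
from fnmatch import fnmatchcase

def build_index(docs):
    """Build term -> doc_id -> [word positions]."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term][doc_id].append(pos)
    return index

def term_positions(index, pattern, doc_id):
    """Positions in doc_id of every indexed term matching pattern ('*' wildcard)."""
    hits = set()
    for term, by_doc in index.items():
        if fnmatchcase(term, pattern) and doc_id in by_doc:
            hits.update(by_doc[doc_id])
    return hits

def phrase_starts(index, phrase, doc_id):
    """Start positions where the words of phrase occur consecutively in doc_id."""
    words = phrase.lower().split()
    starts = term_positions(index, words[0], doc_id)
    for offset, word in enumerate(words[1:], start=1):
        following = term_positions(index, word, doc_id)
        starts = {s for s in starts if s + offset in following}
    return starts

def near(index, phrase_a, phrase_b, doc_ids, distance):
    """Doc ids where phrase_a starts within `distance` words of phrase_b."""
    result = []
    for doc_id in doc_ids:
        a = phrase_starts(index, phrase_a, doc_id)
        b = phrase_starts(index, phrase_b, doc_id)
        if any(abs(x - y) <= distance for x in a for y in b):
            result.append(doc_id)
    return result

# Hypothetical sample documents.
docs = {
    1: "elephant stores persistent objects in a btree",
    2: "the btree index makes range queries fast",
    3: "persistent object stores and full text indexing",
}
index = build_index(docs)

phrase_docs = [d for d in docs if phrase_starts(index, "persistent object*", d)]
near_docs = near(index, "persistent object*", "btree", docs, distance=4)
```

Nested OR/AND/NEAR queries would then amount to combining the per-document results of these primitives, for example by taking unions and intersections of the returned doc id lists.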
If someone is really interested in the full-text indexing, I think it
only depends on my natural language toolkit and document representation
model, which I've released separately. It might make a nice project to
(over time;) add it in as a contrib or module on top of elephant and
speed it up a bit at the same time.
Regards,
Ian