[elephant-devel] Lisp Btrees: design considerations

Robert L. Read read at robertlread.net
Fri May 16 03:06:17 UTC 2008


On Thu, 2008-05-15 at 10:39 +0300, Alex Mizrahi wrote:
>  RLR> The purpose of compression is of course NOT to make the store smaller.
>  RLR> It is to decrease the amount of I/O that is required to access the
>  RLR> store.
> 
> the time to do I/O does not depend linearly on the amount of I/O.
> 
> for example, HDD reads data in blocks, with each block definitely not less 
> than 512 bytes (most likely larger due to HDD doing read-aheads).
> if you compress your data 10 times, down to 50 bytes, it won't make the
> I/O any faster.

You are correct, if you mean the time to perform a single "point query"
--- that is, looking up a value by a single key.  Of course,
compressing it won't hurt.
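
(By a "point query" I mean a single keyed lookup --- in Elephant terms
something like the following, assuming I recall the accessor's argument
order correctly; the key and btree names are placeholders:

(ele:get-value the-key my-btree)  ; one key in, one value out

Only a handful of pages are touched on the way down the btree, so
compression buys little there.)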

> 
> moreover, random reads on an HDD are orders (!) of magnitude slower than
> linear reads -- the time to reposition the device head is larger than the
> time to read consecutive blocks.
> 
> i can show this in numbers, just checked on iometer, doing random reads:
> 512 byte blocks: 106 operations per second (~10 ms seek time), 0.05 MiB/sec
> 64 KiB   blocks:  95 operations per second (~10 ms seek time), 5.95 MiB/sec
> 
> you see, we have 120 times (!) more I/O, but very little difference in the
> number of operations per second.
> so compression will only matter to people who store at least a hundred KiB
> as their values.
> 
> or it will matter if people are doing lots of linear reads -- but it's hard 
> for me to imagine such database access patterns.

I think linear reads are quite common.  They are, after all, why we
have implemented range queries.  Moreover, reading all the data into
memory is quite common if you are using an "Object Prevalence" model.
In a relational database the query optimizer has a very accurate model
of the number of pages any given query will have to read, and it
searches the space of all possible plans for the best one.  In many
cases the best plan is indeed a linear scan; for example, many join
operations are best performed via a linear scan of at least one of the
tables in question.  The aggregate functions (sum, min, max, avg,
count) are another example.  Yet another typical example is "access
all page views of my website from Thursday to Friday".

> 
>  RLR> there would be some databases for which it might be CPU-bound.  It is
>  RLR> hard for me to imagine this is true of most databases.
> 
> then it's worth defining what "most databases" means in the case of
> elephant -- optimization without a concrete target makes no sense.

Yes, I agree.  I'm familiar with what you might call "enterprise"
applications --- large database-backed websites, with maybe a few
million rows and a total size of only tens of gigabytes.  For such
applications I think my assumptions hold; they are the same
assumptions I dealt with in my dissertation.  In fact my entire
professional career could be summarized as applying caching to more
and more objects within an enterprise to avoid I/O costs.  Perhaps you
could post a description of your database usage pattern, to explain
why it is dominated by point queries.

> 
> IMHO most people will be working with databases smaller than a few
> gigabytes -- so read I/O doesn't matter for them (DDR reads are on the
> scale of gigabytes per second); write I/O might matter, though.

Yes, write I/O does matter.  I am assuming this will be done in fairly
large pages---8K at one time was the standard; I'm not sure what is
favored now.

However, even in a 4 gigabyte or smaller database the read time really
matters.

> 
>  RLR> If one imagines an integer-valued slot with a Poisson distribution of
>  RLR> values, for example having an average value of less than a thousand
>  RLR> but still occasionally containing a much larger number, then one can
>  RLR> expect a simple codec that employs this fact to use about 10 bits for
>  RLR> each value, and 30 bits for the outliers.  The result is again a
>  RLR> 3-fold reduction in I/O.
> 
> bits, huh? what storage nowadays works on bit level?

You are quite correct that most use of storage happens at the byte and
word level --- I am arguing that this is an opportunity for us.  Of
course, the page will still be read and written as an 8K chunk of
bytes; but that chunk will contain carefully packed bit fields that do
not respect byte boundaries.  The result is that more objects fit into
an 8K page, which tends to decrease I/O proportionally.  The byte is
the smallest object that can be written to a file, but that is
irrelevant to how data is laid out within a block of bytes.
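
To illustrate bit fields that ignore byte boundaries, here is a toy
Common Lisp version of the Poisson-slot codec I described above,
written with the standard bit-field operators LDB and DPB.  The 512
cutoff and the 9/29-bit widths are just one plausible choice, and a
real implementation would pack into the page's byte vector rather than
a bignum:

(defun pack-values (values)
  "Encode VALUES (each < 2^29) into a bignum used as a bit buffer.
Values below 512 cost 10 bits (a 0 tag plus 9 data bits); outliers
cost 30 (a 1 tag plus 29 data bits).  Returns the buffer and its
length in bits."
  (let ((buf 0) (pos 0))
    (dolist (v values (values buf pos))
      (cond ((< v 512)
             (setf buf (dpb v (byte 9 (1+ pos)) buf)) ; tag bit stays 0
             (incf pos 10))
            (t
             (setf buf (dpb 1 (byte 1 pos) buf))      ; tag bit = 1
             (setf buf (dpb v (byte 29 (1+ pos)) buf))
             (incf pos 30))))))

(defun unpack-values (buf bit-length)
  "Decode the stream produced by PACK-VALUES."
  (let ((pos 0) (result '()))
    (loop while (< pos bit-length)
          do (if (zerop (ldb (byte 1 pos) buf))       ; small value?
                 (progn (push (ldb (byte 9 (1+ pos)) buf) result)
                        (incf pos 10))
                 (progn (push (ldb (byte 29 (1+ pos)) buf) result)
                        (incf pos 30))))
    (nreverse result)))

;; (multiple-value-call #'unpack-values (pack-values '(12 431 70000)))
;; => (12 431 70000), in 10 + 10 + 30 = 50 bits rather than 3 x 32 = 96.

At roughly 11 bits per value on average instead of 32, about three
times as many such slots fit in each 8K page --- which is where the
3-fold I/O reduction comes from.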

> even cache lines are 32 bytes wide, so it makes no difference whether you
> read 1 byte or 32 of them.

Correct -- you read 8K at a time, and hope that because you are getting
300 objects in one page instead of 100, you don't have to read the next
two pages; and that, because you have fewer pages to cache, the chance
that any page you need will already be resident in your cache is much
higher.
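
The arithmetic behind that hope, with hypothetical round numbers (a
million objects, a fixed cache of 2000 pages):

;; Pages needed to hold N-OBJECTS at a given packing density.
(defun pages-needed (n-objects objects-per-page)
  (ceiling n-objects objects-per-page))

;; (pages-needed 1000000 100) => 10000 8K pages (~80 MB)
;; (pages-needed 1000000 300) =>  3334 8K pages (~27 MB)
;;
;; A 2000-page cache then covers 60% of the file instead of 20%, which
;; under a uniform access pattern is also the cache hit rate.
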
> 
>  RLR> Object pointers may of course cross the entire space in theory, but
>  RLR> OIDs that we store in the database may indeed come from a set of quite
>  RLR> knowable size, allowing us to use less than 64 bits or 32 bits to
>  RLR> represent them.  If we know the type of a slot to be only OIDs of a
>  RLR> particular class, we may very well know that that class has only 2**20
>  RLR> members, and therefore demands only 20 bits, as opposed to 64, to
>  RLR> represent the entire space of possible OIDs.
> 
> if you expect the database to be large enough for I/O to matter, why then
> do you think there would be so few objects?

I don't think a database can be so small that I/O doesn't matter.  A
database can certainly be so small that performance doesn't matter ---
but wherever performance matters, I/O matters.
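
Common Lisp even makes the bit-counting for the 2**20 example trivial;
OID-BITS here is just an illustrative helper, not anything in Elephant:

(defun oid-bits (max-oid)
  "Bits needed to represent OIDs drawn from [0, MAX-OID)."
  (integer-length (1- max-oid)))

;; (oid-bits (expt 2 20)) => 20
;; Three 20-bit OID slots fit in the 64 bits that one full-word
;; pointer occupies today.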

I think one of our differences is that you are looking at Elephant as
it is, while I am imagining Elephant with a caching system, and perhaps
a file system that has a cache; and I am not neglecting the relatively
long time it takes a CPU to retrieve a byte from memory that is not
already in the CPU cache.
