[elephant-devel] Postmodern backend; are values ever removed from the blob table

Alex Mizrahi killerstorm at newmail.ru
Thu Dec 11 10:06:35 UTC 2008


 RS> then the slot definitions in the slot table are removed, but the
 RS> actual slot _values_, which live in the blob table, are retained. Is
 RS> this intended behaviour?

yep, it is. it sort of sucks, i know.

whether or not it becomes a problem depends on what kind of data
you store in slots. blobs get reused (at least they should..), so,
 for example, all slots with value 4 will refer to same blob id,
and you can consider bloated blob table as a cache of some kind in this case 
:).

if you have constant inflow of unique data that becomes a problem, indeed.

 RS> avoid the downtime if at all possible. One thing that did occur to me
 RS> was that I could record the current maximum bid in the blob table,
 RS> then go through each tree table, including root, and also the slots
 RS> table, recording blob id references in each one. Then take the entire
 RS> set of bids from the blob table; anything present in the blob table
 RS> but not referred to in any of the trees or slots table, and with bid
 RS> less than or equal to the maximum bid when the procedure was started,
 RS> should be obsolete, and deleting it should be safe. Does this sound
 RS> reasonable, or am I missing something?

yep, right in general, some details seem to be wrong. first, you do not need
to go through _all_ tables, but only columns that store blob ids. there is 
some
metainformation attached to btrees, key-type and value-type. if key-type is 
:object,
blob ids are in qi column, if value-type is :object, value column contains 
blob ids.

then there are performance considerations -- if you have lots of data, 
allocating
temporary storage for all blob ids might be painful. if you run cleanup 
procedure
more-or-less frequently, there will be more alive bids than dead ones, so 
you
can optimize it ruling out alive bids early. for example, if you work mostly 
with objects,
you can first find the difference between blob and slots tables (because 
most alive
bids would be in slots):

SELECT blob.bid FROM blob LEFT JOIN slots ON blob.bid = slots.value
WHERE slots.value IS NULL

(warning: totally untested). this would return all dead bids and bids that
are referenced from other tables.then you can go through rest of tables,
pulling blob ids from them and eliminating ones you see in that data set,
in the end you end up with only dead ones.

then, you do not actually need recordning maximum bid if you're working
within a single transaction -- postgresql's transactional engine will do 
this
for you. (it works in similar way, btw -- ignores rows that have txn id 
larger
than current txn id). 







More information about the elephant-devel mailing list