[elephant-devel] gp-export strikes back

Thu Feb 4 11:53:46 UTC 2010

 hi

Previous discussion of gp-export did not cover its design in detail. I 
think if we're going to bundle gp-export with elephant, it is better to 
discuss its features and shortcomings, so we can either try to fix them or 
document them.

The basic idea of how it works -- it goes through the objects in the 
database, feeding them to the serializer one by one. The serializer writes 
the objects to the stream.

The serializer here is a somewhat modified version of s-serialization from 
cl-prevalence -- it was made more flexible, so that hooks can change how 
objects are serialized.

Deserialization just works in reverse -- it reads forms from the file and 
creates structures/instantiates objects.

Problems with this approach:
 1) To deal with circular references, or just objects being referenced from 
multiple places, it keeps track of the objects it has already visited, and 
when one of them is encountered again, it writes an object reference 
instead. That is, there is a hash-table which references all the objects and 
structures you're exporting. This might be a bad idea if the database is 
large and does not fit into memory.

In particular, it needs to track references to all objects, structures, 
vectors and conses. It does not need to track object slots and strings, 
though. So, if the database is large because of large pieces of text, it is 
not a problem. But if the database is large because there are lots of 
objects, it is a problem.
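
To make problem 1 concrete, here is a minimal sketch of this kind of 
reference tracking. It is not gp-export's actual code: the function name 
write-tracked and the (:cons N ...), (:vector N ...) and (:ref N) output 
forms are made up for illustration, and it only handles conses and 
non-string vectors. The point is just that the SEEN table gets one entry per 
tracked object and never shrinks during an export.

```lisp
;; Hypothetical sketch, not gp-export's code.
(defun write-tracked (object stream seen)
  "Write OBJECT to STREAM. SEEN is an EQ hash table mapping every
cons and non-string vector visited so far to a numeric id."
  (typecase object
    ((or cons (and vector (not string)))
     (multiple-value-bind (id found) (gethash object seen)
       (if found
           ;; Already written: emit a back-reference instead.
           (format stream "(:ref ~D)" id)
           (let ((new-id (hash-table-count seen)))
             ;; Register before descending, so cycles terminate.
             (setf (gethash object seen) new-id)
             (etypecase object
               (cons
                (format stream "(:cons ~D " new-id)
                (write-tracked (car object) stream seen)
                (write-char #\Space stream)
                (write-tracked (cdr object) stream seen)
                (write-char #\) stream))
               (vector
                (format stream "(:vector ~D" new-id)
                (loop for element across object
                      do (write-char #\Space stream)
                         (write-tracked element stream seen))
                (write-char #\) stream)))))))
    ;; Strings, numbers, symbols etc. are written as-is, untracked.
    (t (prin1 object stream))))
```

Calling it with a fresh (make-hash-table :test 'eq) on a circular list 
terminates, because the second visit to the same cons becomes a (:ref N) 
form. Strings never enter the table, which matches the remark that large 
texts are cheap while large numbers of objects are not.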

 2) Import works by reading objects one by one via CL:READ and then going 
through the s-expressions instantiating stuff. So, obviously, each 
individual object, with everything it references, has to fit in memory in 
its s-expression form. It might be a problem if you have some huge btree. 
In particular, the root is (going to be) exported as a single object, so all 
data stored in the root must fit in memory.
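
The reading side of problem 2 looks roughly like the loop below. This is 
only a sketch under my own naming (map-exported-forms is hypothetical, not 
the importer's real entry point). It shows that only one top-level form is 
held at a time, but that each form is fully materialized by CL:READ before 
anything can be done with it.

```lisp
;; Hypothetical sketch, not gp-export's code.
(defun map-exported-forms (function stream)
  "Call FUNCTION on each top-level form read from STREAM, one at a
time, and return the number of forms processed."
  (let ((eof (list :eof))   ; fresh object, cannot appear in the data
        (*read-eval* nil))  ; refuse #. forms coming from a dump file
    (loop for form = (read stream nil eof)
          until (eq form eof)
          do (funcall function form)
          count t)))
```

A form that has been processed can be garbage collected before the next one 
is read, so the peak memory use is set by the largest single form, which is 
why one huge top-level object is the bad case.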

 3) Serialization has its limitations. That is, if you're using some clever 
objects, it might not work. Basically, it is about as good as cl-prevalence, 
plus it should be able to handle elephant objects without a problem.

These are quite fundamental limitations of this approach. I don't think we 
are going to deal with them in this release; in some future version -- 
maybe. However, there might be some workarounds for issue number 1:

 a) there could be an option to disable reference tracking altogether
 b) another option is to allow feeding the serializer manually. If you know 
you have a large database, you can feed the serializer in small batches, 
resetting its state between them, so object references do not accumulate. 
But then you're responsible for data integrity -- if something is exported 
more than once, it is your own problem. Also, doing it manually, you might 
forget to export something.
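
For illustration, manual feeding as in option b) could look something like 
this. It is a sketch only: export-in-batches, its :batch-size argument and 
the (object stream seen) calling convention for WRITE-FN are assumptions of 
mine, not an existing gp-export interface.

```lisp
;; Hypothetical sketch, not gp-export's code.
(defun export-in-batches (objects stream write-fn &key (batch-size 1000))
  "Call WRITE-FN with (object stream seen) for each of OBJECTS, one
per line. SEEN is an EQ hash table that is emptied after every
BATCH-SIZE objects, so tracked references do not accumulate."
  (loop with seen = (make-hash-table :test 'eq)
        for object in objects
        for n from 1
        do (funcall write-fn object stream seen)
           (terpri stream)
           (when (zerop (mod n batch-size))
             ;; Reset the state: ids from earlier batches are forgotten.
             (clrhash seen))))
```

The cost is visible right away: an object that shows up both before and 
after a reset is written out in full twice, since the serializer no longer 
remembers it. Avoiding that is up to whoever picks the batches.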

I'm planning to include only option b), just to increase flexibility. So if 
you think you need option a) badly, drop me a note.

The current version has even more problems (and I'm going to fix some of 
them):

 I) It does a rather weird thing -- first it writes data to a string, then 
reads this string into s-expressions, and then writes them into a file. 
That's because the initial design did customizations on the s-expression 
level. I've identified this as not flexible enough (and rather cumbersome 
too) and added hooks to the serializer. But there is still a hook which 
allows modifications on the s-expression level. I'm going to remove it 
before release, so it will write directly to a file.

 II) Import and export were very dependent on the elephant backend, to the 
point that you needed to write a piece of code for each backend pair. Now 
I'm using a different approach -- all recognized elephant objects 
(descendants of persistent-collection, basically) are serialized in a 
special way, using the general elephant API rather than backend-specific 
stuff. Later you can import it to a database of any type, as the importer 
will also use only the general elephant API rather than a backend-specific 
one.

 III) The approach described above has a consequence -- if you have some 
clever object which is of class persistent-collection but doesn't work like 
the standard persistent collections, it probably won't work unless you 
explicitly add support for it.

 IV) Transient slots are exported. I think they should not be and I'm going 
to fix this.

 V) Instances are created with make-instance. This works fine for simple 
objects, but some clever objects might suffer. We've already implemented 
this stuff in elephant, so I hope I can make it work with the same semantics 
as it has in elephant.

 VI) This is not really a problem, but a design decision -- objects of type 
btree-index won't be exported; instead they will be recreated from the 
indexed-btree objects.

 VII) The internal structure is rather cumbersome because of the layered 
approach -- the serialization part works mostly independently from the rest, 
with only a few hooks in the places where I needed them. I think if it had 
been designed as a whole it could be more flexible and elegant, but who 
knows... I'm going to keep it as it is for the current release. If we're 
going to deal with the memory usage problems in future, it might make sense 
to redesign the architecture... But I'm not going to look that far.

And one more thing: Henrik says it is a good idea to rename gp-export to 
something else. I dunno, gp-export is not that bad, in my opinion. Maybe it 
is hard to tell what it does from the name, but at least this name is 
recognizable.

Henrik's ideas from README.md:
----
gp-export, lob-dump clob-dump
# lob-dump
Name? gp-export, lob-dump clob-dump
lob-dump is lisp objects dump, a way to export lisp objects to a file and 
restore.
----

I don't like any of these particularly, but if we're going to pick a more 
accurate name, my suggestion is "clod-exim". It is supposed to mean "Common 
Lisp object database export and import". It is important to mention both 
export and import, as some people might think it does only export.
(The dictionary says that a "clod" is "a big clumsy often slow-witted 
person", which kind of reflects the issues I've mentioned above :)