[elephant-devel] gp-export strikes back
Alex Mizrahi
killerstorm at newmail.ru
Thu Feb 4 11:53:46 UTC 2010
hi
Previous discussion of gp-export did not cover its design in detail. I
think if we're going to bundle gp-export with elephant, it is better to
discuss its features and shortcomings, so we can either try to fix them or
document them.
The basic idea of how it works: it goes through the objects in the database,
feeding them to the serializer one by one. The serializer writes objects to
the stream. The serializer here is a somewhat modified version of
s-serialization from cl-prevalence -- it was made more flexible, allowing
you to change how objects are serialized via hooks.
Deserialization just works in reverse -- it reads forms from the file and
creates structures/instantiates objects.
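To make the hook idea concrete, here is a minimal sketch in plain Common
Lisp. It is not the gp-export or cl-prevalence code: the generic function
SERIALIZE, the function DESERIALIZE, the POINT class and the (:point x y)
form are all hypothetical names chosen for illustration. It assumes a hook
can be modelled as a method on a generic function, which may not be how the
real serializer does it.

```lisp
;; Hypothetical sketch; none of these names belong to gp-export.
(defgeneric serialize (object stream)
  (:documentation "Write OBJECT to STREAM as a readable s-expression."))

;; Default behaviour: rely on the Lisp printer.
(defmethod serialize ((object t) stream)
  (prin1 object stream))

(defclass point ()
  ((x :initarg :x :reader point-x)
   (y :initarg :y :reader point-y)))

;; The "hook": a class-specific method overrides the default output.
(defmethod serialize ((object point) stream)
  (format stream "(:point ~S ~S)" (point-x object) (point-y object)))

(defun deserialize (form)
  "Reverse direction: turn a form read from the file back into an object."
  (if (and (consp form) (eq (first form) :point))
      (make-instance 'point :x (second form) :y (third form))
      form))
```

A round trip then looks like this:

```lisp
(deserialize
 (read-from-string
  (with-output-to-string (s)
    (serialize (make-instance 'point :x 1 :y 2) s))))
```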
Problems with this approach:
1) To deal with circular references, or just objects being referenced from
multiple places, it keeps track of objects it has already visited, and if
an object is encountered again, it writes an object reference instead. That
is, there is a hash-table which references all objects and structures you're
exporting. This might be a bad idea if the database is large and does not
fit into memory.
In particular, it needs to track references to all objects, structures,
vectors and conses. It does not need to track object slots and strings,
though. So, if the database is large because of large pieces of text, it is
not a problem. But if the database is large because there are lots of
objects, it is a problem.
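The reference tracking described in 1) can be sketched roughly as follows.
This is an illustration only, not the actual serializer: MAKE-SEEN-TABLE,
WRITE-TRACKED and the (:cons id ...), (:vector id ...) and (:ref id) forms
are made-up names, and the sketch handles only conses, vectors and atoms
(anything else signals an error). It does show why memory use grows with the
number of tracked objects and not with the size of strings.

```lisp
;; Hypothetical sketch; names and output format are assumptions.
(defun make-seen-table ()
  "Identity-keyed table mapping already-written objects to integer ids."
  (make-hash-table :test #'eq))

(defun write-tracked (object seen stream)
  "Write OBJECT to STREAM; a repeated cons or vector becomes (:ref id)."
  (cond
    ;; Strings, numbers, symbols and characters are written directly and
    ;; never entered into the table, so large texts cost no tracking memory.
    ((or (stringp object) (numberp object)
         (symbolp object) (characterp object))
     (prin1 object stream))
    ;; Already visited: emit a reference instead of the contents.
    ((gethash object seen)
     (format stream "(:ref ~D)" (gethash object seen)))
    (t
     ;; Register before descending, so a cycle back to OBJECT finds its id.
     (let ((id (1+ (hash-table-count seen))))
       (setf (gethash object seen) id)
       (etypecase object
         (cons
          (format stream "(:cons ~D " id)
          (write-tracked (car object) seen stream)
          (write-char #\Space stream)
          (write-tracked (cdr object) seen stream)
          (write-char #\) stream))
         (vector
          (format stream "(:vector ~D" id)
          (loop for element across object
                do (write-char #\Space stream)
                   (write-tracked element seen stream))
          (write-char #\) stream)))))))
```

For example, a one-element circular list terminates instead of looping
forever:

```lisp
(let ((cell (list 1)))
  (setf (cdr cell) cell)                 ; make the list circular
  (with-output-to-string (s)
    (write-tracked cell (make-seen-table) s)))
;; => "(:cons 1 1 (:ref 1))"
```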
2) Import works by reading objects one by one via CL:READ and then going
through the s-expressions instantiating stuff. So, obviously, each individual
object, with everything it references, should fit in memory in its
s-expression form. It might be a problem if you have some huge btree.
In particular, the root is (going to be) exported as a single object, so all
data stored in the root must fit in memory.
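A rough sketch of such a read loop, using only standard Common Lisp, is
below. MAP-FORMS is a hypothetical name, not part of gp-export. Note that
READ returns only after it has built the complete top-level form, which is
why one very large form (such as the root) has to fit in memory as a whole.

```lisp
;; Hypothetical sketch of a form-by-form import loop.
(defun map-forms (function stream)
  "Call FUNCTION on each top-level form read from STREAM, one at a time."
  (let ((eof (gensym "EOF"))   ; unique marker that READ can never return
        (*read-eval* nil))     ; do not execute #. forms from a dump file
    (loop for form = (read stream nil eof)
          until (eq form eof)
          do (funcall function form))))
```

Usage, with a string standing in for the dump file:

```lisp
(with-input-from-string (in "(:a 1) (:b 2 3)")
  (map-forms (lambda (form) (print form)) in))
```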
3) Serialization has its limitations. That is, if you're using some clever
objects, it might not work. Basically, it is about as good as cl-prevalence,
plus it should be able to handle elephant objects without a problem.
These are quite fundamental limitations of this approach. I don't think we
are going to deal with them in this release; in some future version, maybe.
However, there might be some workarounds for issue number 1:
a) there could be an option to disable reference tracking altogether
b) another option is to allow feeding the serializer manually. If you know
you have a large database, you can feed the serializer in small batches,
resetting its state between them, so object references do not accumulate.
But then you're responsible for data integrity -- if something is exported
more than once, it is your own problem. Also, doing it manually, you might
forget to export something.
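Option b) could look roughly like the sketch below. Everything here is
hypothetical (the EXPORTER structure, EXPORT-OBJECT, RESET-EXPORTER,
EXPORT-IN-BATCHES and the output forms); it is not the planned gp-export
interface, only an illustration of resetting the tracking table between
batches. References are only meaningful within one batch, since ids start
over after each reset.

```lisp
;; Hypothetical sketch; this is not the gp-export interface.
(defstruct exporter
  (seen (make-hash-table :test #'eq))  ; objects written since last reset
  (out *standard-output*))             ; where forms are written

(defun export-object (exporter object)
  "Write OBJECT, or a (:ref id) form if it was seen since the last reset."
  (let* ((seen (exporter-seen exporter))
         (id (gethash object seen)))
    (if id
        (format (exporter-out exporter) "(:ref ~D)~%" id)
        (let ((new-id (1+ (hash-table-count seen))))
          (setf (gethash object seen) new-id)
          (format (exporter-out exporter) "(:object ~D ~S)~%"
                  new-id object)))))

(defun reset-exporter (exporter)
  "Forget all tracked objects so the table does not grow across batches."
  (clrhash (exporter-seen exporter)))

(defun export-in-batches (exporter batches)
  "Export each batch (a list of objects), resetting state between batches."
  (dolist (batch batches)
    (dolist (object batch)
      (export-object exporter object))
    (reset-exporter exporter)))
```

The integrity caveat shows up directly: an object that appears in two
batches is written out in full twice, because the second batch no longer
remembers it.

```lisp
(let ((x (list "a")))
  (export-in-batches (make-exporter) (list (list x x) (list x))))
```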
I'm planning to include only option b), just to increase flexibility. So if
you think you need option a) badly, drop me a note.
The current version has even more problems (and I'm going to fix some of
them):
I) It does a rather weird thing -- it first writes data to a string, then
reads this string into s-expressions, and then writes them into a file. This
is because the initial design did customizations on the s-expression level.
I've identified this as not flexible enough (and rather cumbersome too) and
added hooks to the serializer. But there is still a hook which allows
modifications on the s-expression level. I'm going to remove it before
release, so it will write directly to a file.
II) Import and export were very dependent on the elephant backend, to the
point that you needed to write a piece of code for each backend pair. Now
I'm using a different approach -- all recognized elephant objects
(descendants of persistent-collection, basically) are serialized in a
special way, using the general elephant API rather than backend-specific
stuff. Later you can import it to a database of any type, as the importer
will use only the general elephant API rather than a backend-specific one.
III) The approach described above has a consequence -- if you have some
clever object which is of class persistent-collection but doesn't work like
the standard persistent collections, it probably won't work unless you
explicitly add support for it.
IV) Transient slots are exported. I think they should not be and I'm going
to fix this.
V) Instances are created with make-instance. This works fine for simple
objects, but some clever objects might suffer. We've already implemented
this stuff in elephant, and I hope I can make it work with the same
semantics as it has in elephant.
VI) This is not really a problem, but a design decision -- objects of type
btree-index won't be exported but instead would be recreated from
indexed-btree objects.
VII) Internal structure is rather cumbersome because of the layered
approach -- serialization part works mostly independently from the rest,
with only a few hooks in parts where I needed those hooks. I think if it was
designed as a whole it could be more flexible and elegant, but who knows...
I'm going to keep it as it is for the current release. If we're going to
deal with memory usage problems in the future, it might make sense to
redesign the architecture... But I'm not going to look that far.
And one more thing, Henrik says it is a good idea to rename gp-export to
something else. I dunno, gp-export is not that bad, as for me. Maybe it is
hard to say what it does from the name, but at least this name is
recognizable.
Henrik's ideas from README.md:
----
gp-export, lob-dump clob-dump
# lob-dump
Name? gp-export, lob-dump clob-dump
lob-dump is lisp objects dump, a way to export lisp objects to a file and
restore.
----
I don't like any of these particularly, but if we're going to make a more
accurate name,
my suggestion is "clod-exim". It is supposed to mean "Common Lisp object
database export and import".
It is important to mention both export and import, as some people might
think it does only export.
(Dictionary says that "clod" is "a big clumsy often slow-witted person",
this kind of reflects issues I've mentioned above :)