[elephant-cvs] CVS elephant/doc

ieslick ieslick at common-lisp.net
Fri Apr 6 02:51:47 UTC 2007


Update of /project/elephant/cvsroot/elephant/doc
In directory clnet:/tmp/cvs-serv21893/doc

Modified Files:
	user-guide.texinfo 
Log Message:
Trial pset abstraction; fix for debug serialize of complex and more documentation edits

--- /project/elephant/cvsroot/elephant/doc/user-guide.texinfo	2007/04/04 15:28:28	1.9
+++ /project/elephant/cvsroot/elephant/doc/user-guide.texinfo	2007/04/06 02:51:47	1.10
@@ -89,41 +89,33 @@
 When you finish your application, @code{close-store} will close the
 store controller.  Failing to do this properly may lead to a need to
 run recovery on the data store during the next session.  Again, see
-the relevant data store sections for more details.
+the relevant data store sections for more detail.
+
 
 @node Serialization details
 @comment node-name, next, previous, up
 @section Serialization details
 
-This section captures the details of how various types of objects are
-serialized and some considerations to keep in mind when storing lisp
-objects.
-
-The high level factors that you need to keep in mind are:
-
-@itemize
-@item Circular References: 
-The serializer properly handles circular references to/from objects
-such as cons cells, standard objects, arrays, etc.  It accomplishes
-this by assigning an ID to any non-atomic object and keeping a mapping
-between previously serialized objects and these ids.
-@end itemize
+There are consequences to trying to move values from lisp memory onto
+disk in order to persist them.  The first consequence is that
+pointers cannot be guaranteed to be valid and so references to lisp
+objects cannot be maintained.  This is very similar to the problems
+with passing references in foreign function interfaces.  The second,
+and more frustrating limitation is that lisp operations that commit
+side effects on aggregate objects, such as objects, arrays, etc,
+cannot be trapped and replicated on the disk representation.  This
+leads up to a very important consequence: all lisp objects are stored
+by @emph{value}.  This policy has a number of consequences which are
+detailed below.
 
-Here is an introduction to 
-
-@itemize
-@item 
-@end itemize
-
-We will also review and add to the considerations outlined in the tutorial:
+@subsection{Restrictions of Store-by-Value}
 
 @enumerate
-
-
-@item @strong{Lisp identity can't be preserved}.  Since this is a store which
-persists across invocations of Lisp, this probably doesn't even make
-sense.  However if you get an object from the index, store it to a
-lisp variable, then get it again - they will not be eq:
+@item @strong{Lisp identity can't be preserved}.  
+      Since this is a store which persists across invocations of Lisp,
+this probably doesn't even make sense.  However if you get an object
+from the index, store it to a lisp variable, then get it again - they
+will not be eq:
 
 @lisp
 (setq foo (cons nil nil))
@@ -137,17 +129,35 @@
 => NIL
 @end lisp
 
-@item @strong{Nested aggregates are stored in one buffer}.  
+@item @strong{Nested aggregates are serialized recursively into a single buffer}.  
 If you store a set of objects in a hash table and then try to store the hash
 table, all of those objects will get stored in one large binary buffer
-with the hash keys.  This is true for all other aggregates that can
-store type T (cons, array, standard object, etc).
+with the hash keys.  This is true for all aggregates that can store
+type T (cons, array, standard object, etc).
 
 @item @strong{Circular References}.
-The serializer properly handles circular references to/from objects
-such as cons cells, standard objects, arrays, etc.  It accomplishes
+One benefit provided by the serializer is that the recursive
+serialization process does not lead to infinite loops when it
+encounters circular references among aggregate types.  It accomplishes
 this by assigning an ID to any non-atomic object and keeping a mapping
-between previously serialized objects and these ids.
+between previously serialized objects and these ids.  This same
+mapping is used to reconstruct references in lisp memory on
+deserialization such that the original structure is properly
+reproduced.
+
+@item @strong{Storage limitations}.
+The serializer writes sequentially into a contiguous foreign byte
+array before passing that array to a given data store's API.  There
+are practical limits to the size of the foreign buffer that lisp can
+allocate (usually somewhere on the order of 10-100MB due to address
+space fragmentation).  Moreover, most data stores will have a
+practical limit to the size of a transaction or the size of key or
+value they will store.  Either of these considerations should
+encourage you to plan to limit the size of objects that you serialize
+to disk.  A good rule of thumb is to stay under a handful of
+megabytes.  We have successfully serialized arrays over 100MB in the
+past, but have not tested the robustness of these large values over
+time.
 
 @item @strong{Mutated substructure does not persist}.
 
@@ -163,15 +173,6 @@
 elephant does not automatically provide persistent collections.  If you 
 want to persist every access, you have to use BTrees (@pxref{Using BTrees}).
 
-@item @strong{Storage limitations}.
-The serializer writes sequentially into a foreign memory byte array
-before passing that array to a given data store's API.  There are
-practical limits to the size of this buffer.  Moreover, in most data
-stores there is a practical limit to the size of a transaction.
-Either of these considerations should encourage you to plan to limit
-the size of objects that you serialize to disk.  A good rule of thumb
-is to stay under a megabyte.
-
 @item @strong{Serialization and deserialization can be costly}. While 
 serialization is pretty fast, it is still expensive to store large
 objects wholesale.  Also, since object identity is impossible to
@@ -185,8 +186,147 @@
 This is the common read-modify-write problem in all databases.  We will talk
 more about this in the @ref{Transactions} section.
 
+@item @strong{Byte Ordering}.  
+      The primitive elements such as integers are written to disk in
+the native byte ordering of the machine on which the lisp runs.  This
+means that little endian machines cannot read values written by big
+endian machines and vice versa. 
+
+@item @strong{Unicode codes and Serialized Strings}.
+      The characters and strings stored to disk can store and recover
+lisp character codes that implement unicode, but the character maps
+are the lisp character maps (produced by @code{char-code}) and not
+strict unicode codes so lisps may not be able to interoperably read
+characters unless they have identical character code maps for the
+character sets you are reading and writing.  All standard ASCII
+strings should be portable.  Here is what we know about specific
+lisps, but this should not be taken as gospel.
+@itemize 
+@item SBCL: In versions with the :sb-unicode feature (after 0.8.17) @code{char-code}
+            produces proper Unicode codes
+@item Allegro: In the international version, @code{char-code} produces proper Unicode codes for codes < 2^16
+@item OpenMCL: OpenMCL 1.1 supports unicode, we are unsure about earlier versions
+@item Lispworks: Lispworks 5 does not, to our knowledge, produce proper Unicode characters.  
+(@emph{This can be fixed on request iff users ask for it and are willing to pay the performance hit})
+@end itemize
+
 @end enumerate
 
+@subsection{Atomic Types}
+
+Atomic types have no recursive substructure.  That is they cannot
+contain arbitrary objects and are of a bounded size.  (Bignums are an
+exception, but they have a predictable structure and cannot reference
+or otherwise encapsulate other objects).  The following is a list of
+atoms and a discussion of how they are serialized.
+
+@itemize
+@item @strong{@code{nil}}: 
+      nil has its own special tag in the serializer so it is easily
+identifiable.  @code{nil} is an awkward value as it is also a boolean.
+The boolean value @code{t} is stored as the symbol 'T.
+@item @strong{fixnums}:
+      The serializer will store both 32-bit and 64-bit fixnums.  Both
+types of fixnums are readable by a 32-bit or 64-bit lisp, but 64-bit
+fixnums are only written if the underlying lisp supports fixnums
+between 32 and 64 bits.
+@item @strong{bignums}:
+      Bignums are broken into a sequence of fixnum-sized chunks and
+assembled by masking words onto the bignum.  This is awfully
+expensive, but it's always correct and fully portable.
+@item @strong{small-float}:
+      Supported only on Lispworks 5 where type @code{small-float} is
+not equivalent to type @code{single-float} as it is on all other
+supported platforms.  Written to disk and deserialized as a single
+float so any memory footprint savings of @code{small-float} is lost.
+@item @strong{single-float}:
+      32-bit floating point numbers
+@item @strong{double-float}:
+      64-bit floating point numbers
+@item @strong{rational}:
+      A rational is merely a ratio of two integers stored as fixnums or bignums.
+@item @strong{complex}:
+      A complex is a pair of floating point values, rationals or integers.
+@item @strong{char}:
+      Standalone chars are represented by their char-code and are
+stored in 32-bit format so that codes from any lisp are stored correctly.
+@item @strong{strings}:
+      Strings can be represented as 8, 16 or 32 bit sequences
+depending on the character sizes used in the underlying lisp.  Because
+strings can be such a large percentage of on-disk space, Elephant uses
+a peculiar method of encoding strings.  Strings are converted from
+their in-memory representation using @code{char-code}.  The size of
+the first character dictates the word width used for encoding.  If a
+character violates the word width, the string encoding is aborted and
+the next larger width is chosen.  The rationale here is that many
+strings consist of Latin characters with codes less than 256.  Strings
+stored in other character sets tend to all be of codes > 256.
+Therefore it is likely that the first character will properly
+determine the word size of the string.  (@emph{On request, we can easily make
+a configuration option to fix the word width for encoding})
+@item @strong{pathname}:
+      A pathname is merely the @code{namestring} of the path object
+stored as a string.  The path object is reconstructed from the
+namestring using @code{parse-namestring} during deserialization.
+@item @strong{symbol}: 
+      Symbols are stored as two strings, the package name and the symbol name in that package.  When deserialized, the target package is searched for and the symbol is interned in that package.
+@end itemize
+
+@subsection{Aggregate Types}
+
+The next list covers @emph{aggregate} types, meaning that elements of
+that type can contain references to elements of type @code{T}.  That
+means, in theory, that storing an aggregate type to disk that refers
+to other objects can copy every reachable object!  This is a direct
+and dire consequence of the ``store-by-value'' restriction.
+(@xref{Persistent Classes and Objects}, for how to design around the
+store-by-value restriction).
+
+This list describes how aggregates are handled by the serializer.
+
+@itemize
+@item @strong{cons}:
+      Cons is simply stored as a cons record containing two nested
+elements.  Linear lists are not treated specially (i.e. no cdr-coding)
+by the serializer.
+@item @strong{array}:
+      Arrays are stored as sequences of nested, serialized elements.
+The array parameters are also stored so that arrays with fill
+pointers and adjustable arrays can be stored and reconstructed.  The only
+arrays that cannot be reproduced are displaced arrays, which are
+copied by value and reconstructed as standard arrays during
+deserialization.
+@item @strong{hash-table}:
+      Hash tables are stored as a sequence of key-value pairs, where
+the key and value can be any serializable value.  On deserialization,
+the reconstructed key and value quantities are written incrementally
+into the hash table.  The hash table does remember its test, rehash
+size and threshold and its total count.  The final size of the new
+hash table is set to @code{(* (/ size rehash-threshold) rehash-size)}.
+@item @strong{struct}:
+      Structure objects are serialized using the metaprotocol.  Each
+slot where the value is bound is serialized by serializing the slot
+name and the value in sequence.  The underlying lisp must support the
+@code{struct-constructor} method so that a new, empty instance of the
+structure can be created and then populated by the stored keys and
+values.
+@item @strong{object}:
+      Instances of subclasses of standard-object are stored almost
+identically to structs.  The type of the object is stored and the
+object slots with bound values are serialized as slotname-value pairs.
+To read an object of this type, the lisp image must have the class
+defined and it must have at least the slots that are stored on disk.
+There is no good method for schema evolution (redefining objects to
+have fewer slots) of ordinary classes.
+@end itemize
+
+
+One final strategic consideration is whether you plan on sharing the binary
+database between machines or between different lisp platforms on the
+same machine.  This is almost possible today, but there are some
+restrictions.  In the section @ref{Repository Migration and Upgrade}
+we will discuss possible ways of migrating an existing database across
+platforms and lisps.
 
 @node Persistent Classes and Objects
 @comment node-name, next, previous, up
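
The "Circular References" item in the diff above says the serializer
assigns an ID to each non-atomic object and keeps a mapping from
already-serialized objects to those IDs.  The following is only a minimal
sketch of that idea in portable Common Lisp, restricted to cons cells.
The function name flatten-conses and the (:cons id car cdr) / (:ref id)
output shapes are made up for illustration; they are not Elephant's
actual serializer or on-disk format.

```lisp
;; Hypothetical illustration only: not Elephant's serializer code.
(defun flatten-conses (object)
  "Describe OBJECT as a nested list.  Each cons is given an id the first
time it is visited; later visits emit (:ref id) instead of recursing."
  (let ((ids (make-hash-table :test 'eq)) ; identity-based map: cons -> id
        (next-id 0))
    (labels ((walk (x)
               (cond ((not (consp x)) x)  ; atoms are emitted directly
                     ((gethash x ids)     ; already seen: back-reference
                      (list :ref (gethash x ids)))
                     (t
                      ;; Register the id BEFORE recursing, so a cycle
                      ;; that leads back here finds the entry.
                      (let ((id (incf next-id)))
                        (setf (gethash x ids) id)
                        (list :cons id (walk (car x)) (walk (cdr x))))))))
      (walk object))))

;; A one-element list whose cdr points back at itself.
(let ((cell (list 1)))
  (setf (cdr cell) cell)
  (flatten-conses cell))
;; => (:CONS 1 1 (:REF 1))
```

The same id table, read in the other direction, is what would let a
deserializer rebuild the shared and circular links in memory.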

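The "strings" item in the diff above describes picking the encoding word
width from the first character and moving to the next larger width when a
later character does not fit.  A rough sketch of that selection rule in
portable Common Lisp follows.  The function names are hypothetical, and
the sketch checks the whole string per candidate width rather than
aborting a partially written encoding, so it only approximates the
behaviour described.

```lisp
;; Hypothetical helpers: an approximation of the width-selection rule,
;; not Elephant's actual string encoder.
(defun width-for-code (code)
  "Smallest word width in bits (8, 16 or 32) that can hold CODE."
  (cond ((< code #x100) 8)
        ((< code #x10000) 16)
        (t 32)))

(defun fits-width-p (string width)
  "True if every char-code in STRING is below 2^WIDTH.
EVERY stops at the first violation, much like aborting the encoding."
  (let ((limit (expt 2 width)))
    (every (lambda (ch) (< (char-code ch) limit)) string)))

(defun choose-string-width (string)
  "Start from the width suggested by the first character and widen
until the whole string fits."
  (if (zerop (length string))
      8
      (let ((first-guess (width-for-code (char-code (char string 0)))))
        ;; MEMBER returns the tail of the list starting at FIRST-GUESS,
        ;; so only that width and larger ones are tried.
        (loop for width in (member first-guess '(8 16 32))
              when (fits-width-p string width)
                return width))))

(choose-string-width "elephant")
;; => 8
```

On a lisp with Unicode character codes, a string that starts with an
ASCII character but contains one with a code of 256 or more would fail
the 8-bit attempt and be retried at 16 bits.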

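The "Byte Ordering" item in the diff above notes that primitive values
are written in the machine's native byte order, so data written on a
big-endian machine cannot be read on a little-endian one.  The sketch
below shows why, using plain Common Lisp integers and lists of octets.
The function names and the :order keyword are invented for this
illustration and say nothing about how Elephant lays out its buffers.

```lisp
;; Hypothetical illustration of byte-order mismatch.
(defun integer-to-bytes (value nbytes &key (order :little))
  "Return a list of NBYTES octets for the non-negative integer VALUE."
  (let ((bytes (loop for i below nbytes
                     collect (ldb (byte 8 (* 8 i)) value)))) ; low octet first
    (if (eq order :little) bytes (reverse bytes))))

(defun bytes-to-integer (bytes &key (order :little))
  "Rebuild an integer from a list of octets read in the given ORDER."
  (let ((result 0))
    (loop for b in (if (eq order :little) bytes (reverse bytes))
          for i from 0
          do (setf (ldb (byte 8 (* 8 i)) result) b))
    result))

;; Written big-endian, read back with the matching order:
(bytes-to-integer (integer-to-bytes 1 4 :order :big) :order :big)
;; => 1

;; Written big-endian, read back as if little-endian:
(bytes-to-integer (integer-to-bytes 1 4 :order :big) :order :little)
;; => 16777216
```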

