[elephant-cvs] CVS elephant/doc

Sun Mar 25 11:04:39 UTC 2007

Update of /project/elephant/cvsroot/elephant/doc
In directory clnet:/tmp/cvs-serv8936

Modified Files:
	tutorial.texinfo user-guide.texinfo 
Log Message:
Latest documentation edits - still very broken so don't look!

--- /project/elephant/cvsroot/elephant/doc/tutorial.texinfo	2007/03/24 12:16:02	1.7
+++ /project/elephant/cvsroot/elephant/doc/tutorial.texinfo	2007/03/25 11:04:38	1.8
@@ -6,15 +6,15 @@
 @cindex Tutorial
 
 @menu
-* Overview:: Overview of elphant's features
+* Overview:: Overview of elephant's features.
 * Getting Started:: Opening and accessing a store.
 * The Store Root:: Accessing persistent data.
 * Serialization:: Storage semantics for lisp values.
 * Persistent Classes:: Persistent semantics for objects.
 * Rules about Persistent Classes:: What you need to know.
-* Persistent collections:: Keep track of collections of objects.
-* Class Indices:: Simple way to keep track of instances.
-* Using Transactions:: Enabling ACID database properties.
+* Persistent collections:: Keep track of collections of values.
+* Indexing Persistent Classes:: Simple way to keep track of persistent instances.
+* Using Transactions:: Providing ACID database properties.
 @end menu
 
 @node Overview
@@ -377,82 +377,249 @@
 the most recent commits, right?  Note that this can be used as a weak
 form of IPC.  But also note that in particular, if your slot value is
 not an immediate value, reading will cons or allocate the value.  Gets
-are not an expensive operation; you can perform tens of thousands of
-primitive reads per second.  However, if you're concerned, cache
-large values.
+are not an expensive operation; you can perform thousands to tens of
+thousands of primitive reads per second.  However, if you're
+concerned, cache large values in memory.
 
 @node Persistent collections
 @comment node-name, next, previous, up
 @section Persistent collections
 
+The remaining problem outlined in @ref{Serialization} is that
+operations which mutate aggregate objects are not persistent.  While
+we solved this problem for objects, there is no collection type such
+as arrays, hashes or lists which provide this ability.  Elephant
+provides two primary types of collections, a @code{btree} and a
+ at code{indexed-btree}.
+
+We will focus on the core concepts of BTrees in this section, for a
+detailed review including the behavior of indexed BTrees, @pxref{Using
+BTrees}, @ref{Secondary Indices} and @ref{Using Cursors} in the
+ at ref{User Guide}.
 
+Elephant provides a rich data structure called a BTree for storing
+large sets of key-value pairs.  Every key-value pair is stored
+independantly in Elephant just like persistent object slots.
+Therefore they inherit all the nice properties of persistent objects:
+identity, fast serialization / deserialization, no merge conflicts,
+etc.  
+
+The primary interface to @code{btree} objects is through
+ at code{get-value}.  You can also @code{setf} @code{get-value} to store
+key-value pairs.
+
+ at lisp
+(defvar *friends-birthdays* (make-btree))
+=> *FRIENDS-BIRTHDAYS*
+     
+(add-to-root "friends-birthdays" *friends-birthdays*)
+=> #<BTREE @{4951CF6D@}>
+     
+(setf (get-value "Ben" *friends-birthdays*)
+      (encode-universal-time 0 0 0 14 4 1973))
+=> 2312600400
+
+(setf (get-value "Andrew" *friends-birthdays*)
+      (encode-universal-time 0 0 0 22 12 1976))
+=> 2429071200
+     
+(get-value "Andrew" *friends-birthdays*)
+=> 2429071200
+=> T
+     
+(decode-universal-time *)
+=> 0
+   0
+   0
+   22
+   12
+   1976
+   2
+   NIL
+   6
+ at end lisp
+
+In addition to the hash-table like interface, @code{btree} stores
+pairs sorted by the lisp value of the key, lowest to highest.  This is
+works well for numbers, strings, symbols and persistent objects, but
+due to serialization semantics may be strange for other values like
+arrays, lists, standard-objects, etc.  
+
+Because elements are sorted by value, we should be able to iterate
+over all the elements of the BTree in order.  We entered the data in
+reverse alphabetic order, but will read it out in alphabetical order.
+
+ at lisp
+(map-btree (lambda (k v) 
+             (format t "name: ~A utime: ~A~%" k 
+               (subseq (multiple-value-list (decode-universal-time v)) 3 6)))
+           *friends-birthdays*)
+"Andrew"
+"Ben"
+=> NIL
+ at end lisp
 
- at node Class Indices
+But what if we want to read out our friends from oldest to youngest,
+or youngest to oldest?  In the @ref{User Guide}, specifically the
+section on @ref{Secondary indices} you will discover ways to sort
+according to the order defined by a lisp function of the key-value pair.
+
+ at node Indexing Persistent Classes
 @comment node-name, next, previous, up
- at section Class Indices
+ at section Indexing Persistent Classes
+
+Class indices simplify the recording and retrieving of persistent
+objects.  An indexed class stores every instance of the class that is
+created, ensuring that every object is automatically persisted between
+sessions.
+
+ at lisp
+(defpclass friend ()
+  ((name :accessor name :initarg :name)
+   (birthday :initarg :birthday))
+  (:index t))
+=> #<PERSISTENT-METACLASS FRIEND>
+
+(defmethod print-object ((f friend) stream)
+  (format stream "#<~A>" (name f)))
+
+(defun encode-birthday (dmy)
+  (apply #'encode-universal-time 
+    (append '(0 0 0) dmy)))
+
+(defmethod (setf birthday) (dmy (f friend))
+  (setf (slot-value f 'birthday)
+        (encode-birthday dmy))
+  dmy)
+
+(defun decode-birthday (utime)
+  (subseq (multiple-value-list (decode-universal-time utime)) 3 6))
+
+(defmethod birthday ((f friend))
+  (decode-birthday (slot-value f 'birthday)))
+ at end lisp
+
+Notice the class argument ``:index t''.  This tells Elephant to store
+a reference to this class.  Under the covers, there are a set of
+btrees that keep track of classes, but we won't need to worry about
+that as all the functionality has been nicely packaged for you.
+
+We also created our own birthday accessor for convenience so it
+accepts and returns birthdays in a list consisting of month, day and
+year such as @code{(27 3 1972)}.  The index key will be the encoded
+universal time, however.
 
-Class indices are a very convenient way of gaining the efficiency that
-BTrees provide.  If a given object is most often sought by the value
-of one of its slots, which is of course quite common, it is convenient
-to define a class index on that slot, although the same functionality
-can be gained in a more complicated way through the use of secondary
-indices.
-
-The file @file{tests/testindexing.lisp} provides many useful examples
-of both declaring class indexes and using the API to seek objects using them.
-
-The following code from that file in the test ``indexing-range'' demonstrates
-the convenience of a class indexes and the function ``get-instances-by-range''.
-Note in the definition of the ``slot1'' the keyword ``:index'' is used to 
-specify that this slot should be indexed.
-
- at lisp
-      (defclass idx-four ()
-	((slot1 :initarg :slot1 :initform 1 :accessor slot1 :index t))
-	(:metaclass persistent-metaclass))
-      
-
-      (defun make-idx-four (val)
-	(make-instance 'idx-four :slot1 val))
-      
-      (with-transaction ()
-	(mapc #'make-idx-four '(1 1 1 2 2 4 5 5 5 6 10)))
-
-      (let ((x1 (get-instances-by-range 'idx-four 'slot1 2 6))
-	    (x2 (get-instances-by-range 'idx-four 'slot1 0 2))
-	    (x3 (get-instances-by-range 'idx-four 'slot1 6 15))
-	    )
-	(format t " x1 = ~A~%" (mapcar #'slot1 x1))
-      	(format t " x2 = ~A~%" (mapcar #'slot1 x2))
-	(format t " x3 = ~A~%" (mapcar #'slot1 x3))
- at end lisp
-
-Additionally, the test 
- at lisp 
-(do-test 'INDEXING-TIMING) 
- at end lisp
-Can be used to judge the performance of indexing a medium sized dataset.
-
-The file @file{src/elephant/classindex.lisp} provides the source code and
-some crisp documentation of the class indexing system.
-
-Note that for retrieving items, the API is provided by three functions:
-
- at lisp 
-(defgeneric get-instances-by-class (persistent-metaclass))
-(defgeneric get-instances-by-value (persistent-metaclass slot-name value))
-(defgeneric get-instances-by-range (persistent-metaclass slot-name start end))
- at end lisp
-
-By using these functions, any class that is a subclass of persistent-metaclass
-can also be thought of as a container of all of its instances, which are 
-persistent in the database between lisp invocations.  Moreover an individual 
-object can be looked up on O(log n) time via a value on which it is indexed.
-
-At the top of this same file, you will find the a description of the API
-which can be used to dynamically add and remove indexes.  (Adding and 
-removing indexes can also be performed by a re-execution of the ``defclass''
-macro with different values.)
+Now we can easily manipulate all the instances of a class.
+
+ at lisp
+(defun print-friend (friend)
+  (format t " name: ~A birthdate: ~A~%" (name friend) (birthday friend)))
+
+(make-instance 'friend :name "Carlos" :birthday (encode-birthday '(1 1 1972)))
+(make-instance 'friend :name "Adriana" :birthday (encode-birthday '(24 4 1980)))
+(make-instance 'friend :name "Zaid" :birthday (encode-birthday '(14 8 1976)))
+
+(get-instances-by-class 'friends)
+=> (#<Carlos> #<Adriana> #<Zaid>)
+
+(mapcar #'print-friend *)
+ name: Carlos birthdate: (1 1 1972)
+ name: Adriana birthdate: (24 4 1980)
+ name: Zaid birthdate: (14 8 1976)
+=> (#<Carlos> #<Adriana> #<Zaid>)
+ at end lisp
+
+But what if we have thousands of friends?  Aside from never getting
+work done, our get-instances-by-class will be doing a great deal of
+consing, eating up lots of memory and wasting our time.  Fortunately
+there is a more efficient way of dealing with all the instances of a
+class.
+
+ at lisp
+(map-class #'print-friend 'friend)
+ name: Carlos birthdate: (1 1 1972)
+ name: Adriana birthdate: (24 4 1980)
+ name: Zaid birthdate: (14 8 1976)
+=> NIL
+ at end lisp
+
+ at code{map-class} has the advantage that it does not keep references to
+objects after they are processed.  The garbage collector can come
+along, clear references from the weak instance cache so that your
+working set is finite.  The list version above conses all objects into
+memory before you can do anything with them.  The deserialization
+costs are very low in both cases.
+
+Notice that the order in which the records are printed are not sorted
+according to either name or birthdate.  Elephant makes no guarantee
+about the ordering of class elements, so you cannot depend on the
+insertion ordering shown here.
+
+So what if we want ordered elements?  How do we access our friends
+according to name and birthdate?  This is where slot indices come into
+play.
+
+ at lisp
+(defpclass friend ()
+  ((name :accessor name :initarg :name :index t)
+   (birthday :initarg :birthday :index t)))
+ at end lisp
+
+Notice the :index argument to the slots.  Also notice that we dropped
+the class :index argument.  Specifying that a slot is indexed
+automatically registers the class as indexed.  While slot indices
+increase the cost of writes and disk storage, each entry is only
+slightly larger than the size of the slot value.  Numbers, small
+strings and symbols are good candidate types for indexed slots, but
+any value may be used, even different types.
+
+Once we've indexed a slot, we can use another set of
+ at code{get-instances} and @code{map} functions to access objects
+in-order and by their slot value.
+
+ at lisp
+(get-instances-by-value 'friends 'name "Carlos")
+=> (#<Carlos>)
+
+(get-instances-by-range 'friends 'name "Adam" "Devin")
+=> (#<Adriana> #<Carlos>)
+
+(get-instances-by-range 'friend 'birthday (encode-birthday '(1 1 1974)) (encode-birthday '(31 12 1984)))
+=> (#<Zaid> #<Adriana>)
+
+(mapc #'print-friend *)
+ name: Zaid birthdate: (14 8 1976)
+ name: Adriana birthdate: (24 4 1980)
+=> (#<Zaid> #<Adriana>)
+
+(map-class-index #'print-friend 'friend 'name "Carlos" "Carlos")
+ name: Carlos birthdate: (1 1 1972)
+=> NIL
+
+(map-class-index #'print-friend 'friend 'name "Adam" "Devin")
+ name: Adriana birthdate: (24 4 1980)
+ name: Carlos birthdate: (1 1 1972)
+=> NIL
+
+(map-class-index #'print-friend 'friend 'birthday
+                 (encode-birthday '(1 1 1974)) 
+                 (encode-birthday '(31 12 1984)))
+ name: Zaid birthdate: (14 8 1976)
+ name: Adriana birthdate: (24 4 1980)
+=> NIL
+
+(map-class-index #'print-friend 'friend 'birthday nil (encode-birthday '(10 10 1978)))
+ name: Carlos birthdate: (1 1 1972)
+ name: Zaid birthdate: (14 8 1976)
+=> NIL
+
+(map-class-index #'print-friend 'friend 'birthday
+                 (encode-birthday '(10 10 1975))
+                 nil)
+ name: Zaid birthdate: (14 8 1976)
+ name: Adriana birthdate: (24 4 1980)
+=> NIL
+ at end lisp
 
 You can enable/disable class indexing for an entire class.  When you disable
 indexing all references to instances of that class are lost.  If you re-enable
@@ -606,16 +773,14 @@
 to retry the transaction a fixed number of times by re-executing the
 whole body.
 
-You can see in the example that a single statement in lisp can include
-several primitive Elephant operations as in the decf statement in
-withdraw.  It takes some careful thinking to properly implement
-transactions, for a complete treatment @pxref{Transaction details}.
-
-The other thing transactions can give us is the ability to put off
-synchronizing our data to disk.  The expensive part of persistent
-writes is flushing data to disk.  Since a transaction caches values,
-all the read and written values are kept in memory until the
-transaction is complete, this can dramatically improve performance.
+The other value transactions provide is the capability to delay
+flushing dirty data to disk.  The most time-intensive part of
+persistent operations is flushing newly written data to disk.  Using
+the default auto-commit behavior requires a flush on every operation
+which can become very expensive.  Because a transaction caches values,
+all the values read or written are cached in memory until the
+transaction completes, dramatically decreasing the number of flushes
+and the total time taken.
 
 @lisp
 (defpclass test ()
@@ -625,8 +790,9 @@
          (make-instance 'test :slot1 i)))
 @end lisp
 
-This can take a long time.  Here each new objects that is created has
-to independantly write its value to disk and accept a disk flush cost.
+This can take a long time, well over a minute on the CLSQL data store.
+Here each new objects that is created has to independantly write its
+value to disk and accept a disk flush cost.
 
 @lisp
 (time (with-transaction ()
@@ -634,7 +800,8 @@
             (make-instance 'test :slot1 i))))
 @end lisp
 
-Here, .......
+Wrapping this operation in a transaction dramatically increases the
+time from 10's of seconds to a second or less.
 
 @lisp
 (time (with-transaction ()
@@ -642,14 +809,64 @@
             (make-instance 'test :slot1 i))))
 @end lisp
 
-These are huge differences in performance!  
-Of course since we are caching all this data in memory, we cannot have
-infinitely sized transactions.  Large operations need to get split up
-into subtransactions.  When dealing with persistent objects a good
-rule of thumb is to keep your transactions less than 1000 at a time.
+When we increase the number of objects within the transaction, the
+time cost does not go up linearly.  This is because the total time to
+write a hundred simple objects is still dominated by the final
+synchronization step.
+

[56 lines skipped]
--- /project/elephant/cvsroot/elephant/doc/user-guide.texinfo	2007/03/24 13:55:15	1.1
+++ /project/elephant/cvsroot/elephant/doc/user-guide.texinfo	2007/03/25 11:04:38	1.2
@@ -76,49 +76,71 @@
 @code{with-open-controller} macro.  Opening and closing a controller
 is very expensive.
 
+
+ at node{Class indices}
+ at comment node-name, next, previous, up
+ at section Class indicies
+
+You can enable/disable class indexing for an entire class.  When you disable
+indexing all references to instances of that class are lost.  If you re-enable
+class indexing only newly created classes will be stored in the class index.
+You can manually restore them by using @code{find-class-index} to get the 
+clas index BTree if you have an alternate in-memory index.
+
+You can add/remove a secondary index for a slot.  So long as the class index
+remains, this can be done multiple times without losing any data.
+
+There is also a facility for defining 'derived slots'.  These can be non-slot
+parameters which are a function of the class's persistent slot values.  For
+example you can use an index to keep an alternate representation available
+for fast indexing.  If an object has an x,y coordinate, you could define a
+derived index for r,theta which stored references in polar coordinates.
+These would be ordered so you could iterate over a class-index to get objects
+in order of increasing radius from the origin or over a range of theta.
+
+Beware, however, that derived indices have to compute their result every
+time you update any persistent instance's slot.  This is because there is 
+no way to know which persistent slots the derived index value(s) depends
+on.  Thus there is a fairly significant computational cost to objects
+with frequent updates having derived indices.  The storage cost, however,
+may be less as all that is added is the index value and an OID reference
+into the class index.  To add a slot value you add a serialized 
+OID+class-ref+slotname to index value which can be much larger if you
+use long slotnames and package names and unicode.
+
+Thus, the question of if and how a given class should be indexed is 
+very flexible and dynamic, and does not need to be determined at the 
+beginning of your development.  This represents the ability to ``late bind''
+the decision of what to index.  
+
+In general, there is always a tradeoff: an indexed slot increases storage
+associated with that slot and slows down write operations.  Reads however remain
+as fast as for unindexed persistent slots.  The Elephant system
+makes it simple to choose where and when one wants to utilize this tradeoff.
+
+Finally, that file @file{src/elephant/classindex-utils.lisp} documents 
+tools for handling class redefinitions and the policy that should be 
+used for synchronizing the classes with the database.  This process is
+somewhat user customizable; documentation for this exists in the source
+file referenced above.
+
 @node{Using BTrees}
 @comment node-name, next, previous, up
 @section Using BTrees
 
-The btree class are to hash-tables as persistent-objects are to
-ordinary objects. BTrees have a hash-table-like interface, but store
-their keys and values directly as rows in a Sleepycat BTree. Btrees
-may be persisted simply by their OID. Hence they have all the nice
-properties of persistent objects: identity, fast serialization /
-deserialization, no merge conflicts.....
-
-     (defvar *friends-birthdays* (make-btree))
-     => *FRIENDS-BIRTHDAYS*
-     
-     (add-to-root "friends-birthdays" *friends-birthdays*)
-     => #<BTREE {4951CF6D}>
-     
-     (setf (get-value "Andrew" *friends-birthdays*)
-       	   (encode-universal-time 0 0 0 22 12 1976))
-     => 2429071200
-     
-     (setf (get-value "Ben" *friends-birthdays*)
-     	   (encode-universal-time 0 0 0 14 4 1976))
-     => 2407298400
-     
-     (get-value "Andrew" *friends-birthdays*)
-     => 2429071200
-     => T
-     
-     (decode-universal-time *)
-     => 0
-        0
-        0
-        22
-        12
-        1976
-        2
-        NIL
-        6
-
-Because of serialization semantics, BTrees hash on a value, not
-identity. This is probably ok for strings, numbers, and persistent
-things, but may be strange for other values.
+A BTree is a data structure designed for on-disk databases.
+Traditional binary trees are a tree structure that stores a key and
+value and two links in each node.  To get to a value, you compare your
+query key to the node key.  If it is equal, return the value in this
+node.  If it is less, follow the first link and if it is greater,
+follow the second link.  The problem here is that every link requires
+a disk seek.  
+
+The BTree exploits the properties of memory/disk data heirarchies.
+The key properties are that disk seeks are expensive, loading large
+blocks of data is relatively inexpensive after a seek and comparisons
+on objects that are stored in memory is cheap.
+
 
 @node Repository Migration
 @comment node-name, next, previous, up