[elephant-cvs] CVS elephant/doc
ieslick
ieslick at common-lisp.net
Sun Mar 25 11:04:39 UTC 2007
Update of /project/elephant/cvsroot/elephant/doc
In directory clnet:/tmp/cvs-serv8936
Modified Files:
tutorial.texinfo user-guide.texinfo
Log Message:
Latest documentation edits - still very broken so don't look!
--- /project/elephant/cvsroot/elephant/doc/tutorial.texinfo 2007/03/24 12:16:02 1.7
+++ /project/elephant/cvsroot/elephant/doc/tutorial.texinfo 2007/03/25 11:04:38 1.8
@@ -6,15 +6,15 @@
@cindex Tutorial
@menu
-* Overview:: Overview of elphant's features
+* Overview:: Overview of elephant's features.
* Getting Started:: Opening and accessing a store.
* The Store Root:: Accessing persistent data.
* Serialization:: Storage semantics for lisp values.
* Persistent Classes:: Persistent semantics for objects.
* Rules about Persistent Classes:: What you need to know.
-* Persistent collections:: Keep track of collections of objects.
-* Class Indices:: Simple way to keep track of instances.
-* Using Transactions:: Enabling ACID database properties.
+* Persistent collections:: Keep track of collections of values.
+* Indexing Persistent Classes:: Simple way to keep track of persistent instances.
+* Using Transactions:: Providing ACID database properties.
@end menu
@node Overview
@@ -377,82 +377,249 @@
the most recent commits, right? Note that this can be used as a weak
form of IPC. But also note that in particular, if your slot value is
not an immediate value, reading will cons or allocate the value. Gets
-are not an expensive operation; you can perform tens of thousands of
-primitive reads per second. However, if you're concerned, cache
-large values.
+are not an expensive operation; you can perform thousands to tens of
+thousands of primitive reads per second. However, if you're
+concerned, cache large values in memory.
@node Persistent collections
@comment node-name, next, previous, up
@section Persistent collections
+The remaining problem outlined in @ref{Serialization} is that
+operations which mutate aggregate objects are not persistent. While
+we solved this problem for objects, there is no collection type such
+as arrays, hashes or lists which provide this ability. Elephant
+provides two primary types of collections, a @code{btree} and a
+ at code{indexed-btree}.
+
+We will focus on the core concepts of BTrees in this section, for a
+detailed review including the behavior of indexed BTrees, @pxref{Using
+BTrees}, @ref{Secondary Indices} and @ref{Using Cursors} in the
+ at ref{User Guide}.
+Elephant provides a rich data structure called a BTree for storing
+large sets of key-value pairs. Every key-value pair is stored
+independantly in Elephant just like persistent object slots.
+Therefore they inherit all the nice properties of persistent objects:
+identity, fast serialization / deserialization, no merge conflicts,
+etc.
+
+The primary interface to @code{btree} objects is through
+ at code{get-value}. You can also @code{setf} @code{get-value} to store
+key-value pairs.
+
+ at lisp
+(defvar *friends-birthdays* (make-btree))
+=> *FRIENDS-BIRTHDAYS*
+
+(add-to-root "friends-birthdays" *friends-birthdays*)
+=> #<BTREE @{4951CF6D@}>
+
+(setf (get-value "Ben" *friends-birthdays*)
+ (encode-universal-time 0 0 0 14 4 1973))
+=> 2312600400
+
+(setf (get-value "Andrew" *friends-birthdays*)
+ (encode-universal-time 0 0 0 22 12 1976))
+=> 2429071200
+
+(get-value "Andrew" *friends-birthdays*)
+=> 2429071200
+=> T
+
+(decode-universal-time *)
+=> 0
+ 0
+ 0
+ 22
+ 12
+ 1976
+ 2
+ NIL
+ 6
+ at end lisp
+
+In addition to the hash-table like interface, @code{btree} stores
+pairs sorted by the lisp value of the key, lowest to highest. This is
+works well for numbers, strings, symbols and persistent objects, but
+due to serialization semantics may be strange for other values like
+arrays, lists, standard-objects, etc.
+
+Because elements are sorted by value, we should be able to iterate
+over all the elements of the BTree in order. We entered the data in
+reverse alphabetic order, but will read it out in alphabetical order.
+
+ at lisp
+(map-btree (lambda (k v)
+ (format t "name: ~A utime: ~A~%" k
+ (subseq (multiple-value-list (decode-universal-time v)) 3 6)))
+ *friends-birthdays*)
+"Andrew"
+"Ben"
+=> NIL
+ at end lisp
- at node Class Indices
+But what if we want to read out our friends from oldest to youngest,
+or youngest to oldest? In the @ref{User Guide}, specifically the
+section on @ref{Secondary indices} you will discover ways to sort
+according to the order defined by a lisp function of the key-value pair.
+
+ at node Indexing Persistent Classes
@comment node-name, next, previous, up
- at section Class Indices
+ at section Indexing Persistent Classes
+
+Class indices simplify the recording and retrieving of persistent
+objects. An indexed class stores every instance of the class that is
+created, ensuring that every object is automatically persisted between
+sessions.
+
+ at lisp
+(defpclass friend ()
+ ((name :accessor name :initarg :name)
+ (birthday :initarg :birthday))
+ (:index t))
+=> #<PERSISTENT-METACLASS FRIEND>
+
+(defmethod print-object ((f friend) stream)
+ (format stream "#<~A>" (name f)))
+
+(defun encode-birthday (dmy)
+ (apply #'encode-universal-time
+ (append '(0 0 0) dmy)))
+
+(defmethod (setf birthday) (dmy (f friend))
+ (setf (slot-value f 'birthday)
+ (encode-birthday dmy))
+ dmy)
+
+(defun decode-birthday (utime)
+ (subseq (multiple-value-list (decode-universal-time utime)) 3 6))
+
+(defmethod birthday ((f friend))
+ (decode-birthday (slot-value f 'birthday)))
+ at end lisp
+
+Notice the class argument ``:index t''. This tells Elephant to store
+a reference to this class. Under the covers, there are a set of
+btrees that keep track of classes, but we won't need to worry about
+that as all the functionality has been nicely packaged for you.
+
+We also created our own birthday accessor for convenience so it
+accepts and returns birthdays in a list consisting of month, day and
+year such as @code{(27 3 1972)}. The index key will be the encoded
+universal time, however.
-Class indices are a very convenient way of gaining the efficiency that
-BTrees provide. If a given object is most often sought by the value
-of one of its slots, which is of course quite common, it is convenient
-to define a class index on that slot, although the same functionality
-can be gained in a more complicated way through the use of secondary
-indices.
-
-The file @file{tests/testindexing.lisp} provides many useful examples
-of both declaring class indexes and using the API to seek objects using them.
-
-The following code from that file in the test ``indexing-range'' demonstrates
-the convenience of a class indexes and the function ``get-instances-by-range''.
-Note in the definition of the ``slot1'' the keyword ``:index'' is used to
-specify that this slot should be indexed.
-
- at lisp
- (defclass idx-four ()
- ((slot1 :initarg :slot1 :initform 1 :accessor slot1 :index t))
- (:metaclass persistent-metaclass))
-
-
- (defun make-idx-four (val)
- (make-instance 'idx-four :slot1 val))
-
- (with-transaction ()
- (mapc #'make-idx-four '(1 1 1 2 2 4 5 5 5 6 10)))
-
- (let ((x1 (get-instances-by-range 'idx-four 'slot1 2 6))
- (x2 (get-instances-by-range 'idx-four 'slot1 0 2))
- (x3 (get-instances-by-range 'idx-four 'slot1 6 15))
- )
- (format t " x1 = ~A~%" (mapcar #'slot1 x1))
- (format t " x2 = ~A~%" (mapcar #'slot1 x2))
- (format t " x3 = ~A~%" (mapcar #'slot1 x3))
- at end lisp
-
-Additionally, the test
- at lisp
-(do-test 'INDEXING-TIMING)
- at end lisp
-Can be used to judge the performance of indexing a medium sized dataset.
-
-The file @file{src/elephant/classindex.lisp} provides the source code and
-some crisp documentation of the class indexing system.
-
-Note that for retrieving items, the API is provided by three functions:
-
- at lisp
-(defgeneric get-instances-by-class (persistent-metaclass))
-(defgeneric get-instances-by-value (persistent-metaclass slot-name value))
-(defgeneric get-instances-by-range (persistent-metaclass slot-name start end))
- at end lisp
-
-By using these functions, any class that is a subclass of persistent-metaclass
-can also be thought of as a container of all of its instances, which are
-persistent in the database between lisp invocations. Moreover an individual
-object can be looked up on O(log n) time via a value on which it is indexed.
-
-At the top of this same file, you will find the a description of the API
-which can be used to dynamically add and remove indexes. (Adding and
-removing indexes can also be performed by a re-execution of the ``defclass''
-macro with different values.)
+Now we can easily manipulate all the instances of a class.
+
+ at lisp
+(defun print-friend (friend)
+ (format t " name: ~A birthdate: ~A~%" (name friend) (birthday friend)))
+
+(make-instance 'friend :name "Carlos" :birthday (encode-birthday '(1 1 1972)))
+(make-instance 'friend :name "Adriana" :birthday (encode-birthday '(24 4 1980)))
+(make-instance 'friend :name "Zaid" :birthday (encode-birthday '(14 8 1976)))
+
+(get-instances-by-class 'friends)
+=> (#<Carlos> #<Adriana> #<Zaid>)
+
+(mapcar #'print-friend *)
+ name: Carlos birthdate: (1 1 1972)
+ name: Adriana birthdate: (24 4 1980)
+ name: Zaid birthdate: (14 8 1976)
+=> (#<Carlos> #<Adriana> #<Zaid>)
+ at end lisp
+
+But what if we have thousands of friends? Aside from never getting
+work done, our get-instances-by-class will be doing a great deal of
+consing, eating up lots of memory and wasting our time. Fortunately
+there is a more efficient way of dealing with all the instances of a
+class.
+
+ at lisp
+(map-class #'print-friend 'friend)
+ name: Carlos birthdate: (1 1 1972)
+ name: Adriana birthdate: (24 4 1980)
+ name: Zaid birthdate: (14 8 1976)
+=> NIL
+ at end lisp
+
+ at code{map-class} has the advantage that it does not keep references to
+objects after they are processed. The garbage collector can come
+along, clear references from the weak instance cache so that your
+working set is finite. The list version above conses all objects into
+memory before you can do anything with them. The deserialization
+costs are very low in both cases.
+
+Notice that the order in which the records are printed are not sorted
+according to either name or birthdate. Elephant makes no guarantee
+about the ordering of class elements, so you cannot depend on the
+insertion ordering shown here.
+
+So what if we want ordered elements? How do we access our friends
+according to name and birthdate? This is where slot indices come into
+play.
+
+ at lisp
+(defpclass friend ()
+ ((name :accessor name :initarg :name :index t)
+ (birthday :initarg :birthday :index t)))
+ at end lisp
+
+Notice the :index argument to the slots. Also notice that we dropped
+the class :index argument. Specifying that a slot is indexed
+automatically registers the class as indexed. While slot indices
+increase the cost of writes and disk storage, each entry is only
+slightly larger than the size of the slot value. Numbers, small
+strings and symbols are good candidate types for indexed slots, but
+any value may be used, even different types.
+
+Once we've indexed a slot, we can use another set of
+ at code{get-instances} and @code{map} functions to access objects
+in-order and by their slot value.
+
+ at lisp
+(get-instances-by-value 'friends 'name "Carlos")
+=> (#<Carlos>)
+
+(get-instances-by-range 'friends 'name "Adam" "Devin")
+=> (#<Adriana> #<Carlos>)
+
+(get-instances-by-range 'friend 'birthday (encode-birthday '(1 1 1974)) (encode-birthday '(31 12 1984)))
+=> (#<Zaid> #<Adriana>)
+
+(mapc #'print-friend *)
+ name: Zaid birthdate: (14 8 1976)
+ name: Adriana birthdate: (24 4 1980)
+=> (#<Zaid> #<Adriana>)
+
+(map-class-index #'print-friend 'friend 'name "Carlos" "Carlos")
+ name: Carlos birthdate: (1 1 1972)
+=> NIL
+
+(map-class-index #'print-friend 'friend 'name "Adam" "Devin")
+ name: Adriana birthdate: (24 4 1980)
+ name: Carlos birthdate: (1 1 1972)
+=> NIL
+
+(map-class-index #'print-friend 'friend 'birthday
+ (encode-birthday '(1 1 1974))
+ (encode-birthday '(31 12 1984)))
+ name: Zaid birthdate: (14 8 1976)
+ name: Adriana birthdate: (24 4 1980)
+=> NIL
+
+(map-class-index #'print-friend 'friend 'birthday nil (encode-birthday '(10 10 1978)))
+ name: Carlos birthdate: (1 1 1972)
+ name: Zaid birthdate: (14 8 1976)
+=> NIL
+
+(map-class-index #'print-friend 'friend 'birthday
+ (encode-birthday '(10 10 1975))
+ nil)
+ name: Zaid birthdate: (14 8 1976)
+ name: Adriana birthdate: (24 4 1980)
+=> NIL
+ at end lisp
You can enable/disable class indexing for an entire class. When you disable
indexing all references to instances of that class are lost. If you re-enable
@@ -606,16 +773,14 @@
to retry the transaction a fixed number of times by re-executing the
whole body.
-You can see in the example that a single statement in lisp can include
-several primitive Elephant operations as in the decf statement in
-withdraw. It takes some careful thinking to properly implement
-transactions, for a complete treatment @pxref{Transaction details}.
-
-The other thing transactions can give us is the ability to put off
-synchronizing our data to disk. The expensive part of persistent
-writes is flushing data to disk. Since a transaction caches values,
-all the read and written values are kept in memory until the
-transaction is complete, this can dramatically improve performance.
+The other value transactions provide is the capability to delay
+flushing dirty data to disk. The most time-intensive part of
+persistent operations is flushing newly written data to disk. Using
+the default auto-commit behavior requires a flush on every operation
+which can become very expensive. Because a transaction caches values,
+all the values read or written are cached in memory until the
+transaction completes, dramatically decreasing the number of flushes
+and the total time taken.
@lisp
(defpclass test ()
@@ -625,8 +790,9 @@
(make-instance 'test :slot1 i)))
@end lisp
-This can take a long time. Here each new objects that is created has
-to independantly write its value to disk and accept a disk flush cost.
+This can take a long time, well over a minute on the CLSQL data store.
+Here each new objects that is created has to independantly write its
+value to disk and accept a disk flush cost.
@lisp
(time (with-transaction ()
@@ -634,7 +800,8 @@
(make-instance 'test :slot1 i))))
@end lisp
-Here, .......
+Wrapping this operation in a transaction dramatically increases the
+time from 10's of seconds to a second or less.
@lisp
(time (with-transaction ()
@@ -642,14 +809,64 @@
(make-instance 'test :slot1 i))))
@end lisp
-These are huge differences in performance!
-Of course since we are caching all this data in memory, we cannot have
-infinitely sized transactions. Large operations need to get split up
-into subtransactions. When dealing with persistent objects a good
-rule of thumb is to keep your transactions less than 1000 at a time.
+When we increase the number of objects within the transaction, the
+time cost does not go up linearly. This is because the total time to
+write a hundred simple objects is still dominated by the final
+synchronization step.
+
[56 lines skipped]
--- /project/elephant/cvsroot/elephant/doc/user-guide.texinfo 2007/03/24 13:55:15 1.1
+++ /project/elephant/cvsroot/elephant/doc/user-guide.texinfo 2007/03/25 11:04:38 1.2
@@ -76,49 +76,71 @@
@code{with-open-controller} macro. Opening and closing a controller
is very expensive.
+
+ at node{Class indices}
+ at comment node-name, next, previous, up
+ at section Class indicies
+
+You can enable/disable class indexing for an entire class. When you disable
+indexing all references to instances of that class are lost. If you re-enable
+class indexing only newly created classes will be stored in the class index.
+You can manually restore them by using @code{find-class-index} to get the
+clas index BTree if you have an alternate in-memory index.
+
+You can add/remove a secondary index for a slot. So long as the class index
+remains, this can be done multiple times without losing any data.
+
+There is also a facility for defining 'derived slots'. These can be non-slot
+parameters which are a function of the class's persistent slot values. For
+example you can use an index to keep an alternate representation available
+for fast indexing. If an object has an x,y coordinate, you could define a
+derived index for r,theta which stored references in polar coordinates.
+These would be ordered so you could iterate over a class-index to get objects
+in order of increasing radius from the origin or over a range of theta.
+
+Beware, however, that derived indices have to compute their result every
+time you update any persistent instance's slot. This is because there is
+no way to know which persistent slots the derived index value(s) depends
+on. Thus there is a fairly significant computational cost to objects
+with frequent updates having derived indices. The storage cost, however,
+may be less as all that is added is the index value and an OID reference
+into the class index. To add a slot value you add a serialized
+OID+class-ref+slotname to index value which can be much larger if you
+use long slotnames and package names and unicode.
+
+Thus, the question of if and how a given class should be indexed is
+very flexible and dynamic, and does not need to be determined at the
+beginning of your development. This represents the ability to ``late bind''
+the decision of what to index.
+
+In general, there is always a tradeoff: an indexed slot increases storage
+associated with that slot and slows down write operations. Reads however remain
+as fast as for unindexed persistent slots. The Elephant system
+makes it simple to choose where and when one wants to utilize this tradeoff.
+
+Finally, that file @file{src/elephant/classindex-utils.lisp} documents
+tools for handling class redefinitions and the policy that should be
+used for synchronizing the classes with the database. This process is
+somewhat user customizable; documentation for this exists in the source
+file referenced above.
+
@node{Using BTrees}
@comment node-name, next, previous, up
@section Using BTrees
-The btree class are to hash-tables as persistent-objects are to
-ordinary objects. BTrees have a hash-table-like interface, but store
-their keys and values directly as rows in a Sleepycat BTree. Btrees
-may be persisted simply by their OID. Hence they have all the nice
-properties of persistent objects: identity, fast serialization /
-deserialization, no merge conflicts.....
-
- (defvar *friends-birthdays* (make-btree))
- => *FRIENDS-BIRTHDAYS*
-
- (add-to-root "friends-birthdays" *friends-birthdays*)
- => #<BTREE {4951CF6D}>
-
- (setf (get-value "Andrew" *friends-birthdays*)
- (encode-universal-time 0 0 0 22 12 1976))
- => 2429071200
-
- (setf (get-value "Ben" *friends-birthdays*)
- (encode-universal-time 0 0 0 14 4 1976))
- => 2407298400
-
- (get-value "Andrew" *friends-birthdays*)
- => 2429071200
- => T
-
- (decode-universal-time *)
- => 0
- 0
- 0
- 22
- 12
- 1976
- 2
- NIL
- 6
-
-Because of serialization semantics, BTrees hash on a value, not
-identity. This is probably ok for strings, numbers, and persistent
-things, but may be strange for other values.
+A BTree is a data structure designed for on-disk databases.
+Traditional binary trees are a tree structure that stores a key and
+value and two links in each node. To get to a value, you compare your
+query key to the node key. If it is equal, return the value in this
+node. If it is less, follow the first link and if it is greater,
+follow the second link. The problem here is that every link requires
+a disk seek.
+
+The BTree exploits the properties of memory/disk data heirarchies.
+The key properties are that disk seeks are expensive, loading large
+blocks of data is relatively inexpensive after a seek and comparisons
+on objects that are stored in memory is cheap.
+
@node Repository Migration
@comment node-name, next, previous, up
More information about the Elephant-cvs
mailing list