[elephant-devel] BDB hanging with many stalled threads

Wed Oct 15 20:05:50 UTC 2008

Every day or so my web server is hanging due to an Elephant/BDB issue.  I
believe the BDB documentation has a fix for the problem that I'm working to
implement.

The symptoms are as follows: The server uses 99% CPU as hundreds of threads
continue to run.  It appears that each is trying to access objects in the
database, but these database operations are all blocking.  A look at db_stat
shows that there are 300+ active transactions, and a look at
sb-thread:list-all-threads seems to confirm this.  When I attempt to run a
database query (listing objects of a certain class) from the REPL, that
operation blocks as well.  Strangely, when I attempt an (ele:get-from-root
:question-number) I get an error:

There is no applicable method for the generic function
  #<STANDARD-GENERIC-FUNCTION ELEPHANT:GET-VALUE (2)>
when called with arguments
  (:QUESTION-NUMBER NIL).
   [Condition of type SIMPLE-ERROR]

Restarts:
 0: [RETRY] Retry SLIME REPL evaluation request.
 1: [ABORT] Return to SLIME's top level.
 2: [TERMINATE-THREAD] Terminate this thread (#<THREAD "new-repl-thread" RUNNING {1003A1BD91}>)

Backtrace:
  0: ((SB-PCL::FAST-METHOD NO-APPLICABLE-METHOD (T)) #<unavailable argument> #<unavailable argument> #<ST..
  1: (SB-INT:SIMPLE-EVAL-IN-LEXENV (ELEPHANT:GET-FROM-ROOT :QUESTION-NUMBER) #<NULL-LEXENV>)
  2: (SWANK::EVAL-REGION "(ele:get-from-root

I took a look at the BDB FAQ, and the following item seems relevant:

A transactional database environment is hanging, and no threads of control
are making progress.

The most common cause of this failure is a thread of control exiting
unexpectedly, while holding a Berkeley DB mutex or a read/write logical
database lock. If a thread of control exits holding a data structure mutex,
other threads of control will likely lock up fairly quickly, queued behind
the mutex. If a thread of control exits holding a logical database lock,
other threads of control may lock up over a long period of time, as they
will not be blocked until they attempt to acquire the specific page for
which a lock is not available. See the "Deadlock debugging" section of the
Berkeley DB Reference Guide for more information on debugging deadlocks.

Whenever a thread of control exits Berkeley DB holding a mutex or logical lock,
the failure must be resolved. See the "Handling failure in Transactional
Data Store applications" section of the Berkeley DB Reference Guide for more
information.

Finally, the Berkeley DB API is not re-entrant, and it is usually unsafe for
signal handlers to call the Berkeley DB methods. See the "Signal handling"
section of the Berkeley DB Reference Guide for more information.

---

The solution to this problem seems to be to call DB_ENV->failchk
periodically to check for threads that have terminated without releasing
their locks or mutexes.  However, I'm not sure how this could ever occur
given the UNWIND-PROTECT clauses in the current Elephant system.  What do you
all make of this situation?

Resolving this issue will require several changes.  FFI bindings for
DB_ENV->failchk, set_thread_id, set_isalive, and set_thread_count must be
implemented and set up to deal correctly with Lisp threads.  This seems
somewhat hairy to get working on all implementations and OSes.
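
To make the liveness part of this concrete, here is a rough sketch in C of
the process-level test that a callback registered through set_isalive might
perform on a POSIX system.  The helper name process_is_alive is made up for
illustration; it is not part of Berkeley DB or Elephant, and it does not
cover the per-thread check, which would presumably need bookkeeping from the
Lisp runtime.

```c
#define _POSIX_C_SOURCE 200809L

#include <errno.h>
#include <signal.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not a Berkeley DB function): returns 1 if a process
 * with the given pid currently exists, 0 otherwise.
 *
 * kill() with signal 0 delivers no signal; it only performs the existence
 * and permission checks.
 */
int process_is_alive(pid_t pid)
{
    if (kill(pid, 0) == 0)
        return 1;           /* process exists and we may signal it */

    /* EPERM: the process exists but belongs to another user.
       ESRCH: no such process. */
    return errno == EPERM;
}
```

A callback handed to set_isalive could call something like this for the pid
it is given and then consult the Lisp side for the thread id.  That second
half is the part I expect to be implementation-specific.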

Unfortunately this issue crops up fairly frequently for me.  Has anyone else
run into it?

-Red