<div dir="ltr">Every day or so my web server is hanging due to an Elephant/BDB issue.  I believe the BDB documentation has a fix for the problem that I'm working to implement.<br><br>The symptoms are as follows: The server uses 99% CPU as hundreds of threads continue to run.  It appears that each is trying to access objects in the database, but these database operations are all blocking.  A look at db_stat shows that there are 300+ active transactions, and a look at sb-thread:list-all-threads seems to confirm this.  When I attempt to run a database query (listing objects of a certain class) from the REPL, that operation blocks as well.  Strangely, when I attempt an (ele:get-from-root :question-number) I get an error: <br>

<br>There is no applicable method for the generic function<br>  #<STANDARD-GENERIC-FUNCTION ELEPHANT:GET-VALUE (2)><br>

when called with arguments<br>  (:QUESTION-NUMBER NIL).<br>

   [Condition of type SIMPLE-ERROR]<br><br>Restarts:<br> 0: [RETRY] Retry SLIME REPL evaluation request.<br> 1: [ABORT] Return to SLIME's top level.<br> 2: [TERMINATE-THREAD] Terminate this thread (#<THREAD "new-repl-thread" RUNNING {1003A1BD91}>)<br>

<br>Backtrace:<br>  0: ((SB-PCL::FAST-METHOD NO-APPLICABLE-METHOD (T)) #<unavailable argument> #<unavailable argument> #<ST..<br>  1: (SB-INT:SIMPLE-EVAL-IN-LEXENV (ELEPHANT:GET-FROM-ROOT :QUESTION-NUMBER) #<NULL-LEXENV>)<br>

  2: (SWANK::EVAL-REGION "(ele:get-from-root<br><br>I took a look at the BDB FAQ and the following item seems relevant: <span class="boldbodycopy"><br>A transactional database environment is hanging, and no threads of control

are making progress.

</span>

<p class="bodycopy">

The most common cause of this failure is a thread of control exiting

unexpectedly, while holding a Berkeley DB mutex or a read/write logical

database lock.  If a thread of control exits holding a data structure

mutex, other threads of control will likely lock up fairly quickly,

queued behind the mutex.  If a thread of control exits holding a logical

database lock, other threads of control may lock up over a long period

of time, as they will not be blocked until they attempt to acquire the

specific page for which a lock is not available.  See the "Deadlock

debugging" section of the Berkeley DB Reference Guide for more

information on debugging deadlocks.

</p><p class="bodycopy">

Whenever a thread of control exits Berkeley DB holding a mutex or logical

lock, the failure must be resolved.

See the "Handling failure in

Transactional Data Store applications" section of the Berkeley DB

Reference Guide for more information.

</p><p class="bodycopy">

Finally, the Berkeley DB API is not re-entrant, and it is usually unsafe

for signal handlers to call the Berkeley DB methods.  See the "Signal

handling" section of the Berkeley DB Reference Guide for more

information.</p><p class="bodycopy">---<br></p><p class="bodycopy">The solution to this problem seems to be to call DB_ENV->failchk periodically to check for threads that have terminated without releasing their locks or mutexes.  However, I'm not sure how this could ever occur given the UNWIND-PROTECT clauses in the current Elephant system.  What do you all make of this situation?</p>

<p class="bodycopy">Resolving this issue requires several changes.  FFI bindings for DB_ENV->failchk, set_thread_id, set_isalive, and set_thread_count must be implemented and made to work correctly with Lisp threads.  This seems somewhat hairy to get working on all implementations and OSes.</p>
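<p class="bodycopy">To make the thread-liveness half of that concrete, here is a rough C sketch of a registry that a set_isalive callback could consult.  None of this is Elephant or Berkeley DB code: the registry_* functions and MAX_SLOTS are hypothetical names, and it assumes that Lisp threads are ordinary pthreads whose thread-specific-data destructors run when they die, and that the integer slot id is what the set_thread_id callback would report to BDB.  Whether those assumptions hold on every implementation and OS is exactly the hairy part.</p>

```c
#include <pthread.h>
#include <stdint.h>

#define MAX_SLOTS 1024  /* hypothetical cap; the same figure could go to set_thread_count */

static pthread_mutex_t reg_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_once_t reg_once = PTHREAD_ONCE_INIT;
static pthread_key_t reg_key;
static int reg_live[MAX_SLOTS];  /* 1 while the owning thread is running */
static int reg_used;             /* number of slots handed out so far */

/* Thread-specific-data destructor: runs when a registered thread exits
   (return from its start routine, pthread_exit, or cancellation). */
static void mark_dead(void *value)
{
    int slot = (int)(intptr_t)value - 1;
    pthread_mutex_lock(&reg_lock);
    reg_live[slot] = 0;
    pthread_mutex_unlock(&reg_lock);
}

static void make_key(void)
{
    pthread_key_create(&reg_key, mark_dead);
}

/* Call from each thread before it touches the database.
   Returns the thread's slot id, or -1 if the registry is full.
   Slots are never recycled in this sketch. */
int registry_register_self(void)
{
    int slot;
    void *cur;

    pthread_once(&reg_once, make_key);
    cur = pthread_getspecific(reg_key);
    if (cur != NULL)                      /* already registered */
        return (int)(intptr_t)cur - 1;

    pthread_mutex_lock(&reg_lock);
    slot = (reg_used < MAX_SLOTS) ? reg_used++ : -1;
    if (slot >= 0)
        reg_live[slot] = 1;
    pthread_mutex_unlock(&reg_lock);

    /* Store slot + 1 so the value is non-NULL and the destructor fires. */
    if (slot >= 0)
        pthread_setspecific(reg_key, (void *)(intptr_t)(slot + 1));
    return slot;
}

/* What an is_alive callback would ask: is the owner of this slot still running? */
int registry_is_alive(int slot)
{
    int alive;

    if (slot < 0 || slot >= MAX_SLOTS)
        return 0;
    pthread_mutex_lock(&reg_lock);
    alive = slot < reg_used && reg_live[slot];
    pthread_mutex_unlock(&reg_lock);
    return alive;
}

/* Demo worker: registers, reports its slot through arg, then exits. */
void *registry_demo_worker(void *arg)
{
    *(int *)arg = registry_register_self();
    return NULL;
}
```

<p class="bodycopy">With something like this in place, the function handed to DB_ENV->set_isalive would just forward the thread id to registry_is_alive, and a watchdog could call DB_ENV->failchk periodically.  As I understand it, failchk returns DB_RUNRECOVERY when a dead thread left the environment in a state that needs recovery.</p>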

<p class="bodycopy">Unfortunately this issue crops up fairly frequently for me.  Has anyone else run into it?</p><p class="bodycopy"><br></p><p class="bodycopy">-Red<br></p><p class="bodycopy"><br></p></div>