<div dir="ltr">Roughly once a day my web server hangs because of an Elephant/BDB issue. I believe the BDB documentation describes a fix for the problem, which I'm working to implement.<br><br>The symptoms are as follows: the server sits at 99% CPU while hundreds of threads continue to run. Each thread appears to be trying to access objects in the database, but all of these database operations are blocked. db_stat shows 300+ active transactions, and sb-thread:list-all-threads seems to confirm this. When I run a database query (listing objects of a certain class) from the REPL, that operation blocks as well. Strangely, when I attempt an (ele:get-from-root :question-number) I get an error instead: <br>
<br>There is no applicable method for the generic function <br> #<STANDARD-GENERIC-FUNCTION ELEPHANT:GET-VALUE (2)> <br>
when called with arguments <br> (:QUESTION-NUMBER NIL).<br>
[Condition of type SIMPLE-ERROR]<br><br>Restarts:<br> 0: [RETRY] Retry SLIME REPL evaluation request.<br> 1: [ABORT] Return to SLIME's top level.<br> 2: [TERMINATE-THREAD] Terminate this thread (#<THREAD "new-repl-thread" RUNNING {1003A1BD91}>)<br>
<br>Backtrace:<br> 0: ((SB-PCL::FAST-METHOD NO-APPLICABLE-METHOD (T)) #<unavailable argument> #<unavailable argument> #<ST..<br> 1: (SB-INT:SIMPLE-EVAL-IN-LEXENV (ELEPHANT:GET-FROM-ROOT :QUESTION-NUMBER) #<NULL-LEXENV>)<br>
2: (SWANK::EVAL-REGION "(ele:get-from-root<br><br>I took a look at the BDB FAQ, and the following item seems relevant:<span class="boldbodycopy"><br>A transactional database environment is hanging, and no threads of control
are making progress.
</span>
<p class="bodycopy">
The most common cause of this failure is a thread of control exiting
unexpectedly, while holding a Berkeley DB mutex or a read/write logical
database lock. If a thread of control exits holding a data structure
mutex, other threads of control will likely lock up fairly quickly,
queued behind the mutex. If a thread of control exits holding a logical
database lock, other threads of control may lock up over a long period
of time, as they will not be blocked until they attempt to acquire the
specific page for which a lock is not available. See the "Deadlock
debugging" section of the Berkeley DB Reference Guide for more
information on debugging deadlocks.
</p><p class="bodycopy">
Whenever a thread of control exits Berkeley DB holding a mutex or logical
lock, the failure must be resolved.
See the "Handling failure in
Transactional Data Store applications" section of the Berkeley DB
Reference Guide for more information.
</p><p class="bodycopy">
Finally, the Berkeley DB API is not re-entrant, and it is usually unsafe
for signal handlers to call the Berkeley DB methods. See the "Signal
handling" section of the Berkeley DB Reference Guide for more
information.</p><p class="bodycopy">---<br></p><p class="bodycopy">The solution to this problem seems to be to call DB_ENV->failchk periodically to check for threads that have terminated without releasing their locks or mutexes. However, I'm not sure how this could ever occur, given the UNWIND-PROTECT forms in the current Elephant system. What do you all make of this situation?</p>
<p class="bodycopy">Resolving this issue requires several changes. FFI bindings for DB_ENV->failchk, set_thread_id, set_isalive, and set_thread_count must be implemented and set up to deal correctly with Lisp threads. This seems somewhat hairy to get working across all implementations and OSes.</p>
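<p class="bodycopy">To make the liveness part concrete, here is a rough C sketch of the bookkeeping I think the is_alive callback would need. It deliberately does not include db.h: sketch_tid_t stands in for db_threadid_t, and sketch_register, sketch_mark_dead, and sketch_is_alive are hypothetical names of my own, not Berkeley DB or Elephant functions. It also assumes a single process using the environment and is not thread-safe as written.</p>

```c
#include <stddef.h>
#include <sys/types.h>   /* pid_t */
#include <unistd.h>      /* getpid */

/* Hypothetical stand-in for db_threadid_t (the real type comes from db.h). */
typedef unsigned long sketch_tid_t;

/* Assumed upper bound; roughly the value one would pass to set_thread_count. */
#define SKETCH_MAX_THREADS 512

struct sketch_slot {
    sketch_tid_t tid;
    int in_use;
    int alive;
};

/* NOTE: a real version must guard this table with a lock. */
static struct sketch_slot sketch_registry[SKETCH_MAX_THREADS];

static struct sketch_slot *sketch_find(sketch_tid_t tid) {
    for (size_t i = 0; i < SKETCH_MAX_THREADS; i++)
        if (sketch_registry[i].in_use && sketch_registry[i].tid == tid)
            return &sketch_registry[i];
    return NULL;
}

/* Record a thread the first time it touches the database.
 * Returns 0 on success, -1 if the table is full. */
int sketch_register(sketch_tid_t tid) {
    struct sketch_slot *s = sketch_find(tid);
    for (size_t i = 0; i < SKETCH_MAX_THREADS && s == NULL; i++)
        if (!sketch_registry[i].in_use)
            s = &sketch_registry[i];
    if (s == NULL)
        return -1;
    s->tid = tid;
    s->in_use = 1;
    s->alive = 1;
    return 0;
}

/* Called from the Lisp side when a thread exits or is killed. */
void sketch_mark_dead(sketch_tid_t tid) {
    struct sketch_slot *s = sketch_find(tid);
    if (s != NULL)
        s->alive = 0;
}

/* The decision an is_alive callback has to make: nonzero means alive.
 * Threads we never registered are reported as dead. */
int sketch_is_alive(pid_t pid, sketch_tid_t tid) {
    if (pid != getpid())
        return 1;  /* single-process assumption: cannot judge other processes */
    struct sketch_slot *s = sketch_find(tid);
    return s != NULL && s->alive;
}
```

<p class="bodycopy">As I read the BDB docs, the real wiring would be to call set_thread_count, set_thread_id, and set_isalive on the environment before opening it, have the is_alive callback consult a table like this one, and then call DB_ENV->failchk periodically. The Lisp side would need to call something like sketch_mark_dead whenever a thread dies, which is exactly the part that looks implementation-specific.</p>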
<p class="bodycopy">Unfortunately this issue crops up fairly frequently for me. Has anyone else run into it?</p><p class="bodycopy"><br></p><p class="bodycopy">-Red<br></p><p class="bodycopy"><br></p></div>