Hi Hans,<br><br>Yes, that is the problem.<br><br>Here is the result of my swank session (*cheshire* being the variable holding my acceptor):<br><br><blockquote style="margin:0px 0px 0px 0.8ex;border-left:1px solid rgb(204,204,204);padding-left:1ex" class="gmail_quote">
CL-USER> (hunchentoot::taskmaster-max-thread-count (hunchentoot::acceptor-taskmaster *cheshire*))<br>
100<br>CL-USER> (hunchentoot::taskmaster-request-count (hunchentoot::acceptor-taskmaster *cheshire*))<br>
100<br>CL-USER> (hunchentoot::taskmaster-max-accept-count (hunchentoot::acceptor-taskmaster *cheshire*))<br>120<br>CL-USER> (setf (hunchentoot::taskmaster-request-count (hunchentoot::acceptor-taskmaster *cheshire*)) 0)<br>
0<br>CL-USER> (hunchentoot::note-free-connection (hunchentoot::acceptor-taskmaster *cheshire*))<br>0<br></blockquote><br>After that, the server resumed serving requests without any further problems.<br><br>What is weird, however, is that I never have the "Can't handle a new request, too many request threads already" error message in either of my logs.<br>
<br>This may be a long shot, but by reading handle-incoming-connection (<a href="https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L288">https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L288</a>) and create-request-handler-thread (<a href="https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L355">https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L355</a>), I think that if an error arises while creating a thread (such as the "getpeername": ENOTCONN [...] message), the counter is not decremented because the thread is never started (thus doesn't reach the dynamic scope of the unwind-protect).<br>
<br>I sent a pull request with the patch for this idea and will keep you posted if that solved the bug or not. I should be fixed by tomorrow since the bug occurs after roughly 9 hours of operations.<br><br>Thanks for your help.<br>
<br><div class="gmail_quote">2012/8/15 Hans Hübner <span dir="ltr"><<a href="mailto:hans.huebner@gmail.com" target="_blank">hans.huebner@gmail.com</a>></span><br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
Hi Mathieu,<br>
<br>
you may want write a log entry in the DO clause of the loop in this function:<br>
<br>
<a href="https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L250" target="_blank">https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L250</a><br>
<br>
It could be that somehow, you're running out of worker threads, but<br>
the code fails to correctly decrement the free worker count. You can<br>
also inspect request-count slot of the taskmaster when you see the<br>
hanging server again.<br>
<br>
Keep us posted about what you find.<br>
<span><font color="#888888">-Hans<br>
</font></span><div><div><br>
On Wed, Aug 15, 2012 at 6:02 PM, Mathieu Lemoine <<a href="mailto:mlemoine@mentel.com" target="_blank">mlemoine@mentel.com</a>> wrote:<br>
> Hi Hans,<br>
><br>
> I though this was the problem because of the error message I find in my<br>
> *standard-error* log:<br>
> Error while creating worker thread for new incoming connection: Socket error<br>
> in "getpeername": ENOTCONN (Transport endpoint is not connected)<br>
> However, I noticed reading your patches, this is the error message<br>
> introduced by the patch.<br>
> Since the error message is present many times, I think your code is working,<br>
> otherwise the acceptor would crash instantly.<br>
><br>
> Currently, the symptoms I saw are these:<br>
> - Timeout when trying to reach port 80 (on which Hunchentoot is listening)<br>
> - No error message in either *standard-error* nor *standard-output* logs<br>
> - Only one thread (hunchentoot listener?) running in addition to the main<br>
> process thread (as shown by htop)<br>
><br>
> This problem occurs in production environment after a few hours of<br>
> operations. I restarted it once again with a swank server this time.<br>
> Next time the problem occurs, I will try to dig in it to gather more<br>
> informations on the state of my software.<br>
><br>
> I still don't know what is causing the problem or how to reproduce it, but I<br>
> will try to check it up and keep you posted.<br>
><br>
> Thanks for your time and help.<br>
> Mathieu.<br>
><br>
><br>
> 2012/8/15 Hans Hübner <<a href="mailto:hans.huebner@gmail.com" target="_blank">hans.huebner@gmail.com</a>><br>
>><br>
>> Hi Mathieu,<br>
>><br>
>> I have mined the Subversion repository for the commits that you<br>
>> referenced, and I found them to represent this patch:<br>
>><br>
>> Index: taskmaster.lisp<br>
>> ===================================================================<br>
>> --- taskmaster.lisp (revision 4427)<br>
>> +++ taskmaster.lisp (revision 4435)<br>
>> @@ -127,9 +127,21 @@<br>
>><br>
>> #-:lispworks<br>
>> (defmethod handle-incoming-connection ((taskmaster<br>
>> one-thread-per-connection-taskmaster) socket)<br>
>> - (bt:make-thread (lambda ()<br>
>> - (process-connection (taskmaster-acceptor<br>
>> taskmaster) socket))<br>
>> - :name (format nil "Hunchentoot worker \(client:<br>
>> ~A)" (client-as-string socket))))<br>
>> + ;; We are handling all conditions here as we want to make sure that<br>
>> + ;; the acceptor process never crashes while trying to create a<br>
>> + ;; worker thread. One such problem exists in<br>
>> + ;; GET-PEER-ADDRESS-AND-PORT which can signal socket conditions on<br>
>> + ;; some platforms in certain situations.<br>
>> + (handler-case<br>
>> + (bt:make-thread (lambda ()<br>
>> + (process-connection (taskmaster-acceptor<br>
>> taskmaster) socket))<br>
>> + :name (format nil "Hunchentoot worker \(client:<br>
>> ~A)" (client-as-string socket)))<br>
>> +<br>
>> + (error (cond)<br>
>> + ;; Need to bind *ACCEPTOR* so that LOG-MESSAGE can do its work.<br>
>> + (let ((*acceptor* (taskmaster-acceptor taskmaster)))<br>
>> + (log-message *lisp-errors-log-level*<br>
>> + "Error while creating worker thread for new<br>
>> incoming connection: ~A" cond)))))<br>
>><br>
>> ;; LispWorks implementation<br>
>><br>
>> Now, the patch actually does not apply, but it was on trunk and has<br>
>> been evolved into the current version:<br>
>><br>
>> <a href="https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L355" target="_blank">https://github.com/edicl/hunchentoot/blob/master/taskmaster.lisp#L355</a><br>
>><br>
>> If I understand you correctly, you say that the code does not work.<br>
>> Can you somehow provide a way to reproduce the problem?<br>
>><br>
>> Thanks,<br>
>> Hans<br>
>><br>
>> On Wed, Aug 15, 2012 at 3:29 PM, Mathieu Lemoine <<a href="mailto:mlemoine@mentel.com" target="_blank">mlemoine@mentel.com</a>><br>
>> wrote:<br>
>> > Hi,<br>
>> ><br>
>> > I am currently hitting the problem described in the following thread:<br>
>> > <a href="http://lists.common-lisp.net/pipermail/tbnl-devel/2009-July/004768.html" target="_blank">http://lists.common-lisp.net/pipermail/tbnl-devel/2009-July/004768.html</a><br>
>> ><br>
>> > I'm using SBCL 1.0.58 with hunchentoot 1.2.3 and usocket 0.5.5 (both<br>
>> > from<br>
>> > quicklisp).<br>
>> ><br>
>> > I understand the long-term fix will be some hard work. However, in this<br>
>> > thread, Hans is referencing a few diff hunks :<br>
>> ><br>
>> > <a href="http://bknr.net/trac/changeset/4430?format=diff&new=4430" target="_blank">http://bknr.net/trac/changeset/4430?format=diff&new=4430</a><br>
>> > <a href="http://bknr.net/trac/changeset/4434?format=diff&new=4434" target="_blank">http://bknr.net/trac/changeset/4434?format=diff&new=4434</a><br>
>> > <a href="http://bknr.net/trac/changeset/4435?format=diff&new=4435" target="_blank">http://bknr.net/trac/changeset/4435?format=diff&new=4435</a><br>
>> ><br>
>> > Since the old trac is not working anymore, could it be possible to<br>
>> > either<br>
>> > point me toward the equivalent commits in the github repository or the<br>
>> > content of these patches.<br>
>> ><br>
>> > By the way, Peter, if you're still around, did you need to apply only<br>
>> > 4435<br>
>> > or both this and 4434 in order to solve the second issue (unbound<br>
>> > variable<br>
>> > on *acceptor*)?<br>
>> ><br>
>> > If anybody has information with respect to these issues, including the<br>
>> > corresponding bug report in usocket, so I can do some follow up with<br>
>> > them,<br>
>> > it would be greatly appreciated.<br>
>> ><br>
>> > Thanks in advance for your help and time.<br>
>> ><br>
>> > Mathieu.<br>
>> ><br>
>> > _______________________________________________<br>
>> > tbnl-devel site list<br>
>> > <a href="mailto:tbnl-devel@common-lisp.net" target="_blank">tbnl-devel@common-lisp.net</a><br>
>> > <a href="http://common-lisp.net/mailman/listinfo/tbnl-devel" target="_blank">http://common-lisp.net/mailman/listinfo/tbnl-devel</a><br>
>><br>
>> _______________________________________________<br>
>> tbnl-devel site list<br>
>> <a href="mailto:tbnl-devel@common-lisp.net" target="_blank">tbnl-devel@common-lisp.net</a><br>
>> <a href="http://common-lisp.net/mailman/listinfo/tbnl-devel" target="_blank">http://common-lisp.net/mailman/listinfo/tbnl-devel</a><br>
><br>
><br>
><br>
> _______________________________________________<br>
> tbnl-devel site list<br>
> <a href="mailto:tbnl-devel@common-lisp.net" target="_blank">tbnl-devel@common-lisp.net</a><br>
> <a href="http://common-lisp.net/mailman/listinfo/tbnl-devel" target="_blank">http://common-lisp.net/mailman/listinfo/tbnl-devel</a><br>
<br>
_______________________________________________<br>
tbnl-devel site list<br>
<a href="mailto:tbnl-devel@common-lisp.net" target="_blank">tbnl-devel@common-lisp.net</a><br>
<a href="http://common-lisp.net/mailman/listinfo/tbnl-devel" target="_blank">http://common-lisp.net/mailman/listinfo/tbnl-devel</a><br>
</div></div></blockquote></div><br>