[Ecls-list] Slightly disruptive change (in threads)

Fri Feb 17 19:30:21 UTC 2012

On Fri, 17 Feb 2012 10:31:12 +0100
Juan Jose Garcia-Ripoll <juanjose.garciaripoll at googlemail.com> wrote:

> Is there a way I can reproduce those problems here? I am puzzled by the
> stack overflow. Why would it happen? I definitely need to gdb that.

ECL was built as follows on NetBSD/amd64, using an up-to-date netbsd-5
branch (more recent than the netbsd-5.x.x releases, but less in flux
than -current):

$ export CFLAGS="-O2 -g"
$ export LDFLAGS='-g'
$ ./configure --prefix=/usr/local/ecl --enable-unicode --enable-threads --with-__thread=no --enable-rpath --with-system-boehm=yes --with-system-gmp=yes --with-gmp-prefix=/usr/pkg --with-dffi=system
$ make
# make install

Library versions on my system are:

gmp-5.0.2
boehm-gc-7.1
libffi-3.0.9

I used the httpd at:

http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/mmondor/mmsoftware/cl/server/
(cvs -z3 -d:pserver:anoncvs@cvs.pulsar-zone.net:/cvsroot co mmondor/mmsoftware/cl/server)

The test.lisp file contains example code to build and launch the server,
which should then listen on localhost:7777 and serve files
from /tmp/htdocs/.  It also has a few dynamic tests: /test, /chat
and /names.  The code is unfortunately still somewhat of a mess, with
the configuration not yet separated out.

*chat-lines-file* - path to file to save /chat messages to
*name-entries-file* - path to file to save /name entries to
In HTTPD-INIT, the VHOST-REGISTER call sets various parameters for the
vhost, including where htdocs resides for static files.

I then tested it using a number of runs of the Apache benchmark program:
  ab -c16 -n1000 http://127.0.0.1:7777/
which rarely causes problems.  Whenever I try with -n5000 (which,
interestingly, doesn't increase concurrency, but might perhaps keep
running during gc runs or such), some pauses randomly happen, and then
the image often has problems.  The last time it was a stack overflow
error, but at other times it can be another random uncaught exception,
or an exception about an invalid memory access handling a SIGSEGV.
Sometimes some connections end up in the CLOSE_WAIT and FIN_WAIT_2
states; the HTTPd then simply cannot handle requests anymore unless
restarted, and one of the threads is stuck in a RUN loop.  It sometimes
takes considerable stress-testing over a while for any bug to show up.
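
For what it's worth, the repeated runs could be scripted along these
lines.  This is only a rough sketch: count_state and stress are
hypothetical helper names (not part of the server code), the
ab-run-N.log file names are made up, and it assumes ab, awk and netstat
are in PATH and that "netstat -an" prints the TCP state in the last
column.

```shell
#!/bin/sh
# Count sockets in a given TCP state for a given port, reading
# `netstat -an`-style output on stdin.  Addresses may use either
# "host.port" or "host:port" notation.
# Usage: netstat -an | count_state 7777 CLOSE_WAIT
count_state() {
    awk -v port="$1" -v state="$2" '
        # Only consider lines whose last column is the requested state,
        # then look for an address column ending in the port.
        $NF == state {
            for (i = 1; i < NF; i++)
                if ($i ~ ("[.:]" port "$")) { n++; break }
        }
        END { print n + 0 }
    '
}

# Run the heavier benchmark a number of times and report the sockets
# left in CLOSE_WAIT / FIN_WAIT_2 after each run (hypothetical helper).
# Usage: stress 10
stress() {
    runs=$1
    i=1
    while [ "$i" -le "$runs" ]; do
        ab -c16 -n5000 http://127.0.0.1:7777/ > "ab-run-$i.log" 2>&1
        rc=$?
        cw=$(netstat -an | count_state 7777 CLOSE_WAIT)
        fw=$(netstat -an | count_state 7777 FIN_WAIT_2)
        echo "run $i: ab exit=$rc CLOSE_WAIT=$cw FIN_WAIT_2=$fw"
        i=$((i + 1))
    done
}
```

After sourcing the file, "stress 10" would perform ten runs; counts
that stay non-zero after a run would correspond to the stuck state
described above.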

I'm not sure whether these are still related to boehm-gc, as was the
case last time, when I experienced startup crashes on both NetBSD and
Linux.  That startup issue no longer occurs for now, though.

I'm unfortunately not familiar enough with boehm-gc, but is it possible
that it still uses OS-provided mutexes while ECL uses its own locks?  If
so, could that cause issues?  Or can ECL supply boehm-gc with its own
locking primitives to use instead?  Does it matter at all?

Thanks,
-- 
Matt