[Ecls-list] Improving ECL (and my software :)

Matthew Mondor mm_lists at pulsar-zone.net
Tue Sep 6 13:59:32 UTC 2011


Hello again,

I recently wrote a test HTTPd for ECL.  It's getting along quite well,
and can handle ~2000 requests/second on a dual Core2 system.  If it
eventually becomes very stable I intend to isolate it as a library for
use by application servers, on which frameworks and applications could
be written.  But once it was complete enough to write a first dynamic
application to stress-test it, some oddities surfaced.

There are two issues.  The first, which I was already hitting all along
while developing it, appears to be a race condition of sorts.  Even
while loading, ECL can get stuck in a busy loop or crash outright, yet
this doesn't happen every time.  When loading succeeds, one of the two
(busy-looping or crashing) eventually occurs at random, but this can
take anywhere from minutes to days.  If it's not a race condition, it
could also be due to memory corruption happening somewhere in the CL C
library.

I'm not used to debugging code in gdb with this many threads and
spurious signals.  There also appears to be a problem on the NetBSD
branch I'm using with live debugging of threaded applications in gdb
(thread-related features only work properly on core dumps).  So
yesterday I also set up ECL+Emacs+SLIME+test-httpd.lisp on Linux, where
perhaps I'll find out more.

When I audited the thread locking code some weeks back, I noticed
various things which might lead to race conditions, and I have also
written alternative mutex code.  Unfortunately, I'm not sure that this
solves anything: in the short time I've used it I've still seen
problems.  Among the potential problems I spotted were the use of
recursive mutexes everywhere, even for locks meant to be non-recursive,
along with custom recursion-counting code, and a check for the owner in
with-lock.  I attach here the alternative versions of mutex.d and
mp.lsp that I briefly tried a few weeks back.  It's possible that they
do fix some of the issues, but I'm not sure yet.
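
A minimal sketch of how the locking path described above could be
exercised in isolation, outside the server.  It is not taken from the
attached files.  It assumes an ECL built with threads and recent
enough to provide mp:process-join; the names *hits*, *hits-lock*,
count-hit and run-workers are hypothetical.  The lock is explicitly
requested as non-recursive.

```lisp
;; Sketch only: requires an ECL built with thread support (the MP package).
;; *HITS*, *HITS-LOCK*, COUNT-HIT and RUN-WORKERS are hypothetical names.
(defvar *hits* 0)
(defvar *hits-lock* (mp:make-lock :name "hits" :recursive nil))

(defun count-hit ()
  ;; MP:WITH-LOCK acquires the lock, runs the body, then releases the lock.
  (mp:with-lock (*hits-lock*)
    (incf *hits*)))

(defun run-workers (nthreads iterations)
  ;; Start NTHREADS threads that each call COUNT-HIT ITERATIONS times,
  ;; wait for all of them, and return the final counter value.
  (let ((workers (loop for i below nthreads
                       collect (mp:process-run-function
                                (format nil "worker-~D" i)
                                (lambda ()
                                  (dotimes (j iterations)
                                    (count-hit)))))))
    (mapc #'mp:process-join workers)
    *hits*))
```

If the locks behave, (run-workers 4 100000) on a fresh counter should
return exactly 400000 every time; a lower count, a hang or a crash
would point at the locking code rather than at the HTTP server.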

Unfortunately, these kinds of problems are usually the hardest to fix.
CLOS is not involved in the server code, except where standard generic
functions are used.  I would appreciate it if others could help audit
the code and/or confirm whether they also experience this.  My CL
experience is also limited, having mostly used C before.  The test
server code is available and requires no external dependencies other
than ECL:

cvs -z3 -d:pserver:anoncvs@cvs.pulsar-zone.net:/cvsroot co mmondor/mmsoftware/cl/server
http://cvs.pulsar-zone.net/cgi-bin/cvsweb.cgi/mmondor/mmsoftware/cl/server/

To test, simply change the options at the bottom of test-httpd.conf
(particularly the default vhost and the address/port to bind).  There
are also debug options/features at the top (note that :beep is
NetBSD-specific though).  Then:

(setf *default-pathname-defaults* #P"<path-to-server-code-directory>")
(mapc #'compile-file '("dlist.lisp" "character.lisp" "html.lisp" "ecl-mp-server.lisp" "test-httpd.lisp"))
(load "test-httpd")

Then it should be ready (when it doesn't crash while loading).  If the
debug feature is enabled for it (:test), then /test is available and
shows various information such as the GET/POST data passed, the
version, etc.  The little test application I wrote yesterday is
available as /chat.


The other problem, which I only discovered last night when testing
the first dynamic application, appears to be Unicode-related.  In the
test application, many messages of various lengths can be entered and
everything is fine.  Yet if I start copy-pasting UTF-8 from
UTF-8-demo.txt
(http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-demo.txt), then
eventually the document is no longer sent in full: the browser keeps
waiting for it to finish loading, but it never does (a partial page
results).

At first I thought the latter problem had to do with disabled
TCP_NODELAY, SO_LINGER and possibly a flushing issue, but I disabled
those options and even tried a variant using write-char and
finish-output, with similar results.  It also doesn't seem to be
related to some string size limit or the like, since the output looks
complete when I log it.
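
For reference, a minimal sketch of what such a write-char and
finish-output variant might look like, together with a helper that
counts how many octets a string occupies once encoded as UTF-8.  This
is not the server's code.  The names write-string-by-char and
utf-8-octet-count are hypothetical.  Comparing the octet count with
what the client actually receives is only one way to check whether the
logged output and the bytes on the wire agree; it does not assume any
particular cause.

```lisp
;; Sketch only; both function names are hypothetical, not from the server.
(defun write-string-by-char (string stream)
  "Write STRING to STREAM one character at a time, then flush the stream.
Returns the number of characters written."
  (loop for ch across string
        do (write-char ch stream))
  (finish-output stream)
  (length string))

(defun utf-8-octet-count (string)
  "Return the number of octets STRING occupies when encoded as UTF-8."
  (loop for ch across string
        for code = (char-code ch)
        sum (cond ((< code #x80) 1)      ; ASCII
                  ((< code #x800) 2)     ; two-octet sequences
                  ((< code #x10000) 3)   ; rest of the BMP
                  (t 4))))               ; supplementary planes
```

For text such as UTF-8-demo.txt the two numbers differ, since most
non-ASCII characters take two or three octets each.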


Both problems occur on both NetBSD and Linux, so it doesn't appear to
be a kernel or libc issue.  Both systems run 32-bit software and are
i686 (one a P4 and the other a Core2).  The latest ECL from CVS/GIT is
used, built with threads and unicode support.


Thanks for any help,
-- 
Matt
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: mutex.d.mmondor
URL: <https://mailman.common-lisp.net/pipermail/ecl-devel/attachments/20110906/86c31d01/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: mp.lsp.mmondor
URL: <https://mailman.common-lisp.net/pipermail/ecl-devel/attachments/20110906/86c31d01/attachment-0001.ksh>

