[Ecls-list] ECL stability tips
Matthew Mondor
mm_lists at pulsar-zone.net
Sat Nov 30 16:04:58 UTC 2013
While working on a new project written in Lisp for ECL, I ran into
some stability issues. After investigating them and running some
tests, things now seem much more stable. Ideally this should
eventually be documented properly and officially, but I thought I'd
share these notes for now. If someone can adapt some of this into
actual documentation, they're welcome to. Other than the wiki, the
current documentation seems to be DocBook, which I personally have no
experience with yet.
- One issue involved unresponsive busy loops with ENOMEM errors
encountered by ECL and libgc.
NetBSD (among other multiuser systems) implements soft/current and
hard/maximum limits on various resources and these are configurable
via login classes (login.conf) and sysctl (or setrlimit(2)) per
process. Exceeding the soft datasize limit results in ENOMEM errors;
a process may optionally raise its soft limit until the maximum limit
is reached (it seems that neither ECL nor libgc does so
automatically, though, and neither adapts its own limits to the
current rlimits).
The default ECL memory heap size used with Boehm-GC seems to be 1GB
(it is configurable using the --heap-size option, or with the
EXT:SET-LIMIT function). If that size is reached, ECL signals a
condition with the option to grow the heap. If the OS-specific limit
is reached before the ECL-configured limit, disastrous
consequences can arise: ECL loops endlessly, attempting to
signal a condition but getting ENOMEM errors while doing so. This
could possibly be worked around using some preallocated memory for
that very situation. However, ensuring that the ECL heap limit is
always smaller than the OS-set limit prevents this situation.
Similar problems might occur with other limits such as the stack or
file descriptors. I suspect that reaching the fd limit is less
critical than the stack or heap limit. ECL's stack limit should also
be smaller than the OS's stacksize soft limit. These default limits
are set in src/c/main.d.
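The check described above can be sketched in C. This is only a minimal
sketch, assuming a POSIX system with getrlimit(2)/setrlimit(2); the
function names are made up for illustration and are not part of ECL or
libgc. A launcher could run something like it before starting ECL, so
that the configured heap size stays below the datasize soft limit.

```c
#include <assert.h>
#include <sys/resource.h>

/* Hypothetical helper: returns 1 if `wanted` bytes fit under the soft
 * limit of `resource` while leaving `margin` bytes spare, 0 if they do
 * not, and -1 if getrlimit() itself failed. */
int fits_under_soft_limit(int resource, rlim_t wanted, rlim_t margin)
{
    struct rlimit rl;

    assert(wanted != RLIM_INFINITY);    /* a finite size is expected */
    if (getrlimit(resource, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)   /* no soft limit configured */
        return 1;
    if (rl.rlim_cur < margin)           /* avoid unsigned underflow */
        return 0;
    return wanted <= rl.rlim_cur - margin;
}

/* Hypothetical helper: raise the soft limit of `resource` up to its
 * hard limit, which an unprivileged process is allowed to do.
 * Returns 0 on success, -1 on failure. */
int raise_soft_limit_to_hard(int resource)
{
    struct rlimit rl;

    if (getrlimit(resource, &rl) != 0)
        return -1;
    rl.rlim_cur = rl.rlim_max;
    return setrlimit(resource, &rl);
}
```

For the heap case the resource would be RLIMIT_DATA, and RLIMIT_STACK
for the stack. The margin matters because the process needs memory
beyond the Lisp heap itself.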
- Another problem I recently faced had to do with Boehm-GC (libgc)
deadlocking when it needs to collect or grow (it then goes
through the routine GC_collect_or_expand() ->
GC_try_to_collect_inner() -> GC_stopped_mark() -> GC_stop_world()).
To stop the world and perform a garbage collection run, it uses
OS-specific code; on several POSIX systems it uses pthread_kill() to
interrupt every thread with a signal, causing each of them to invoke
the GC_suspend_handler() signal handler. Those threads then wait for
another signal or on a lock/condvar to resume after collection.
Unfortunately that part is complex and error-prone and, considering
all the OS specifics, libgc may be more stable and better tested on
some systems than others. The ECL-supplied gc (7.1.9 if I remember
correctly) is slower on NetBSD than the one I had compiled from
pkgsrc (7.2), but it took me longer to reproduce the issue with it.
Stdio FILE handles have internal buffer state, which is protected by
mutexes on NetBSD when a process is linked with libpthread. It
appears that libgc has concurrency issues when it attempts to collect
while stdio is heavily used. I could not reproduce that issue on
Linux+glibc yet, but I assume libgc is also better tested there.
I wrote a minimal test to reproduce the issue, and it indeed had to
do with threaded libgc+stdio. I then modified my application to
use :CSTREAM NIL when opening its output FIFO file, and the
application uptime was noticeably better. There were, however, two
remaining issues. First, only one character at a time was written,
even using WRITE-SEQUENCE with a LATIN-1 or PASSTHROUGH external
format (meaning in this case several thousand write(2) syscalls per
second, surrounded by other syscalls related to interrupt control).
Despite this I decided to stress test the application as it was, and
it had a decent uptime, until the same issue happened again. Second,
I then noticed a spurious fflush(3) call, which might be that of
eformat_write*().
To solve both issues and be able to move forward with testing, I
wrote a small WRITE-SEQUENCE replacement using C-INLINE, as I had
done for Crow-HTTPd. Performance improved dramatically (I use custom
BASE-CHAR vectors as buffers and large direct write(2) calls on the
descriptor), and so far it has been stable (although it's still being
stress tested).
To mitigate this, ECL could be made to avoid stdio and use its own
buffering streams on top of file descriptors. However, this would not
solve the issue where libraries used through FFI need to be supplied
a stdio FILE handle. ECL currently uses unbuffered file descriptors
for stdin/stdout/stderr when compiled with threads, possibly partly
for that reason, but the accompanying comment also mentions blocking.
With some care to also avoid stdio in the application itself,
stability seems dramatically increased.
This has not been verified yet but I suspect that the occasional
locking issues I observe with C-c C-c during live interactive
development might also be related to stdio usage.
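The "large direct write(2)" approach mentioned above boils down to a
loop like the following. This is a generic sketch, not the code used
in my application or in Crow-HTTPd; write_all is a hypothetical name,
and the body is the kind of C one would place behind a C-INLINE form.
It assumes a blocking descriptor (on a non-blocking FIFO, EAGAIN would
be returned as an error).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: write all `len` bytes of `buf` to `fd` using
 * write(2) directly, bypassing stdio and its locks entirely.
 * Retries when interrupted by a signal and continues after short
 * writes.  Returns 0 on success, -1 on error with errno set. */
int write_all(int fd, const void *buf, size_t len)
{
    const unsigned char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);

        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before any data went out */
            return -1;
        }
        p += (size_t)n;         /* short write: advance and try again */
        len -= (size_t)n;
    }
    return 0;
}
```

Handing the whole buffer to a single call keeps the syscall count
close to one per buffer, instead of one per character.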
- An issue discovered earlier, when writing Crow-HTTPd, was also
related to libgc+threads race conditions, but at thread termination.
It seems that the Mono runtime was also affected by that on Solaris.
For simplicity I had set up Crow to avoid shrinking the thread pool,
but there might have been other solutions. In any case, Crow then
became very stable. It still runs my site and has uptimes as long as
the server's (occasionally interrupted for security software updates).
- Probably also worth mentioning: ECL itself avoids synchronizing
every access to potentially shared user objects, except where
necessary, as for packages. This means that the user is responsible
for providing explicit synchronization for concurrently accessed
objects, including hash tables and instance objects, using MP
primitives. This also has to be considered for interactive
development, where the REPL might be used to alter the state of live
objects. Ideally a single access library should be written to provide
the synchronization, so that both the software and the REPL user go
through it.
- It is very important to take heed when ECL issues warnings about an
object being of type NIL. This occurs when optimizations are used and
conflicting type annotations exist for a variable. If ECL issues this
warning for a vector and the user lowers SAFETY to 1 or below and
raises SPEED, it might optimize access into inline C using the
largest native machine word (64-bit on amd64), rather than the
expected word size. On the other hand, if no large-scope DECLAIM
TYPE annotation exists, every function may supply its own conflicting
local-scope DECLARE TYPE annotation, and ECL will silently let you
shoot everyone in the foot at your request (even if those functions
are inlined). This can be an advantage, but it's low level enough to
be dangerous. It's possible for instance to access a byte vector
using byte-32 or byte-64 access with SAFETY 0, but it becomes your
responsibility to ensure alignment and avoid potential conflicts in
relation to the fill-pointer and dimension. Doing this is also
obviously very implementation-dependent. I tested the following,
which works on ECL but fails with SBCL (even leaving aside the inline
C, which is obviously ECL-specific):

    ;;; LDB didn't optimize well here, and the chain of THE FIXNUM and
    ;;; LOGAND/ASH calls was tedious
    (defun byteorder-bswap16 (word)
      (declare (optimize (speed 3) (safety 0) (debug 0))
               (type (unsigned-byte 16) word))
      (the (unsigned-byte 16)
           #+:little-endian
           (ffi:c-inline (word) (:uint16-t) :uint16-t "
        uint16_t w = #0;
        @(return) = ((w & 0xff00) >> 8 | (w & 0x00ff) << 8);
        "
                         :one-liner nil
                         :side-effects nil)
           #-:little-endian
           word))

    (declaim (inline get-byte8))
    (defun get-byte8 (vector offset)
      (declare (optimize (speed 3) (safety 0) (debug 0))
               (type (vector (unsigned-byte 8) *) vector)
               (type fixnum offset))
      (the (unsigned-byte 8) (aref vector offset)))

    (declaim (inline get-byte16))
    (defun get-byte16 (vector offset)
      (declare (optimize (speed 3) (safety 0) (debug 0))
               (type (vector (unsigned-byte 16) *) vector)
               (type fixnum offset))
      (the (unsigned-byte 16) (byteorder-bswap16 (aref vector offset))))

The same byte vector is then supplied to both functions. If a DECLAIM
existed for that common vector, a warning about the NIL type would be
issued, and inline 64-bit access to these vectors would be generated.
If the vector isn't 16-bit aligned (or 64-bit aligned in the case of
the warning), this might cause a SIGBUS on some architectures. Thus
the reminder: if you need this kind of low-level optimization, it's
best to also inspect the resulting C, and to only do it where
necessary.
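For comparison, when inspecting the generated C it can help to keep in
mind what an alignment-safe 16-bit read looks like. The following is
a plain C sketch of that, not what ECL emits; load_be16 and
load_host16 are hypothetical names, and bswap16 simply repeats the
expression used in the C-INLINE body.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same swap expression as in the C-INLINE body. */
uint16_t bswap16(uint16_t w)
{
    return (uint16_t)((w & 0xff00) >> 8 | (w & 0x00ff) << 8);
}

/* Read a big-endian 16-bit value at any byte offset.  It is assembled
 * from two byte loads, so it cannot fault on alignment and does not
 * depend on the host byte order. */
uint16_t load_be16(const uint8_t *buf, size_t len, size_t offset)
{
    assert(len >= 2 && offset <= len - 2);
    return (uint16_t)((uint16_t)buf[offset] << 8 | buf[offset + 1]);
}

/* Read a host-order 16-bit value at any byte offset.  memcpy() makes
 * no alignment assumption, unlike dereferencing a casted pointer. */
uint16_t load_host16(const uint8_t *buf, size_t len, size_t offset)
{
    uint16_t w;

    assert(len >= 2 && offset <= len - 2);
    memcpy(&w, buf + offset, sizeof w);
    return w;
}
```

A direct `*(uint16_t *)(buf + offset)` at an odd offset is the kind of
access that may raise SIGBUS on strict-alignment architectures.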
- For several reasons, when using the C compiler, it's useful to call
functions that are likely to change a lot and be recompiled (or
re-evaluated and interpreted) through FUNCALL/APPLY with a function
symbol, rather than through direct calls or direct function
references, i.e. (FUNCALL 'FOO) rather than (FUNCALL #'FOO) or (FOO).
Otherwise, when debugging code or testing freshly written
modifications, newly introduced bugs or fixes might not become
immediately visible without recompiling the dependent code (much like
redefining a structure versus a CLOS class). This applies to inline
functions, but also to large code blocks or whole-file compilation,
which might compile to direct function calls. Another possibility is
to use the interpreter at that stage. For the same reasons, using
classes for frequently changed structures is slower at runtime than
using structures, but better for interactive development and
consistent results.
- Occasionally SLIME might crash as if the Lisp image itself had
crashed but, once restarted with the SLIME command, it detects a
running ECL and asks if we want a new image. Answering no often
resumes properly. This does not mean that the image is corrupted, but
that SWANK or SLIME bugs still exist, or that the output produced by
a REPL-entered form could not be handled. I noticed that improperly
defined PRINT-OBJECT methods can also be dangerous for stability, and
that it's often simpler to use a custom method or function instead,
at least in the initial development stages. There are many
requirements for proper integrated printing, and after all that's why
PRINT-UNREADABLE-OBJECT is quite helpful. When SLIME can't handle a
situation, using the REPL directly with the embedded debugger is
still useful.
That's it for now :)
--
Matt