[Ecls-list] ECL stability tips

Sat Nov 30 16:04:58 UTC 2013

While working on a new project written in Lisp for ECL, I ran into
some stability issues.  After investigating them and running some
tests, things now seem much more stable.  Ideally this should
eventually be documented properly and officially, but I thought I'd
share these notes for now.

If someone can adapt some of this to actual documentation, they're
welcome.  Other than the wiki, the current documentation seems to be
DocBook, which I personally have no experience with yet.

- One issue involved unresponsive busy loops with ENOMEM errors
  encountered by ECL and libgc.

  NetBSD (among other multiuser systems) implements soft/current and
  hard/maximum limits on various resources.  These are configurable
  via login classes (login.conf), and per process via sysctl or
  setrlimit(2).  Once the soft datasize limit is reached, allocations
  fail with ENOMEM; the process may then optionally raise its soft
  limit, up to the hard limit (it seems that neither ECL nor libgc
  does this automatically, though, nor do they adapt their own limits
  to the current rlimits).

  The default ECL memory heap size used with Boehm-GC seems to be 1GB
  (it is configurable using the --heap-size option, or with the
  EXT:SET-LIMIT function).  If that size is reached, ECL signals a
  condition with the option to grow the heap.  If the OS-specific limit
  is reached before the ECL-configured limit, disastrous consequences
  can arise: ECL loops endlessly, attempting to signal a condition but
  getting ENOMEM errors while doing so.  This could possibly be worked
  around using some preallocated memory for that very situation.
  However, ensuring that the ECL heap limit is always smaller than the
  OS-set limit prevents this situation.

  Similar problems might occur with other limits such as the stack or
  file descriptors.  I suspect that reaching the fd limit is less
  critical than the stack or heap limit.  ECL's stack limit should also
  be smaller than the OS's stacksize soft limit.  These default limits
  are set in src/c/main.d.

- Another problem I recently faced had to do with Boehm-GC (libgc)
  deadlocking when it needs to collect or grow (it then goes through
  GC_collect_or_expand() -> GC_try_to_collect_inner() ->
  GC_stopped_mark() -> GC_stop_world()).

  To stop the world and perform a garbage collection run, it uses
  OS-specific code.  On several POSIX systems it uses pthread_kill() to
  interrupt every thread with a signal, causing each of them to invoke
  the GC_suspend_handler() signal handler.  Those threads then wait for
  another signal or on a lock/condvar to resume after collection.

  Unfortunately that part is complex and error-prone and, considering
  all the OS specifics, libgc may be more stable and better tested on
  some systems than others.  The ECL-supplied gc (7.1.9 if I remember)
  is slower on NetBSD than the one I had compiled from pkgsrc (7.2),
  but it took me longer to reproduce the issue with it.

  Stdio streams (FILE handles) keep internal buffer state, which is
  protected by mutexes on NetBSD when a process is linked with
  libpthread.  It appears that libgc has concurrency issues when it
  attempts to collect while stdio is heavily used.  I could not
  reproduce that issue on Linux+glibc yet, but I assume libgc is also
  better tested there.

  I wrote a minimal test to reproduce the issue, and it indeed had to
  do with threaded libgc+stdio.  I then modified my application to
  use :CSTREAM NIL when opening its output FIFO file, and the
  application uptime was noticeably better.  Two issues remained,
  however.  First, only one character at a time was written, even
  using WRITE-SEQUENCE on a LATIN-1 or PASSTHROUGH external format
  (meaning in this case several thousand write(2) syscalls per second,
  surrounded by other syscalls related to interrupt control).  Despite
  this I decided to stress test the application as it was, and it had
  a decent uptime, until the same issue happened again.  Second, I
  then noticed a spurious fflush(3) call, which might be that of
  eformat_write*().

  To solve both issues and be able to move forward with testing, I
  wrote a small WRITE-SEQUENCE replacement using C-INLINE, as I had
  done for Crow-HTTPd.  Performance dramatically improved (I use custom
  BASE-CHAR vectors as buffers and large direct write(2) to the
  descriptor), and so far it has been stable (although it's still being
  stress tested).

  To mitigate this, ECL could be made to avoid stdio and use its own
  buffering streams on top of file descriptors.  However, this would
  not solve the issue where libraries used with FFI need to be supplied
  a stdio FILE handle.  When compiled with threads, ECL currently uses
  unbuffered file descriptors for stdin/stdout/stderr, possibly partly
  for that reason, but the comment in the code also mentions blocking.

  With some care not to use stdio in the application itself as well,
  stability seems dramatically increased.

  This has not been verified yet but I suspect that the occasional
  locking issues I observe with C-c C-c during live interactive
  development might also be related to stdio usage.

- An issue I had discovered earlier, when writing Crow-HTTPd, was also
  related to libgc+threads race conditions, but at thread termination.
  It seems that the Mono runtime was also affected by that on Solaris.
  For simplicity I had set up Crow to avoid shrinking the thread pool,
  but there might have been other solutions.  Crow then became very
  stable, however.  It still runs my site and has uptimes as long as
  the server (occasionally interrupted for security software updates).

- Probably also worth mentioning is that ECL itself avoids
  synchronizing every access to potentially shared user objects, except
  where necessary, such as for packages.  This means that the user is
  responsible for providing explicit synchronization, using MP
  primitives, for concurrently accessed objects, including hash tables
  and instance objects.  This also has to be considered for interactive
  development, where the REPL might be used to alter the state of live
  objects.  Ideally a single access library should be written which
  provides the synchronization, such that both the software and the
  REPL user go through it.

- It is very important to take heed when ECL issues warnings about an
  object being of type NIL.  This occurs when optimizations are used
  and conflicting type annotations exist for a variable.  If ECL issues
  this warning for a vector and the user lowers SAFETY to 1 or below
  and raises SPEED, it might optimize access into inline C using the
  largest native machine word (64-bit on amd64), rather than the
  expected word size.  On the other hand, if no large-scope DECLAIM
  TYPE annotation exists, every function may issue a conflicting
  local-scope DECLARE TYPE annotation, and ECL will then silently let
  you shoot yourself in the foot at your request (even if those
  functions are inlined).  This can be an advantage, but it's low
  level enough to be dangerous.  It's possible for instance to access a
  byte vector using byte-32 or byte-64 access with SAFETY 0, but it
  becomes your responsibility to ensure alignment and to avoid
  potential conflicts with the fill-pointer and dimension.  Doing this
  is also obviously very implementation-dependent.  I tested the
  following, which works on ECL but fails with SBCL (even leaving the
  inline C aside):

;;; LDB didn't optimize well here, and the chain of THE FIXNUM and
;;; LOGAND/ASH calls was tedious, hence the inline C.
(defun byteorder-bswap16 (word)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type (unsigned-byte 16) word))
  (the (unsigned-byte 16)
    #+:little-endian
    (ffi:c-inline (word) (:uint16-t) :uint16-t "
        uint16_t w = #0;

        @(return) = ((w & 0xff00) >> 8 | (w & 0x00ff) << 8);
"
                  :one-liner nil
                  :side-effects nil)
    #-:little-endian
    word))

(declaim (inline get-byte8))
(defun get-byte8 (vector offset)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type (vector (unsigned-byte 8) *) vector)
           (type fixnum offset))
  (the (unsigned-byte 8) (aref vector offset)))

(declaim (inline get-byte16))
(defun get-byte16 (vector offset)
  (declare (optimize (speed 3) (safety 0) (debug 0))
           (type (vector (unsigned-byte 16) *) vector)
           (type fixnum offset))
  (the (unsigned-byte 16) (byteorder-bswap16 (aref vector offset))))

The same byte vector is then supplied to both functions.  If a DECLAIM
existed for that common vector, a warning about NIL type would be
issued, and inline 64-bit access to these vectors would be generated.
If the vector isn't 16-bit aligned (or 64-bit aligned in the case of
the warning), it might cause a SIGBUS on some architectures.  Thus the
reminder: if you need this kind of low level optimization, it's best to
also inspect the resulting C, and to only do it where necessary...

- For several reasons, when using the C compiler, it's useful to call
  certain functions through FUNCALL/APPLY with function symbols rather
  than through direct calls or direct function references, when those
  functions are likely to change a lot and be recompiled (or
  re-evaluated and interpreted), i.e. (FUNCALL 'FOO) rather than
  (FUNCALL #'FOO) or (FOO).  Otherwise, when debugging code or testing
  freshly written modifications, newly introduced bugs or fixes might
  not become visible until the dependent code is recompiled (much like
  redefining a structure versus a CLOS class).  This applies to inline
  functions, but also to large code blocks or whole-file compilation,
  which might compile to direct function calls.  Another possibility is
  to use the interpreter at that stage.  For the same reasons, using
  classes for frequently changed structures is slower at runtime, but
  better for interactive development and consistent results, than
  using structures.

- Occasionally SLIME might crash as if the Lisp image itself had
  crashed, but once restarted with the SLIME command, it detects a
  running ECL and asks if we want a new image.  Answering no often
  resumes properly.  Such a crash does not mean that the image is
  corrupted, but that SWANK or SLIME bugs still exist, or that the
  output produced by a REPL-entered form could not be handled.  I
  noticed that improperly defined PRINT-OBJECT methods can also be
  dangerous for stability, and that it's often simpler to use a custom
  method or function instead, at least at initial development stages.
  There are many requirements for proper integrated printing, and
  after all that's why PRINT-UNREADABLE-OBJECT is quite helpful...
  When SLIME can't handle a situation, using the REPL directly with the
  embedded debugger is sometimes still useful.

That's it for now :)

-- 
Matt