[Ecls-list] Possible unwinding issues?

Sat Aug 29 20:04:13 UTC 2009

It is not yet entirely clear to me what causes this, but I often see ECL
loop endlessly after receiving a signal (including SIGTERM and,
occasionally, SIGSEGV; I have not yet found the cause of the crashes
that generate SIGSEGV).

Also, a very simple thread creation and killing test succeeds
without any apparent problem, yet in another small program the thread
being killed ends up in an endless loop.  I noticed that when a
thread exits, unless it's the main thread, ecl_unwind() is called,
so I've started wondering whether there might be a bug in the unwinding
code.

At the ECL REPL (not SLIME's, which hides most of what goes on inside), I
was also able to produce an interesting loop that ran until the stack
was full:

stdin"> signaled an error.
Explanation: Interrupted system call.
Broken at SI:BYTECODES.No restarts available.
Broken at SI:BYTECODES.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Read or write operation to stream #<input stream "stdin"> signaled an error.
Explanation: Interrupted system call.
Broken at SI:BYTECODES.No restarts available.
Broken at SI:BYTECODES.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Read or write operation to stream #<input stream "stdin"> signaled an error.
Explanation: Interrupted system call.
Broken at SI:BYTECODES.No restarts available.
Broken at SI:BYTECODES.
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Read or write operation to stream #<input stream "stdin"> signaled an error.
[...]

This, however, was with 8.12.0, which appears to catch SIGINT more gracefully: the loop only started after I used the 1. CONTINUE restart and then sent another SIGINT via ^C.

With CVS HEAD, the endless loop triggers immediately at the first SIGINT (here's a ktrace/kdump result):

[...]
 28996      1 ecl      CALL  read(0,0xbb703000,0x1000)
 28996      1 ecl      RET   read -1 errno 4 Interrupted system call
 28996      1 ecl      PSIG  SIGINT caught handler=0xbbb464c0 mask=(): code=SI_NOINFO
 28996      1 ecl      CALL  mprotect(0xbb8f9000,0x1c0,1)
 28996      1 ecl      RET   mprotect 0
 28996      1 ecl      CALL  setcontext(0xbfbfdd84)
 28996      1 ecl      RET   write JUSTRETURN
 28996      1 ecl      PSIG  SIGSEGV caught handler=0xbbb46610 mask=(11): code=SEGV_ACCERR, addr=0xbb8f9000, trap=6)
 28996      1 ecl      CALL  issetugid
 28996      1 ecl      RET   issetugid 0
 28996      1 ecl      CALL  issetugid
 28996      1 ecl      RET   issetugid 0
 28996      1 ecl      PSIG  SIGTERM SIG_DFL: code=SI_USER sent by pid=558, uid=0)

So the SIGINT handler is called, and soon afterwards a SIGSEGV (access
error) occurs at the very address that was just passed to mprotect().
An endless loop without any syscall then ensues, until I kill the
process with SIGTERM, at which point it exits immediately.
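
The mprotect()/SIGSEGV pair in the trace can be reproduced in isolation.
The sketch below is not ECL code and makes no claim about how ECL's
handlers are written; on_segv() and write_to_protected_page_faults()
are hypothetical names. It makes a page read-only, writes to it, and
leaves the SIGSEGV handler with siglongjmp(), which is a plain C
stand-in for a handler that unwinds instead of returning. It assumes a
system that reports the fault as SIGSEGV with code SEGV_ACCERR, as the
trace above does.

```c
#define _DEFAULT_SOURCE   /* glibc: expose MAP_ANONYMOUS under strict -std= */
#include <setjmp.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf recover_point;
static void *volatile fault_addr;
static volatile sig_atomic_t fault_code;

/* Record where and why the fault happened, then unwind to the
 * sigsetjmp() point instead of returning to the faulting store. */
static void on_segv(int sig, siginfo_t *info, void *ctx)
{
    (void)sig;
    (void)ctx;
    fault_addr = info->si_addr;
    fault_code = info->si_code;
    siglongjmp(recover_point, 1);
}

/* Returns 1 if writing to a read-only page raised SIGSEGV with
 * SEGV_ACCERR at that page, 0 if not, -1 if the setup failed. */
int write_to_protected_page_faults(void)
{
    struct sigaction sa, old;
    long page = sysconf(_SC_PAGESIZE);
    char *mem;
    volatile char *target;
    int result = 0;

    mem = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;
    if (mprotect(mem, (size_t)page, PROT_READ) != 0) {
        munmap(mem, (size_t)page);
        return -1;
    }

    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, &old);

    target = mem;
    /* savemask = 1: siglongjmp() also restores the signal mask, so
     * SIGSEGV does not stay blocked after leaving the handler. */
    if (sigsetjmp(recover_point, 1) == 0) {
        target[0] = 'x';               /* store to a read-only page */
    } else {
        result = (fault_code == SEGV_ACCERR && fault_addr == (void *)mem);
    }

    sigaction(SIGSEGV, &old, NULL);
    munmap(mem, (size_t)page);
    return result;
}
```

Here the handler is entered once and control resumes after the
sigsetjmp() call; in the trace above, by contrast, nothing further is
logged after the SIGSEGV is caught.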

The first time I noticed SIGSEGV followed by an endless loop was when
ecl-min compiled with __thread was crashing, so the endless loop might
well be a hallmark of a problem somewhere in the SIGSEGV handler
(jump_to_sigsegv_handler() does call ecl_unwind() as well).

I'll have to try looking more closely at this with gdb on a debug
build, but was wondering if other ECL users are also seeing similar
symptoms.

Thanks,
-- 
Matt