[Ecls-list] ecl with old libc: deadlock in gc due to signal handling

Anton Vodonosov avodonosov at yandex.ru
Wed Jan 19 18:21:19 UTC 2011


Hello.

I am building ECL against glibc-2.2.5. With that old glibc version 
a deadlock occurs every time garbage collection starts.

I found out the mechanics of how it happens.

I am not sure whether you want to fix it, because the libc version
is old, but maybe you can advise how I can work around it.

How it happens. Two parts are involved:

1. The Boehm-Demers-Weiser GC tries to stop all the threads before 
   performing garbage collection (this is called "stop world"). 
   This is implemented by sending a SIG_SUSPEND signal to 
   every thread. The signal handler in every thread then 
   reports "ok, I am stopped" to the thread which wants to perform 
   the garbage collection, and then waits until the GC instructs 
   it to restart.

   The "I am stopped" confirmation is sent via a
   semaphore: sem_post(&GC_suspend_ack_sem).

   The GC expects this from every thread. It performs
   sem_wait(&GC_suspend_ack_sem) once for each thread
   that was sent the SIG_SUSPEND signal.

   The corresponding code is in src/gc/pthread_stop_world.c,
   in the function GC_stop_world, which calls GC_suspend_all.
   The signal handler behavior is implemented in
   GC_suspend_handler_inner.
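
   The handshake described in part 1 can be sketched roughly as
   below. This is only an illustration, not the code from
   pthread_stop_world.c: all demo_* names are made up, SIGUSR1 and
   SIGUSR2 stand in for the GC's suspend and restart signals, and
   unnamed POSIX semaphores (as on Linux) are assumed.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>
#include <time.h>

#define DEMO_SIG_SUSPEND SIGUSR1  /* assumption: stand-in for SIG_SUSPEND */
#define DEMO_SIG_RESTART SIGUSR2  /* assumption: stand-in for the restart signal */
#define DEMO_MAX_THREADS 16

static sem_t demo_suspend_ack_sem;  /* plays the role of GC_suspend_ack_sem */
static atomic_int demo_quit;

static void demo_restart_handler(int sig) { (void)sig; }

/* Runs in every stopped thread: acknowledge, then sleep until restart. */
static void demo_suspend_handler(int sig)
{
    sigset_t wait_mask;
    int saved_errno = errno;
    (void)sig;
    sigfillset(&wait_mask);
    sigdelset(&wait_mask, DEMO_SIG_RESTART);
    sem_post(&demo_suspend_ack_sem);  /* "ok, I am stopped" */
    sigsuspend(&wait_mask);           /* wait for the restart signal */
    errno = saved_errno;
}

/* An ordinary worker thread that idles until told to quit. */
static void *demo_mutator(void *arg)
{
    struct timespec nap = {0, 1000000};  /* 1 ms */
    (void)arg;
    while (!atomic_load(&demo_quit))
        nanosleep(&nap, NULL);
    return NULL;
}

/* Stops nthreads workers, restarts them, returns the number of acks. */
int demo_stop_and_restart_world(int nthreads)
{
    pthread_t threads[DEMO_MAX_THREADS];
    struct sigaction sa;
    int i, acks = 0;

    if (nthreads < 1 || nthreads > DEMO_MAX_THREADS) return -1;
    sem_init(&demo_suspend_ack_sem, 0, 0);
    atomic_store(&demo_quit, 0);

    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    /* Keep the restart signal blocked inside the suspend handler so it
       can only be delivered once sigsuspend() has started waiting. */
    sigaddset(&sa.sa_mask, DEMO_SIG_RESTART);
    sa.sa_handler = demo_suspend_handler;
    sigaction(DEMO_SIG_SUSPEND, &sa, NULL);
    sa.sa_handler = demo_restart_handler;
    sigaction(DEMO_SIG_RESTART, &sa, NULL);

    for (i = 0; i < nthreads; i++)
        if (pthread_create(&threads[i], NULL, demo_mutator, NULL) != 0)
            return -1;

    for (i = 0; i < nthreads; i++)          /* "stop world" */
        pthread_kill(threads[i], DEMO_SIG_SUSPEND);
    for (i = 0; i < nthreads; i++) {        /* one sem_wait per signalled thread */
        while (sem_wait(&demo_suspend_ack_sem) == -1 && errno == EINTR) {}
        acks++;
    }
    for (i = 0; i < nthreads; i++)          /* "start world" */
        pthread_kill(threads[i], DEMO_SIG_RESTART);

    atomic_store(&demo_quit, 1);
    for (i = 0; i < nthreads; i++)
        pthread_join(threads[i], NULL);
    sem_destroy(&demo_suspend_ack_sem);
    return acks;
}
```

   If any signalled thread never runs demo_suspend_handler, the
   sem_wait loop above never finishes. That is the shape of the
   hang described in this message.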

2. ECL has a special thread which handles all the signals
   not handled by other threads.

   See its implementation in the function
   asynchronous_signal_servicing_thread, file src/c/unixint.d.

   It is an endless loop of 
      sigwait(<signals blocked in other threads>);
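
   A dedicated servicing thread of this kind can be sketched as
   follows. This is not the code from src/c/unixint.d: the demo_*
   names are hypothetical, SIGUSR1 stands for "some serviced
   signal", and SIGUSR2 is used only to end the demo loop (the
   real loop is endless).

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <unistd.h>

static sem_t demo_serviced_sem;
static int demo_serviced_count;

/* Waits synchronously for the signals that every thread has blocked. */
static void *demo_signal_servicing_thread(void *arg)
{
    const sigset_t *handled = arg;
    int signo = 0;
    for (;;) {
        if (sigwait(handled, &signo) != 0)
            continue;                  /* retry on error */
        if (signo == SIGUSR2)
            break;                     /* demo only: stop the loop */
        demo_serviced_count++;         /* "service" the signal */
        sem_post(&demo_serviced_sem);
    }
    return NULL;
}

/* Returns how many SIGUSR1 signals the servicing thread handled. */
int demo_run_signal_service(void)
{
    sigset_t handled, old;
    pthread_t tid;

    sigemptyset(&handled);
    sigaddset(&handled, SIGUSR1);
    sigaddset(&handled, SIGUSR2);
    /* Block the set here; threads created afterwards inherit the mask. */
    pthread_sigmask(SIG_BLOCK, &handled, &old);

    sem_init(&demo_serviced_sem, 0, 0);
    demo_serviced_count = 0;
    if (pthread_create(&tid, NULL, demo_signal_servicing_thread, &handled) != 0)
        return -1;

    kill(getpid(), SIGUSR1);           /* process-directed signal */
    while (sem_wait(&demo_serviced_sem) == -1 && errno == EINTR) {}
    kill(getpid(), SIGUSR2);           /* ask the demo loop to finish */

    pthread_join(tid, NULL);
    sem_destroy(&demo_serviced_sem);
    pthread_sigmask(SIG_SETMASK, &old, NULL);
    return demo_serviced_count;
}
```

   Because the signals are blocked in every thread, they are never
   delivered to a handler; only the sigwait call consumes them.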

The deadlock is caused by the difference in sigwait behavior
between the old libc and the contemporary libc.

Namely, what happens when the asynchronous_signal_servicing_thread
is waiting in sigwait(<signals blocked in other threads>),
and some signal _not_ from this set arrives? In particular, when 
the GC sends the SIG_SUSPEND signal.

The contemporary libc calls the signal handler. The old libc
doesn't call the signal handler; sigwait just blocks
the signals other than those it waits for.

As a result, with the old libc the sem_post(&GC_suspend_ack_sem) 
is never performed by the asynchronous_signal_servicing_thread, 
therefore the GC waits on the semaphore forever.
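
The difference can be checked with a small probe along these lines.
It is only a sketch: the probe_* names are made up, SIGUSR1 plays the
role of the signal outside the sigwait set, SIGUSR2 is the signal
sigwait waits for, and the 50 ms pause is a crude way to let the
thread reach sigwait. On a libc that uses SIGUSR1 or SIGUSR2
internally for threading, other signals would have to be chosen.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static sem_t probe_ready_sem;
static sem_t probe_handler_ran_sem;

/* Handler for the signal that is NOT in the sigwait set. */
static void probe_handler(int sig)
{
    (void)sig;
    sem_post(&probe_handler_ran_sem);
}

static void *probe_sigwait_thread(void *arg)
{
    sigset_t waited, unblock;
    int signo = 0;
    (void)arg;
    sigemptyset(&waited);
    sigaddset(&waited, SIGUSR2);
    sigemptyset(&unblock);
    sigaddset(&unblock, SIGUSR1);
    pthread_sigmask(SIG_UNBLOCK, &unblock, NULL);
    sem_post(&probe_ready_sem);
    /* Stay in sigwait until SIGUSR2 is received. */
    while (sigwait(&waited, &signo) != 0 || signo != SIGUSR2) {}
    return NULL;
}

/* Returns 1 if the handler ran while the thread sat in sigwait, else 0. */
int probe_handler_runs_during_sigwait(void)
{
    sigset_t block, old;
    struct sigaction sa;
    struct timespec pause = {0, 50000000};  /* 50 ms */
    struct timespec deadline;
    pthread_t tid;
    int rc;

    sem_init(&probe_ready_sem, 0, 0);
    sem_init(&probe_handler_ran_sem, 0, 0);

    memset(&sa, 0, sizeof sa);
    sigemptyset(&sa.sa_mask);
    sa.sa_handler = probe_handler;
    sigaction(SIGUSR1, &sa, NULL);

    sigemptyset(&block);
    sigaddset(&block, SIGUSR2);
    pthread_sigmask(SIG_BLOCK, &block, &old);  /* inherited by the thread */

    if (pthread_create(&tid, NULL, probe_sigwait_thread, NULL) != 0)
        return -1;
    while (sem_wait(&probe_ready_sem) == -1 && errno == EINTR) {}
    nanosleep(&pause, NULL);

    pthread_kill(tid, SIGUSR1);                /* signal outside the set */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 2;                      /* give the handler 2 seconds */
    while ((rc = sem_timedwait(&probe_handler_ran_sem, &deadline)) == -1
           && errno == EINTR) {}

    pthread_kill(tid, SIGUSR2);                /* let sigwait return */
    pthread_join(tid, NULL);
    pthread_sigmask(SIG_SETMASK, &old, NULL);
    sem_destroy(&probe_ready_sem);
    sem_destroy(&probe_handler_ran_sem);
    return rc == 0 ? 1 : 0;
}
```

If the behavior is as described above, the probe should return 1 with
a contemporary libc and 0 with the old one, where the handler is not
called while the thread is inside sigwait.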

ECL hangs the first time the GC is invoked, for example on 
(MAKE-ARRAY 3000000).

What would be the easiest way to work around this problem?

Best regards,
- Anton