[Ecls-list] ecl with old libc: deadlock in gc due to signal handling
Anton Vodonosov
avodonosov at yandex.ru
Wed Jan 19 18:21:19 UTC 2011
Hello.
I am building ECL for glibc-2.2.5. With that old glibc version
a deadlock occurs whenever garbage collection starts.
I found out the mechanics of how it happens.
Not sure if you want to fix it, because the libc version is old,
but maybe you can advise how I can work around it.
How it happens. Two parts are involved:
1. The Boehm-Demers-Weiser GC tries to stop all the threads before
performing garbage collection (it's called "stop world").
This is implemented by sending a SIG_SUSPEND signal to
every thread. The signal handler in every thread then
tells "ok, I am stopped" to the thread which wants to perform
the garbage collection, and then waits until the GC instructs
it to restart.
The "I am stopped" confirmation is sent via a
semaphore: sem_post(&GC_suspend_ack_sem).
The GC expects this from every thread. It performs
sem_wait(&GC_suspend_ack_sem) once for every thread
that was notified by the SIG_SUSPEND signal.
The corresponding code is in src/gc/pthread_stop_world.c,
in the function GC_stop_world, which calls GC_suspend_all.
The signal handler behavior is implemented in
GC_suspend_handler_inner.
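
The handshake described above can be sketched roughly as follows. This is not the code from pthread_stop_world.c, only a minimal illustration of the same pattern. All demo_* names are hypothetical, and SIGUSR1/SIGUSR2 are assumed stand-ins for the signals the GC really uses.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <sched.h>
#include <semaphore.h>
#include <signal.h>
#include <stdatomic.h>
#include <string.h>

/* Hypothetical stand-ins; the real GC chooses its own signal numbers. */
#define DEMO_SIG_SUSPEND SIGUSR1
#define DEMO_SIG_RESTART SIGUSR2
#define DEMO_MAX_THREADS 8

static sem_t demo_ack_sem;    /* plays the role of GC_suspend_ack_sem */
static sem_t demo_ready_sem;  /* workers report that they are running */
static atomic_int demo_done;  /* tells the workers to exit */

static void demo_restart_handler(int sig) { (void)sig; }

static void demo_suspend_handler(int sig)
{
    sigset_t mask;
    int saved_errno = errno;
    (void)sig;
    sem_post(&demo_ack_sem);             /* "ok, I am stopped" */
    sigfillset(&mask);
    sigdelset(&mask, DEMO_SIG_RESTART);
    sigsuspend(&mask);                   /* sleep until the restart signal */
    errno = saved_errno;
}

static void *demo_worker(void *arg)
{
    (void)arg;
    sem_post(&demo_ready_sem);
    while (!atomic_load(&demo_done))
        sched_yield();
    return NULL;
}

static void demo_sem_wait(sem_t *s)
{
    while (sem_wait(s) == -1 && errno == EINTR)
        ;                                /* retry if interrupted */
}

/* Stops and restarts n worker threads; returns the number of
   acknowledgements collected, or -1 if n is out of range. */
int demo_stop_world(int n)
{
    pthread_t tid[DEMO_MAX_THREADS];
    struct sigaction sa;
    int i, acks = 0;

    if (n < 1 || n > DEMO_MAX_THREADS)
        return -1;
    sem_init(&demo_ack_sem, 0, 0);
    sem_init(&demo_ready_sem, 0, 0);
    atomic_store(&demo_done, 0);

    /* Keep the restart signal blocked while the suspend handler runs,
       so it can only be delivered inside sigsuspend(). */
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = demo_suspend_handler;
    sigemptyset(&sa.sa_mask);
    sigaddset(&sa.sa_mask, DEMO_SIG_RESTART);
    sigaction(DEMO_SIG_SUSPEND, &sa, NULL);
    sa.sa_handler = demo_restart_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(DEMO_SIG_RESTART, &sa, NULL);

    for (i = 0; i < n; i++)
        pthread_create(&tid[i], NULL, demo_worker, NULL);
    for (i = 0; i < n; i++)
        demo_sem_wait(&demo_ready_sem);

    for (i = 0; i < n; i++)              /* "stop world" */
        pthread_kill(tid[i], DEMO_SIG_SUSPEND);
    for (i = 0; i < n; i++) {            /* one sem_wait per notified thread */
        demo_sem_wait(&demo_ack_sem);
        acks++;
    }
    for (i = 0; i < n; i++)              /* restart */
        pthread_kill(tid[i], DEMO_SIG_RESTART);

    atomic_store(&demo_done, 1);
    for (i = 0; i < n; i++)
        pthread_join(tid[i], NULL);
    sem_destroy(&demo_ack_sem);
    sem_destroy(&demo_ready_sem);
    return acks;
}
```

Calling demo_stop_world(3) should return 3 when every thread runs its handler. If a single thread never runs the handler, the second loop never finishes, which is the kind of hang described in this message.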
2. ECL has a special thread which handles all the signals
not handled by other threads.
See its implementation in the function
asynchronous_signal_servicing_thread, file src/c/unixint.d.
It is an endless loop of
sigwait(<signals blocked in other threads>);
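
For illustration, a dedicated signal servicing thread of this kind might look like the sketch below. It is not the code from unixint.d. The demo_* names are hypothetical, and the thread handles a single signal and exits instead of looping forever, so that the example terminates.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <signal.h>
#include <unistd.h>

struct demo_service {
    sigset_t set;   /* signals blocked in all threads and waited for here */
    int signo;      /* signal picked up by sigwait, or -1 on error */
};

/* The real thread is an endless loop; this one services one signal. */
static void *demo_servicing_thread(void *arg)
{
    struct demo_service *svc = arg;
    if (sigwait(&svc->set, &svc->signo) != 0)
        svc->signo = -1;
    return NULL;
}

/* Blocks `sig` in the calling thread, starts a servicing thread that
   inherits the mask, sends `sig` to the process and returns the signal
   number the servicing thread received. */
int demo_service_one_signal(int sig)
{
    struct demo_service svc;
    sigset_t old;
    pthread_t tid;

    svc.signo = 0;
    sigemptyset(&svc.set);
    sigaddset(&svc.set, sig);
    pthread_sigmask(SIG_BLOCK, &svc.set, &old);

    pthread_create(&tid, NULL, demo_servicing_thread, &svc);
    kill(getpid(), sig);        /* stays pending until sigwait takes it */
    pthread_join(tid, NULL);

    pthread_sigmask(SIG_SETMASK, &old, NULL);
    return svc.signo;
}
```

Because the signal is blocked in every thread, it is never delivered to a handler; it stays pending until the servicing thread collects it with sigwait.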
The deadlock is caused by the difference in sigwait behavior
between the old libc and the contemporary libc.
Namely, what happens when the asynchronous_signal_servicing_thread
is waiting in sigwait(<signals blocked in other threads>),
and some signal _not_ from this set arrives? In particular, when
GC sends the SIG_SUSPEND signal.
The contemporary libc calls the signal handler. The old libc
doesn't call the signal handler; sigwait just blocks
the signals other than those it waits for.
As a result, with the old libc the sem_post(&GC_suspend_ack_sem)
is not performed by the asynchronous_signal_servicing_thread,
therefore the GC waits on the semaphore forever.
ECL hangs the first time the GC is invoked, for example on
(MAKE-ARRAY 3000000).
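
The difference in sigwait behavior can be checked in isolation with something like the following sketch. The demo_* names are hypothetical, and SIGUSR1 (the waited-for signal) and SIGUSR2 (the "other" signal, in the role of SIG_SUSPEND) are assumed stand-ins. A timeout is used instead of a plain sem_wait so that the test itself does not hang on a libc with the old behavior.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <signal.h>
#include <string.h>
#include <time.h>

static sem_t demo_ready;     /* the thread is about to call sigwait */
static sem_t demo_handled;   /* posted by the handler of the other signal */

static void demo_other_handler(int sig)
{
    int saved_errno = errno;
    (void)sig;
    sem_post(&demo_handled);
    errno = saved_errno;
}

static void *demo_sigwait_thread(void *arg)
{
    sigset_t *waited = arg;
    int signo;
    sem_post(&demo_ready);
    while (sigwait(waited, &signo) != 0)
        ;                                /* retry on error */
    return NULL;
}

/* Returns 1 if the handler for a signal outside the sigwait set ran
   in the waiting thread within two seconds, 0 otherwise. */
int demo_handler_runs_during_sigwait(void)
{
    sigset_t waited, old;
    struct sigaction sa;
    struct timespec deadline, pause = { 0, 50 * 1000 * 1000 };
    pthread_t tid;
    int rc;

    sem_init(&demo_ready, 0, 0);
    sem_init(&demo_handled, 0, 0);

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = demo_other_handler;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGUSR2, &sa, NULL);

    sigemptyset(&waited);
    sigaddset(&waited, SIGUSR1);
    pthread_sigmask(SIG_BLOCK, &waited, &old);   /* inherited by the thread */
    pthread_create(&tid, NULL, demo_sigwait_thread, &waited);

    while (sem_wait(&demo_ready) == -1 && errno == EINTR)
        ;
    nanosleep(&pause, NULL);   /* heuristic: let the thread enter sigwait */

    pthread_kill(tid, SIGUSR2);                  /* signal NOT in the set */
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += 2;
    do {
        rc = sem_timedwait(&demo_handled, &deadline);
    } while (rc == -1 && errno == EINTR);

    pthread_kill(tid, SIGUSR1);                  /* release the sigwait */
    pthread_join(tid, NULL);
    pthread_sigmask(SIG_SETMASK, &old, NULL);
    sem_destroy(&demo_ready);
    sem_destroy(&demo_handled);
    return rc == 0;
}
```

On a contemporary libc I would expect this to return 1. With the old behavior described above it should return 0 after the timeout, because the handler is not called while the thread sits in sigwait.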
What would be the easiest way to work around this problem?
Best regards,
- Anton