[Ecls-list] MP stability improvement

Sat Feb 11 06:09:53 UTC 2012

On Thu, 9 Feb 2012 21:50:12 +0100
Juan Jose Garcia-Ripoll <juanjose.garciaripoll at googlemail.com> wrote:

> On Thu, Feb 9, 2012 at 11:10 AM, Matthew Mondor <mm_lists at pulsar-zone.net> wrote:
> 
> > So I again tested my simplified mutex.d implementation and
> > interestingly, stability improved this time.  So what I did is to merge
> > it along with the Windows support, which still uses the old
> > holder/counter dance.  The new POSIX implementation avoids this and
> > simply relies on the POSIX primitives as directly as possible in order
> > to avoid race conditions.  I'm not familiar enough with Windows to
> > suggest patches to its implementation, though.
> >
> 
> Hi Matthew, I would like to understand what you did and in what sense it
> fixes anything.
> 
> If you have a look at the history of ECL's mutex implementation, formerly
> we would simply use POSIX. Just call the mutex routines and that's all.
> 
> Problem: there is absolutely no way to judge from POSIX whether a call to a
> mutex routine succeeded or was interrupted. What to do with the
> unwind-protect in that situation? If ECL does not keep a record of what
> calls to pthread_mutex_lock() succeeded and which ones did not, then the
> exit from with-lock will break.
> 
> I understand that the current implementation might not be very stable and
> still have errors, but removing the additional layer does not solve the
> problems, just hides some of them and causes new ones.

Sorry for not replying earlier, I was away.

I am indeed getting occasional spurious GIVEUP-LOCK errors on one of
the locks (when a thread waiting on the mutex gets killed and its
unwinding code attempts to unlock a mutex that the thread never
locked).  In this case it didn't matter, as the mutexes are not
recursive and are set up with PTHREAD_MUTEX_ERRORCHECK, so the stray
unlock merely fails with an error.  This would indeed be different for
recursive mutexes, however.

For non-recursive mutexes with PTHREAD_MUTEX_ERRORCHECK,
pthread_mutex_lock(3) returns zero on success and non-zero on failure,
which means that we could record that the call succeeded, I guess, if
GET-LOCK can return success.  Even EINTR should get handled internally
by pthread_mutex_lock(3), which never returns it.
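
To make that concrete, here is a minimal sketch in C of recording
whether pthread_mutex_lock(3) succeeded, so that the unwinding code only
unlocks what was actually locked.  The sketch_* names and the struct are
hypothetical and are not ECL's actual mutex.d code.  It assumes a
non-recursive PTHREAD_MUTEX_ERRORCHECK mutex, and it does not close the
window between the lock call returning and the flag being stored, which
would still need interrupts disabled:

```c
/* May be needed for PTHREAD_MUTEX_ERRORCHECK on some systems. */
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical wrapper, not ECL's real lock object. */
typedef struct {
    pthread_mutex_t mutex;
} sketch_lock;

/* Create a non-recursive, error-checking mutex.
 * Returns 0 on success or an errno value. */
static int sketch_lock_init(sketch_lock *l)
{
    pthread_mutexattr_t attr;
    int rc = pthread_mutexattr_init(&attr);
    if (rc != 0)
        return rc;
    rc = pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    if (rc == 0)
        rc = pthread_mutex_init(&l->mutex, &attr);
    pthread_mutexattr_destroy(&attr);
    return rc;
}

/* GET-LOCK side: *acquired becomes true only if the call returned 0. */
static int sketch_get_lock(sketch_lock *l, bool *acquired)
{
    int rc = pthread_mutex_lock(&l->mutex);
    *acquired = (rc == 0);
    return rc;
}

/* GIVEUP-LOCK side, as called from the unwinding code: a no-op unless
 * this frame recorded a successful lock. */
static int sketch_giveup_lock(sketch_lock *l, bool *acquired)
{
    if (!*acquired)
        return 0;
    int rc = pthread_mutex_unlock(&l->mutex);
    if (rc == 0)
        *acquired = false;
    return rc;
}
```

With something like this, a killed waiter whose flag is still false
would skip the unlock during unwinding instead of signalling an error.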

I've been pondering using an internal per-lock mutex for more reliable
holder/counter checking and updating, at the cost of some efficiency.
I guess interrupts would be disabled there, and that lock would only be
held for very short periods...

Or perhaps other avenues involving libatomic or CAS, or possibly even a
condition variable based system (assuming again we can rely on an
internal mutex).
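
A rough sketch of the CAS idea, using C11 <stdatomic.h> as a stand-in
for libatomic.  All names here are hypothetical, thread ids are assumed
to be non-zero integers, and only a non-blocking try-lock is shown (how
a waiter would block is left out).  The point is that a single
compare-and-swap is the moment ownership gets recorded:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical lock record: holder is 0 when the lock is free. */
typedef struct {
    _Atomic uintptr_t holder;   /* id of the owning thread, or 0 */
    unsigned counter;           /* recursion depth, owner-only access */
} cas_lock;

static void cas_lock_init(cas_lock *l)
{
    atomic_init(&l->holder, (uintptr_t)0);
    l->counter = 0;
}

/* Try to take the lock for thread id `self` (must be non-zero). */
static bool cas_try_lock(cas_lock *l, uintptr_t self)
{
    if (atomic_load(&l->holder) == self) {
        l->counter++;           /* recursive acquisition by the owner */
        return true;
    }
    uintptr_t expected = 0;
    if (atomic_compare_exchange_strong(&l->holder, &expected, self)) {
        l->counter = 1;         /* ownership was recorded by the CAS */
        return true;
    }
    return false;               /* held by another thread */
}

/* Release one level; returns false if `self` is not the holder. */
static bool cas_unlock(cas_lock *l, uintptr_t self)
{
    if (atomic_load(&l->holder) != self)
        return false;
    if (l->counter > 1) {
        l->counter--;
    } else {
        l->counter = 0;
        atomic_store(&l->holder, (uintptr_t)0);
    }
    return true;
}
```

Since the holder field itself says who owns the lock, the unwinding
code can ask cas_unlock() and simply get false back if the thread never
got that far.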

In the case of pthread_cond_[timed]wait(3), a loop must be used to
recheck the condition, such that spurious wakeups (or concurrent
threads) don't conflict and trigger false signals...  so it provides a
rather reliable method through which synchronization events may be
communicated despite restarts (and of course it involves a mutex
internally as well)...
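
Roughly what I have in mind, as a sketch only: a lock whose
holder/counter are guarded by a short-lived internal mutex, with waiters
looping on a condition variable.  The cv_* names are hypothetical, and
the sketch neither disables interrupts nor installs cancellation cleanup
handlers, both of which a real version would need:

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical lock object built on an internal mutex + condvar. */
typedef struct {
    pthread_mutex_t guard;      /* internal mutex, held only briefly */
    pthread_cond_t  released;   /* signalled when the lock becomes free */
    bool            has_holder;
    pthread_t       holder;     /* valid only while has_holder is true */
    unsigned        counter;    /* recursion depth */
} cv_lock;

static int cv_lock_init(cv_lock *l)
{
    int rc = pthread_mutex_init(&l->guard, NULL);
    if (rc != 0)
        return rc;
    rc = pthread_cond_init(&l->released, NULL);
    if (rc != 0) {
        pthread_mutex_destroy(&l->guard);
        return rc;
    }
    l->has_holder = false;
    l->counter = 0;
    return 0;
}

static void cv_get_lock(cv_lock *l)
{
    pthread_t self = pthread_self();
    pthread_mutex_lock(&l->guard);
    /* Recheck the condition after every wakeup, spurious or not. */
    while (l->has_holder && !pthread_equal(l->holder, self))
        pthread_cond_wait(&l->released, &l->guard);
    l->holder = self;
    l->has_holder = true;
    l->counter++;
    pthread_mutex_unlock(&l->guard);
}

/* Returns false, changing nothing, if the caller is not the holder. */
static bool cv_giveup_lock(cv_lock *l)
{
    bool ok = false;
    pthread_mutex_lock(&l->guard);
    if (l->has_holder && pthread_equal(l->holder, pthread_self())) {
        if (--l->counter == 0) {
            l->has_holder = false;
            pthread_cond_signal(&l->released);
        }
        ok = true;
    }
    pthread_mutex_unlock(&l->guard);
    return ok;
}
```

Here holder and counter are only ever read or written under the
internal mutex, so a thread killed while waiting never appears as the
holder and its GIVEUP-LOCK just returns false.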
-- 
Matt