"Got signal before environment was installed on our thread"

Fri Sep 22 11:31:06 UTC 2017

On Thu, Sep 21, 2017 at 2:23 PM, Fabrizio Fabbri <strabixbox at yahoo.com> wrote:
>
>
> On Sep 21, 2017, at 8:31 AM, Dima Pasechnik <dimpase+ecl at gmail.com> wrote:
>
>
>
> On Tue, Sep 12, 2017 at 1:18 AM, Fabrizio Fabbri <strabixbox at yahoo.com>
> wrote:
>>
>>> On Sep 11, 2017, at 7:13 PM, Dima Pasechnik <dimpase+ecl at gmail.com>
>>> wrote:
>>>
>>>> On Mon, Sep 4, 2017 at 11:15 AM, Daniel Kochmański
>>>> <daniel at turtleware.eu> wrote:
>>>> From the backtrace it is certain that the failure occurs inside the call to
>>>> GC_init. Such errors are known to have happened when another GC was
>>>> initialized already on the system (I've linked the issue). It might be
>>>> caused by something else in bdwgc, I don't know. Either way I'd focus on
>>>> GC_init part.
>>>
>>> Our project (sagemath) only uses libgc within the embedded ECL. Thus I
>>> am really puzzled how another libgc instance might kick in and spoil
>>> the game for ECL.
>>>
>>> One possibility is that clang is using libgc, and thus, in principle,
>>> libgc might be sitting somewhere in the runtime?!
>>>
>>>
>>>>
>>>> To make sure that I'm right with my assertion, you may put a printf before
>>>> and after the call to GC_init. I'm not familiar enough with bdwgc internals
>>>> to say what is wrong, though. Maybe updating the bundled GC sources will
>>>> help? Or linking with the libgc on the system? It might be a bug in bdwgc
>>>> which has already been fixed.
>>>
>>> We are not using the bdwgc shipped with ECL, we use a separate libgc
>>> 7.6.0, which is the latest stable.
>>> (Is there a reason to ship bdwgc sources with ECL - do you patch it?)
>>>
>>
>> I'm using ECL with the non-embedded bdwgc as well, and I don't have any issues.
>>
>> Make sure that bdwgc is not also built statically into ECL. I would expect
>> linking problems in that case, but it is worth double-checking.
>
> Here is part of a stack trace from the debugger, in a scenario where
> a call to embedded ECL from Python leads to an ECL stack overflow, on
> an already initialised ECL; it seems to be related to the particular thread
> this call comes from (another, usual, calling sequence
> does not lead to crashes). There is no mention of GC in the stack trace.
>
>
> If the current thread was created outside the Lisp environment, you need to
> import it before calling any ECL function.
> That is done with
> ecl_import_current_thread
> ecl_release_current_thread
>
Thanks for pointing this out - it's new to me!
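
As a rough illustration of why a thread created outside the runtime has to be imported first, here is a toy model in plain C with pthreads. It is only a sketch under stated assumptions: every toy_* name is hypothetical and nothing below is ECL code. It assumes the runtime keeps a per-thread environment in thread-specific data, which is what the pthread_getspecific frames in the backtraces further down suggest.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread state; NOT ECL's real environment structure. */
typedef struct { int calls; } toy_env;

static pthread_key_t toy_env_key;
static pthread_once_t toy_key_once = PTHREAD_ONCE_INIT;

static void toy_make_key(void) {
    int rc = pthread_key_create(&toy_env_key, NULL);
    assert(rc == 0);
    (void)rc;
}

/* Analogue of importing a foreign thread: install an environment for it.
 * Returns 1 on success, 0 if already imported or on failure. */
int toy_import_current_thread(void) {
    pthread_once(&toy_key_once, toy_make_key);
    if (pthread_getspecific(toy_env_key) != NULL) return 0;
    toy_env *env = calloc(1, sizeof *env);
    if (env == NULL) return 0;
    if (pthread_setspecific(toy_env_key, env) != 0) { free(env); return 0; }
    return 1;
}

/* Analogue of releasing the thread: remove and free its environment. */
void toy_release_current_thread(void) {
    pthread_once(&toy_key_once, toy_make_key);
    toy_env *env = pthread_getspecific(toy_env_key);
    pthread_setspecific(toy_env_key, NULL);
    free(env);
}

/* Any "runtime call": returns -1 if this thread has no environment,
 * otherwise the number of calls made on this thread so far. */
int toy_runtime_call(void) {
    pthread_once(&toy_key_once, toy_make_key);
    toy_env *env = pthread_getspecific(toy_env_key);
    if (env == NULL) return -1;
    return ++env->calls;
}

/* Thread body: call once without importing, then import, call, release. */
static void *toy_worker(void *arg) {
    int *ok = arg;
    int before = toy_runtime_call();   /* no environment installed yet */
    toy_import_current_thread();
    int after = toy_runtime_call();    /* first call with an environment */
    toy_release_current_thread();
    *ok = (before == -1 && after == 1);
    return NULL;
}

/* Returns 1 if the worker saw a failure before import and success after. */
int toy_demo(void) {
    pthread_t t;
    int ok = 0;
    if (pthread_create(&t, NULL, toy_worker, &ok) != 0) return 0;
    pthread_join(t, NULL);
    return ok;
}
```

In this model a call from a non-imported thread fails cleanly with -1; a real runtime may instead crash or depend on undefined behaviour, as discussed below.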

> You can see an example here:
> https://gitlab.com/embeddable-common-lisp/ecl/tree/develop/examples/threads/import
>
> Maybe you already do that, but it is worth mentioning.

No, we have not done that before, and everything worked on Linux and
OSX, and even on Cygwin (that is to say, we were lucky with the thread
implementations on these platforms, relying on some sort of
undefined behaviour). Now I am trying
ecl_import_current_thread/ecl_release_current_thread on FreeBSD, and
it certainly appears to be the right direction, but I have a couple of
questions, at least one of them related to signal handling.

0) Any advice on which values the signal options should be set to,
namely ECL_OPT_SIGNAL_HANDLING_THREAD and
ECL_OPT_THREAD_INTERRUPT_SIGNAL? They seem to affect
the setup quite a bit; after some trial and error, setting the
former to 1 and the latter to 67 (probably an OS-specific value) seems
to have done the trick...

1) As ECL must be built with --enable-threads, does this mean
that it will also try to spawn threads on its own?
(So far we have always used --disable-threads; for debugging purposes
I'd rather not let ECL run its own threads.)

[I'd say this is a documentation issue, too, as it's not clear what
exactly --enable-threads does: enabling ECL's own threads,
enabling embedding of ECL in a multithreaded program, or both?]

2) For some reason, calling ecl_release_current_thread()
leads to a nasty crash, with lines like

    frame #299974: 0x0000000883a52463 libecl.so.16.1`FElibc_error(msg="", narg=0) at error.d:490
    frame #299975: 0x0000000883ab3e2c libecl.so.16.1`ecl_process_env at process.d:70
    frame #299976: 0x0000000883aba9d4 libecl.so.16.1`ecl_alloc_compact_object(t=t_base_string, extra_space=12) at alloc_2.d:622
    frame #299977: 0x0000000883a8c782 libecl.so.16.1`ecl_alloc_simple_vector(l=11, aet=ecl_aet_bc) at array.d:585
    frame #299978: 0x0000000883a5331d libecl.so.16.1`make_base_string_copy(s="No error: 0") at string.d:136
    frame #299979: 0x0000000883a52320 libecl.so.16.1`_ecl_strerror(code=0) at error.d:475
    frame #299980: 0x0000000883a52463

repeating endlessly in the backtrace.
Must it be called at all?
(The test program in the examples you pointed at does work for me, with
a few makefile changes...)

3) How does one call cl_boot() in such a multithreaded setting? I
tried merely putting the call to

   ecl_import_current_thread()

before the call to

   cl_boot()

but I get an error from GC:

"Threads explicit registering is not previously enabled"
and the program aborts.
(Without calling ecl_import_current_thread(), cl_boot() succeeds in the "main"
thread, but dumps core if invoked from another thread; this is the behaviour
you mistook for another instance of GC kicking in.)

While we probably can live with cl_boot() always being called in the
main thread, this would be an extra burden to implement...

4) GC_THREADS is #define'd both in ECL and in GC headers.
This seems wrong to me.
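
Read as a cycle, the repeating frames in question 2 suggest the following: the environment lookup fails, the error path tries to allocate a message string, and that allocation looks up the environment again. The sketch below is a toy model of such a cycle in plain C, together with a generic re-entrancy guard that would cut it short. All toy_* names are hypothetical, nothing here is ECL code, and the guard is only an assumption about how such a loop could be broken, not a description of any actual fix.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-thread state; NULL models a thread that was released. */
typedef struct { int unused; } toy_env;

static toy_env *toy_current_env = NULL;
static int toy_lookups = 0;   /* how many times the lookup ran */
static int toy_in_error = 0;  /* set while an error is being reported */

/* Stand-in for running out of stack, so the model terminates. */
enum { TOY_DEPTH_LIMIT = 1000 };

static int toy_get_env(int guarded);

/* Reporting an error "allocates" a message, which needs the environment. */
static int toy_report_error(int guarded) {
    if (guarded && toy_in_error) return -1; /* already reporting: bail out */
    toy_in_error = 1;
    int rc = toy_get_env(guarded);          /* allocation looks up the env */
    toy_in_error = 0;
    return rc;
}

/* Environment lookup: a missing environment is reported as an error. */
static int toy_get_env(int guarded) {
    toy_lookups++;
    if (toy_current_env != NULL) return 0;
    if (toy_lookups >= TOY_DEPTH_LIMIT) return -2; /* "stack exhausted" */
    return toy_report_error(guarded);
}

/* Runs one lookup on a released thread; returns how many lookups happened. */
int toy_lookups_until_stop(int guarded) {
    toy_lookups = 0;
    toy_in_error = 0;
    toy_get_env(guarded);
    return toy_lookups;
}
```

Without the guard the model only stops at the artificial depth limit, which mirrors the hundreds of thousands of frames in the backtrace; with the guard it stops on the second lookup.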

Thanks,
Dima

>
> Best
> F.
>
> This looks to me like a lack of thread safety on the ECL side, although I
> might be wrong.
> ...
> frame #16: 0x000000088444b9d6 libecl.so.16.1`si_serror(narg=6, cformat=0x0000000000d27ba0, eformat=0x00000008847d12a0) at error.d:549
> frame #17: 0x000000088448bd42 libecl.so.16.1`ecl_cs_overflow at stacks.d:76
> frame #18: 0x00000008844168af libecl.so.16.1`ecl_interpret(frame=0x00007fffdeff2658, env=0x0000000000000001, bytecodes=0x0000000000db33c0) at interpreter.d:286
> frame #19: 0x0000000884414afc libecl.so.16.1`ecl_apply_from_stack_frame(frame=0x00007fffdeff2658, x=0x0000000000db33c0) at eval.d:79
> frame #20: 0x000000088441545b libecl.so.16.1`cl_apply(narg=0, fun=0x0000000000db33c0, lastarg=0x0000000000000001) at eval.d:164
> frame #21: 0x0000000883e0e1b4 ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_safe_funcall(__pyx_v_func=0x0000000000769600, __pyx_v_arg=0x0000000000e6dfa0) at ecl.c:5831
> frame #22: 0x0000000883e0d519 ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_safe_read_string(__pyx_v_s="(setf *load-verbose* NIL)") at ecl.c:6084
> frame #23: 0x0000000883e0d02b ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_eval(__pyx_v_s=0x0000000882add970, __pyx_skip_dispatch=0) at ecl.c:10682
> frame #24: 0x0000000883e0cd4c ecl.so`__pyx_pf_4sage_4libs_3ecl_10ecl_eval(__pyx_self=0x0000000000000000, __pyx_v_s=0x0000000882add970) at ecl.c:10762
> frame #25: 0x0000000883e0cab7 ecl.so`__pyx_pw_4sage_4libs_3ecl_11ecl_eval(__pyx_self=0x0000000000000000, __pyx_v_s=0x0000000882add970) at ecl.c:10745
> frame #26: 0x0000000800d8a68f libpython2.7.so.1`call_function(pp_stack=0x00007fffdeff2c00, oparg=1) at ceval.c:4340
> frame #27: 0x0000000800d854d2 libpython2.7.so.1`PyEval_EvalFrameEx(f=0x00000008829939b0, throwflag=0) at ceval.c:2989
> ...
> frame #91: 0x0000000800d88361 libpython2.7.so.1`PyEval_CallObjectWithKeywords(func=0x000000087cdf99e0, arg=0x000000080064e060, kw=0x0000000000000000) at ceval.c:4221
> frame #92: 0x0000000800de60d1 libpython2.7.so.1`t_bootstrap(boot_raw=0x0000000807015598) at threadmodule.c:620
> frame #93: 0x00000008012d3b55 libthr.so.3`___lldb_unnamed_symbol1$$libthr.so.3 + 325
>
>
>
>>
>>> Thanks,
>>> Dima
>>>
>>>>
>>>> Regards,
>>>>
>>>> Daniel
>>>>
>>>>
>>>>
>>>>> On 04.09.2017 12:04, Dima Pasechnik wrote:
>>>>>
>>>>> On Fri, Sep 1, 2017 at 1:57 PM, Daniel Kochmański
>>>>> <daniel at turtleware.eu>
>>>>> wrote:
>>>>>>
>>>>>> I don't think it's related to shared vs. static; rather, two GCs running
>>>>>> concurrently. Try commenting out the GC_init call in ECL and see what
>>>>>> happens.
>>>>>
>>>>> I don't understand how two GCs can run concurrently on a memory region
>>>>> controlled by ECL which is statically linked to GC...
>>>>> In fact I am pretty sure no other instances of GC are running anywhere
>>>>> within our process tree.
>>>>>
>>>>> By the way, I don't know whether it's obvious from the backtrace that
>>>>> cl_boot() has been completed, or not.
>>>>>
>>>>> If it actually was completed, could it be a bug that invalidates the
>>>>> bit indicating that cl_boot() has been done?
>>>>>
>>>>> We have seen similar troubles with clang recently, related to FPE.
>>>>> There an FPE bit was flipped by assignment of a double to an
>>>>> integer type (sic!).
>>>>> It took us a lot of head banging on various hard surfaces to debug
>>>>> this:
>>>>> https://trac.sagemath.org/ticket/22799
>>>>> it turned out we did hit a known bug:
>>>>> https://bugs.llvm.org//show_bug.cgi?id=17686
>>>>>
>>>>>> Do you need SIGCHLD for anything? run-program was rewritten, and SIGCHLD
>>>>>> handling wasn't a viable option for it anymore.
>>>>>>
>>>>> We do set ECL_OPT_TRAP_SIGCHLD to 0, thus I presume we
>>>>> can now simply skip it altogether.
>>>>>
>>>>> Thanks,
>>>>> Dima
>>>>>
>>>>>> I'm on my phone; I will be available after the weekend.
>>>>>>
>>>>>> Regards, D.
>>>>>>
>>>>>>
>>>>>> Dnia 1 września 2017 14:47:57 CEST, Dima Pasechnik
>>>>>> <dimpase+ecl at gmail.com>
>>>>>> napisał(a):
>>>>>>>
>>>>>>> Hi Daniel,
>>>>>>> Thanks for the message. The scenario you talk about only happens if GC
>>>>>>> is a shared library, right?
>>>>>>>
>>>>>>> I've rebuilt GC with shared libs disabled, and ECL statically linked to GC.
>>>>>>> And I still get very similar segfaults:
>>>>>>>
>>>>>>> ;;; ECL C Backtrace
>>>>>>> ;;; 0 ecl_internal_error (0x87d79b375)
>>>>>>> ;;; 1 init_unixint (0x87d7c17e0)
>>>>>>> ;;; 2 init_unixint (0x87d7c1582)
>>>>>>> ;;; 3 pthread_sigmask (0x80103779d)
>>>>>>> ;;; 4 pthread_getspecific (0x801036d6f)
>>>>>>> ;;; 5 unknown (0x7ffffffff193)
>>>>>>> ;;; 6 GC_push_current_stack (0x87d7ef7c3)
>>>>>>> ;;; 7 GC_with_callee_saves_pushed (0x87d7f7360)
>>>>>>> ;;; 8 GC_push_roots (0x87d7ef9c2)
>>>>>>> ;;; 9 GC_mark_some (0x87d7ec97c)
>>>>>>> ;;; 10 GC_stopped_mark (0x87d7e6b7a)
>>>>>>> ;;; 11 GC_try_to_collect_inner (0x87d7e6a75)
>>>>>>> ;;; 12 GC_init (0x87d7f08ea)
>>>>>>> ;;; 13 init_alloc (0x87d7d5669)
>>>>>>> ;;; 14 cl_boot (0x87d69f66b)
>>>>>>> ...
>>>>>>>
>>>>>>> And a very similar picture on the develop branch of ECL - although
>>>>>>> I had to change our code, as in particular
>>>>>>> ECL_OPT_TRAP_SIGCHLD is gone...
>>>>>>>
>>>>>>> So, what can it be? Some signals issue?
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Dima
>>>>>>>
>>>>>>> On Fri, Sep 1, 2017 at 7:38 AM, Daniel Kochmański
>>>>>>> <daniel at turtleware.eu>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Hey Dima,
>>>>>>>>
>>>>>>>> this looks like the issue with having GC initialized before ECL kicks in.
>>>>>>>> See https://gitlab.com/embeddable-common-lisp/ecl/issues/371 for a
>>>>>>>> discussion about this problem. Basically, some other component has already
>>>>>>>> called GC_init, and ECL calls it once more. It's arguably not a bug.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>>
>>>>>>>> Daniel
>>>>>>>>
>>>>>>>>
>>>>>>>>> On 31.08.2017 15:29, Dima Pasechnik wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Dear all,
>>>>>>>>>
>>>>>>>>> I'm struggling to understand strange segfaults coming from
>>>>>>>>> ECL(+Maxima) on FreeBSD embedded into Python; they typically look as
>>>>>>>>> follows:
>>>>>>>>>
>>>>>>>>> Got signal before environment was installed on our thread
>>>>>>>>> [2: No such file or directory]
>>>>>>>>>
>>>>>>>>> ;;; ECL C Backtrace
>>>>>>>>> ;;; 0 ecl_internal_error (0x87d790765)
>>>>>>>>> ;;; 1 init_unixint (0x87d7b6bd0)
>>>>>>>>> ;;; 2 init_unixint (0x87d7b6972)
>>>>>>>>> ;;; 3 pthread_sigmask (0x80103779d)
>>>>>>>>> ;;; 4 pthread_getspecific (0x801036d6f)
>>>>>>>>> ;;; 5 unknown (0x7ffffffff193)
>>>>>>>>> ;;; 6 GC_push_all_stacks (0x87db1ea2c)
>>>>>>>>> ;;; 7 GC_mark_some (0x87db12eec)
>>>>>>>>> ;;; 8 GC_stopped_mark (0x87db09baa)
>>>>>>>>> ;;; 9 GC_try_to_collect_inner (0x87db09a75)
>>>>>>>>> ;;; 10 GC_init (0x87db16f4f)
>>>>>>>>> ;;; 11 init_alloc (0x87d7caa59)
>>>>>>>>> ;;; 12 cl_boot (0x87d694a5b)
>>>>>>>>> ;;; 13 initecl (0x87d218340)
>>>>>>>>> ;;; 14 initecl (0x87d20a43f)
>>>>>>>>> ;;; 15 initecl (0x87d207e28)
>>>>>>>>> ;;; 16 _PyImport_LoadDynamicModule (0x800b3ed1c)
>>>>>>>>> ;;; 17 PyImport_AppendInittab (0x800b3d71f)
>>>>>>>>> ;;; 18 PyImport_AppendInittab (0x800b3d1a8)
>>>>>>>>> ;;; 19 PyImport_ImportModuleLevel (0x800b3c2ce)
>>>>>>>>> ;;; 20 _PyBuiltin_Init (0x800b162d7)
>>>>>>>>> ;;; 21 PyObject_Call (0x800a7d3e3)
>>>>>>>>> ;;; 22 PyEval_EvalFrameEx (0x800b2121c)
>>>>>>>>> ;;; 23 PyEval_EvalCodeEx (0x800b1b5d4)
>>>>>>>>> ;;; 24 PyEval_EvalCode (0x800b1ad96)
>>>>>>>>> ;;; 25 PyImport_ExecCodeModuleEx (0x800b3ad11)
>>>>>>>>> ;;; 26 PyImport_AppendInittab (0x800b3ddb8)
>>>>>>>>> ;;; 27 PyImport_AppendInittab (0x800b3d71f)
>>>>>>>>> ;;; 28 PyImport_AppendInittab (0x800b3d1a8)
>>>>>>>>> ;;; 29 PyImport_ImportModuleLevel (0x800b3c2ce)
>>>>>>>>> ;;; 30 _PyBuiltin_Init (0x800b162d7)
>>>>>>>>> ;;; 31 PyEval_EvalFrameEx (0x800b22dd1)
>>>>>>>>> Segmentation fault (core dumped)
>>>>>>>>>
>>>>>>>>> It looks as if ECL (version 16.1.2) is being called before
>>>>>>>>> initialisation is complete, but is it possible to say more without a
>>>>>>>>> debugger?
>>>>>>>>>
>>>>>>>>> More details: this is on FreeBSD 11.0, clang 3.8.0, GC version 7.6.0
>>>>>>>>> with libatomic_ops version 7.4.6.
>>>>>>>>> And only reproducible on FreeBSD.
>>>>>>>>>
>>>>>>>>> ECL is built with --disable-threads; GC is built with or without
>>>>>>>>> threads---result is still the same.
>>>>>>>>> (so it's unclear to me where pthread_* calls in the trace
>>>>>>>>> come from).
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> Dima
>>>>>>>>>
>>>>>>>>> PS. the segfault is at the bottom of
>>>>>>>>> https://trac.sagemath.org/ticket/22679#comment:87
>