"Got signal before environment was installed on our thread"

Thu Sep 21 13:23:37 UTC 2017

> On Sep 21, 2017, at 8:31 AM, Dima Pasechnik <dimpase+ecl at gmail.com> wrote:
> 
> 
> 
> On Tue, Sep 12, 2017 at 1:18 AM, Fabrizio Fabbri <strabixbox at yahoo.com> wrote:
> >
> >> On Sep 11, 2017, at 7:13 PM, Dima Pasechnik <dimpase+ecl at gmail.com> wrote:
> >>
> >>> On Mon, Sep 4, 2017 at 11:15 AM, Daniel Kochmański <daniel at turtleware.eu> wrote:
> >>> From the backtrace it is clear that the failure occurs inside the call to
> >>> GC_init. Such errors are known to have happened when another GC was
> >>> initialized already on the system (I've linked the issue). It might be
> >>> caused by something else in bdwgc, I don't know. Either way I'd focus on
> >>> GC_init part.
> >>
> >> Our project (sagemath) only uses libgc within the embedded ECL. Thus I
> >> am really puzzled how another libgc instance might kick in and spoil
> >> the game for ECL.
> >>
> >> One possibility is that clang is using libgc, and thus, in principle,
> >> libgc might be sitting somewhere in the runtime?!
> >>
> >>
> >>>
> >>> To check that my assertion is right, you could put a printf before and
> >>> after the call to GC_init. I'm not familiar enough with bdwgc internals to say
> >>> what is wrong, though. Maybe updating the bundled GC sources will help? Or
> >>> linking with the system libgc? It might be a bug in bdwgc
> >>> that has already been fixed.
> >>
> >> We are not using the bdwgc shipped with ECL, we use a separate libgc
> >> 7.6.0, which is the latest stable.
> >> (Is there a reason to ship bdwgc sources with ECL - do you patch it?)
> >>
> >
> > I'm using ECL with the non-embedded bdwgc as well and I don't have any issues.
> >
> > Make sure that bdwgc is not also built statically into ECL. I would expect linking problems in that case, but it is worth double-checking.
> 
> Here is part of a stack trace from the debugger, in a scenario where
> a call to embedded ECL from Python leads to an ECL stack overflow on
> an already initialised ECL; it seems to be related to the particular
> thread this call comes from (another, usual, calling sequence
> does not lead to crashes). There is no mention of GC in the stack trace.
> 

If the current thread was created outside the Lisp environment, you need to import it before calling any ECL function, and release it when the thread is done.
That is done with
ecl_import_current_thread
ecl_release_current_thread

You can see an example here:
https://gitlab.com/embeddable-common-lisp/ecl/tree/develop/examples/threads/import
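
A minimal sketch of that import/release pattern in C, under some assumptions. The USE_ECL build flag and the helper names (run_on_foreign_thread, foreign_thread_main, count_call) are hypothetical and only for illustration. As far as I know, ecl_import_current_thread and ecl_release_current_thread are only available when ECL is built with threads enabled, and cl_boot must already have completed. Without -DUSE_ECL the ECL calls are compiled out, so only the control flow is exercised.

```c
/* Sketch only: USE_ECL and all helper names below are hypothetical.
 * Build with -DUSE_ECL against a threads-enabled ECL to make the real calls;
 * without it the ECL calls are compiled out and only the control flow runs. */
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#ifdef USE_ECL
#include <ecl/ecl.h>
#endif

/* Work to run on the foreign thread; calls into Lisp would go here. */
typedef void (*lisp_body_fn)(void *arg);

struct foreign_call {
    lisp_body_fn body;
    void *arg;
    int done;   /* set once body has run and the thread was released */
};

static void *foreign_thread_main(void *p)
{
    struct foreign_call *call = p;
    assert(call != NULL && call->body != NULL);
#ifdef USE_ECL
    /* Register this non-Lisp thread with ECL (cl_boot must already have run). */
    ecl_import_current_thread(ECL_NIL, ECL_NIL);
#endif
    call->body(call->arg);
#ifdef USE_ECL
    /* Unregister before the thread exits. */
    ecl_release_current_thread();
#endif
    call->done = 1;
    return NULL;
}

/* Runs body(arg) on a freshly created pthread; returns 0 on success, -1 on failure. */
int run_on_foreign_thread(lisp_body_fn body, void *arg)
{
    struct foreign_call call = { body, arg, 0 };
    pthread_t tid;
    if (pthread_create(&tid, NULL, foreign_thread_main, &call) != 0)
        return -1;
    if (pthread_join(tid, NULL) != 0)
        return -1;
    return call.done ? 0 : -1;
}

/* Placeholder body: counts invocations instead of calling Lisp. */
void count_call(void *arg)
{
    ++*(int *)arg;
}
```

In the scenario from the stack trace the thread is created by Python rather than by pthread_create as in this sketch, so the import/release pair would presumably have to wrap the ECL call made from that Python thread.
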

Maybe you already do that, but it is worth mentioning.

Best
F.
> This looks to me like a lack of thread safety on the ECL side, although I might be wrong.
> ...
> frame #16: 0x000000088444b9d6 libecl.so.16.1`si_serror(narg=6, cformat=0x0000000000d27ba0, eformat=0x00000008847d12a0) at error.d:549
> frame #17: 0x000000088448bd42 libecl.so.16.1`ecl_cs_overflow at stacks.d:76
> frame #18: 0x00000008844168af libecl.so.16.1`ecl_interpret(frame=0x00007fffdeff2658, env=0x0000000000000001, bytecodes=0x0000000000db33c0) at interpreter.d:286
> frame #19: 0x0000000884414afc libecl.so.16.1`ecl_apply_from_stack_frame(frame=0x00007fffdeff2658, x=0x0000000000db33c0) at eval.d:79
> frame #20: 0x000000088441545b libecl.so.16.1`cl_apply(narg=0, fun=0x0000000000db33c0, lastarg=0x0000000000000001) at eval.d:164
> frame #21: 0x0000000883e0e1b4 ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_safe_funcall(__pyx_v_func=0x0000000000769600, __pyx_v_arg=0x0000000000e6dfa0) at ecl.c:5831
> frame #22: 0x0000000883e0d519 ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_safe_read_string(__pyx_v_s="(setf *load-verbose* NIL)") at ecl.c:6084
> frame #23: 0x0000000883e0d02b ecl.so`__pyx_f_4sage_4libs_3ecl_ecl_eval(__pyx_v_s=0x0000000882add970, __pyx_skip_dispatch=0) at ecl.c:10682
> frame #24: 0x0000000883e0cd4c ecl.so`__pyx_pf_4sage_4libs_3ecl_10ecl_eval(__pyx_self=0x0000000000000000, __pyx_v_s=0x0000000882add970) at ecl.c:10762
> frame #25: 0x0000000883e0cab7 ecl.so`__pyx_pw_4sage_4libs_3ecl_11ecl_eval(__pyx_self=0x0000000000000000, __pyx_v_s=0x0000000882add970) at ecl.c:10745
> frame #26: 0x0000000800d8a68f libpython2.7.so.1`call_function(pp_stack=0x00007fffdeff2c00, oparg=1) at ceval.c:4340
> frame #27: 0x0000000800d854d2 libpython2.7.so.1`PyEval_EvalFrameEx(f=0x00000008829939b0, throwflag=0) at ceval.c:2989
> ...
> frame #91: 0x0000000800d88361 libpython2.7.so.1`PyEval_CallObjectWithKeywords(func=0x000000087cdf99e0, arg=0x000000080064e060, kw=0x0000000000000000) at ceval.c:4221
> frame #92: 0x0000000800de60d1 libpython2.7.so.1`t_bootstrap(boot_raw=0x0000000807015598) at threadmodule.c:620
> frame #93: 0x00000008012d3b55 libthr.so.3`___lldb_unnamed_symbol1$$libthr.so.3 + 325
> 
> 
> 
> >
> >> Thanks,
> >> Dima
> >>
> >>>
> >>> Regards,
> >>>
> >>> Daniel
> >>>
> >>>
> >>>
> >>>> On 04.09.2017 12:04, Dima Pasechnik wrote:
> >>>>
> >>>> On Fri, Sep 1, 2017 at 1:57 PM, Daniel Kochmański <daniel at turtleware.eu>
> >>>> wrote:
> >>>>>
> >>>>> I don't think it's related to shared vs. static linking, but rather to two GCs running
> >>>>> concurrently. Try commenting out the GC_init call in ECL and see what
> >>>>> happens.
> >>>>
> >>>> I don't understand how two GCs can run concurrently on a memory region
> >>>> controlled by ECL which is statically linked to GC...
> >>>> In fact I am pretty sure no other instances of GC are running anywhere
> >>>> within our process tree.
> >>>>
> >>>> By the way, I don't know whether it's obvious from the backtrace that
> >>>> cl_boot() has been completed, or not.
> >>>>
> >>>> If it actually was completed, could it be a bug that invalidates the
> >>>> bit indicating that cl_boot() has been done?
> >>>>
> >>>> We have seen similar troubles with clang recently, related to FPE.
> >>>> There an FPE bit was flipped by assignment of a double to an
> >>>> integer type (sic!).
> >>>> It took us a lot of head banging on various hard surfaces to debug this:
> >>>> https://trac.sagemath.org/ticket/22799
> >>>> it turned out we did hit a known bug:
> >>>> https://bugs.llvm.org//show_bug.cgi?id=17686
> >>>>
> >>>>> Do you need SIGCHLD for anything? run-program was rewritten and SIGCHLD
> >>>>> handling wasn't a viable option for it anymore.
> >>>>>
> >>>> We do set ECL_OPT_TRAP_SIGCHLD to 0, thus I presume we
> >>>> can now simply skip it altogether.
> >>>>
> >>>> Thanks,
> >>>> Dima
> >>>>
> >>>>> I'm on my phone; I will be available after the weekend.
> >>>>>
> >>>>> Regards, D.
> >>>>>
> >>>>>
> >>>>> On 1 September 2017 14:47:57 CEST, Dima Pasechnik
> >>>>> <dimpase+ecl at gmail.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> Hi Daniel,
> >>>>>> Thanks for the message. The scenario you talk about only happens if GC
> >>>>>> is a shared library, right?
> >>>>>>
> >>>>>> I've rebuilt GC disabling shared libs, and ECL doing static linking to
> >>>>>> GC.
> >>>>>> And I still get very similar segfaults:
> >>>>>>
> >>>>>> ;;; ECL C Backtrace
> >>>>>> ;;; 0 ecl_internal_error (0x87d79b375)
> >>>>>> ;;; 1 init_unixint (0x87d7c17e0)
> >>>>>> ;;; 2 init_unixint (0x87d7c1582)
> >>>>>> ;;; 3 pthread_sigmask (0x80103779d)
> >>>>>> ;;; 4 pthread_getspecific (0x801036d6f)
> >>>>>> ;;; 5 unknown (0x7ffffffff193)
> >>>>>> ;;; 6 GC_push_current_stack (0x87d7ef7c3)
> >>>>>> ;;; 7 GC_with_callee_saves_pushed (0x87d7f7360)
> >>>>>> ;;; 8 GC_push_roots (0x87d7ef9c2)
> >>>>>> ;;; 9 GC_mark_some (0x87d7ec97c)
> >>>>>> ;;; 10 GC_stopped_mark (0x87d7e6b7a)
> >>>>>> ;;; 11 GC_try_to_collect_inner (0x87d7e6a75)
> >>>>>> ;;; 12 GC_init (0x87d7f08ea)
> >>>>>> ;;; 13 init_alloc (0x87d7d5669)
> >>>>>> ;;; 14 cl_boot (0x87d69f66b)
> >>>>>> ...
> >>>>>>
> >>>>>> And a very similar picture on the develop branch of ECL - although
> >>>>>> I had to change our code, as in particular
> >>>>>> ECL_OPT_TRAP_SIGCHLD is gone...
> >>>>>>
> >>>>>> So, what can it be? Some signals issue?
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Dima
> >>>>>>
> >>>>>> On Fri, Sep 1, 2017 at 7:38 AM, Daniel Kochmański <daniel at turtleware.eu>
> >>>>>> wrote:
> >>>>>>>
> >>>>>>> Hey Dima,
> >>>>>>>
> >>>>>>> this looks like the issue with having GC initialized before ECL kicks
> >>>>>>> in.
> >>>>>>> See https://gitlab.com/embeddable-common-lisp/ecl/issues/371 for a
> >>>>>>> discussion about this problem. Basically some other component already
> >>>>>>> called
> >>>>>>> GC_init and ECL calls it once more. It's arguably not a bug.
> >>>>>>>
> >>>>>>> Best regards,
> >>>>>>>
> >>>>>>> Daniel
> >>>>>>>
> >>>>>>>
> >>>>>>>> On 31.08.2017 15:29, Dima Pasechnik wrote:
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> Dear all,
> >>>>>>>>
> >>>>>>>> I'm struggling to understand strange segfaults coming from
> >>>>>>>> ECL(+Maxima) on FreeBSD embedded into Python; they typically look as
> >>>>>>>> follows:
> >>>>>>>>
> >>>>>>>> Got signal before environment was installed on our thread
> >>>>>>>> [2: No such file or directory]
> >>>>>>>>
> >>>>>>>> ;;; ECL C Backtrace
> >>>>>>>> ;;; 0 ecl_internal_error (0x87d790765)
> >>>>>>>> ;;; 1 init_unixint (0x87d7b6bd0)
> >>>>>>>> ;;; 2 init_unixint (0x87d7b6972)
> >>>>>>>> ;;; 3 pthread_sigmask (0x80103779d)
> >>>>>>>> ;;; 4 pthread_getspecific (0x801036d6f)
> >>>>>>>> ;;; 5 unknown (0x7ffffffff193)
> >>>>>>>> ;;; 6 GC_push_all_stacks (0x87db1ea2c)
> >>>>>>>> ;;; 7 GC_mark_some (0x87db12eec)
> >>>>>>>> ;;; 8 GC_stopped_mark (0x87db09baa)
> >>>>>>>> ;;; 9 GC_try_to_collect_inner (0x87db09a75)
> >>>>>>>> ;;; 10 GC_init (0x87db16f4f)
> >>>>>>>> ;;; 11 init_alloc (0x87d7caa59)
> >>>>>>>> ;;; 12 cl_boot (0x87d694a5b)
> >>>>>>>> ;;; 13 initecl (0x87d218340)
> >>>>>>>> ;;; 14 initecl (0x87d20a43f)
> >>>>>>>> ;;; 15 initecl (0x87d207e28)
> >>>>>>>> ;;; 16 _PyImport_LoadDynamicModule (0x800b3ed1c)
> >>>>>>>> ;;; 17 PyImport_AppendInittab (0x800b3d71f)
> >>>>>>>> ;;; 18 PyImport_AppendInittab (0x800b3d1a8)
> >>>>>>>> ;;; 19 PyImport_ImportModuleLevel (0x800b3c2ce)
> >>>>>>>> ;;; 20 _PyBuiltin_Init (0x800b162d7)
> >>>>>>>> ;;; 21 PyObject_Call (0x800a7d3e3)
> >>>>>>>> ;;; 22 PyEval_EvalFrameEx (0x800b2121c)
> >>>>>>>> ;;; 23 PyEval_EvalCodeEx (0x800b1b5d4)
> >>>>>>>> ;;; 24 PyEval_EvalCode (0x800b1ad96)
> >>>>>>>> ;;; 25 PyImport_ExecCodeModuleEx (0x800b3ad11)
> >>>>>>>> ;;; 26 PyImport_AppendInittab (0x800b3ddb8)
> >>>>>>>> ;;; 27 PyImport_AppendInittab (0x800b3d71f)
> >>>>>>>> ;;; 28 PyImport_AppendInittab (0x800b3d1a8)
> >>>>>>>> ;;; 29 PyImport_ImportModuleLevel (0x800b3c2ce)
> >>>>>>>> ;;; 30 _PyBuiltin_Init (0x800b162d7)
> >>>>>>>> ;;; 31 PyEval_EvalFrameEx (0x800b22dd1)
> >>>>>>>> Segmentation fault (core dumped)
> >>>>>>>>
> >>>>>>>> It looks as if ECL (version 16.1.2) is being called before
> >>>>>>>> initialisation is complete, but is it possible to say more without a
> >>>>>>>> debugger?
> >>>>>>>>
> >>>>>>>> More details: this is on FreeBSD 11.0, clang 3.8.0, GC version 7.6.0
> >>>>>>>> with libatomic_ops version 7.4.6.
> >>>>>>>> And only reproducible on FreeBSD.
> >>>>>>>>
> >>>>>>>> ECL is built with --disable-threads; GC is built with or without
> >>>>>>>> threads---result is still the same.
> >>>>>>>> (so it's unclear to me where pthread_* calls in the trace
> >>>>>>>> come from).
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Dima
> >>>>>>>>
> >>>>>>>> PS. the segfault is at the bottom of
> >>>>>>>> https://trac.sagemath.org/ticket/22679#comment:87
> 