|Anonymous | Login | Signup for a new account||2016-09-29 22:27 CEST|
|Main | My View | View Issues | Change Log | Roadmap|
|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0004690||OCaml||OCaml general||public||2009-01-11 10:21||2011-05-29 12:14|
|Target Version||Fixed in Version||3.11.1+dev|
|Summary||0004690: Bus error in caml_oldify_local_roots when using native dynamic loading on Mac OS 10.5|
|Description||On Mac OS 10.5, using the ocaml-based coq system, some calls to dynamically loaded functions result in a bus error in caml_oldify_local_roots (line 191 of roots.c, d is NULL in "if (d->retaddr == retaddr) break;").|
This has been reproduced in exactly the same situation on two different installations of Mac OS 10.5 but the problem, even on a given installation, is sensible to the execution context. For instance, changing the name of the files or directories may change the way the problem appears. I was unable to find a simple example (I guess that the program has to run for a while to set the gc in the faulty configuration).
The simplest way to certainly reproduce the problem is to export coq svn trunk revision 11773 (svn checkout -r 11773 svn://scm.gforge.inria.fr/svn/coq/trunk [^]) with ocaml 3.11 and camlp5 5.10 installed, then "configure -local; make". Depending on the installation context, a problem appears either while compiling coq file theories/Classes/RelationClasses.v or theories/Logic/ChoiceFacts.v or theories/ZArith/Zdiv.v, ... (see also http://logical.saclay.inria.fr/coq-bugs/show_bug.cgi?id=2024 [^]). My own installation is 10.5.6 and Xcode Tools 3.2.1 on Core 2 Duo. Statistically, the compilation of the coq system is large enough to trigger the problem at least once (I no nobody who has succeeded to compile the recent modularisation of coq using dynamic loading on Mac OS 10.5 in full yet). No problem at all has been encountered on Mac OS 10.4.
The trace before crashing is the following:
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000000
caml_oldify_local_roots () at roots.c:191
191 if (d->retaddr == retaddr) break;
(gdb) bt full
#0 caml_oldify_local_roots () at roots.c:191
sp = 0xbfffe7f0 "\\\005j"
retaddr = 18865600
regs = (value *) 0xbfffe7d0
d = (frame_descr *) 0x0
h = 261048
i = 283
j = <value temporarily unavailable, due to optimizations>
n = 0
ofs = <value temporarily unavailable, due to optimizations>
p = <value temporarily unavailable, due to optimizations>
glob = <value temporarily unavailable, due to optimizations>
root = <value temporarily unavailable, due to optimizations>
lr = <value temporarily unavailable, due to optimizations>
lnk = (link *) 0x0
#1 0x00306234 in caml_empty_minor_heap () at minor_gc.c:229
r = <value temporarily unavailable, due to optimizations>
#2 0x00306385 in caml_minor_collection () at minor_gc.c:272
prev_alloc_words = 0
0000003 0x003041ad in caml_garbage_collection () at signals_asm.c:68
0000004 0x00313aeb in caml_call_gc ()
No symbol table info available.
Cannot access memory at address 0x5
|Tags||No tags attached.|
|Attached Files||ocaml-3.11.0-macosx-dynlink.patch [^] (4,421 bytes) 2009-01-26 18:09 [Show Content]|
After further investigations, the problem can be traced back to Mac OS 10.5's dlopen which binds the redundant generic functions of the caml shared startup code to location outside the dynamic module (hence presumably in the main executable). Accordingly, dlopen binds the return addresses listed in the frame table to locations outside the dynamic module. On its side, the gc expects these return addresses to be located within the dynamically loaded module. Therefore, the gc fails to find in the frame table the return addresses it is looking for.
Surprisingly, the last return address of the frame table (which corresponds to the smaller label in the code for the generic functions) is bound yet elsewhere.
A fix which worked for us was first to replace the .globl directives declaring the generic function symbols in the startup file by .private_extern directives, secondly to add a dummy instruction before the declaration of the first generic function so as to have the last return address of the frame table bound locally too.
Thanks for the detailed bug report and repro case. For the record, here is an analysis of what happens.
Consider a DLL that defines a function "f", which contains a GC point identified by the label "L100", therefore referenced from the table of frame descriptors for the DLL:
Now, it happens that the main program also defines function "f". ("f" is one of the caml_curry/caml_tuplify/etc combinators.) References to "f" within the DLL can therefore be resolved either to the "f" of the DLL or that of the main program. Which one is chosen depends on the context of the reference to "f" and also on the OS. But in itself it doesn't matter since the two "f" have exactly the same code.
The real cause of the bug is that MacOS 10.5 incorrectly resolves the reference to "L100" found in the frametable of the DLL: rather than resolve it to a pointer within the DLL's "f" function, it resolves it as a pointer within the main program's "f" function. Therefore, the GC point within the DLL's "f" is not correctly described, causing the observed crash in caml_oldify_local_roots.
I'm pretty sure this is a bug in MacOS 10.5: both 10.4 and ELF-based systems (Linux) correctly treat the reference to "L100" as local to the DLL, which it is.
A workaround is in preparation; to be continued.
Implemented a fix (based on .private_extern, as suggested) in 3.11 release branch. With this fix, compilation of Coq 11773 succeeds. Tested only on x86 32 bits. To facilitate testing, a patch against 3.11.0 is attached.
|I tried the patch (Core 2 duo, MacOS 10.5.6, Xcode 3.2.1) and I still fail to compile Coq. My understanding was that there is a second bug on top of the first one: the labels in the scope of the first generic function are treated differently from the other labels and get a wrong location even when the generic functions are declared private. A fix that worked for me is to add a dummy nop instruction between the _caml_shared_startup__code_begin label and the label of the first generic function. Without this extra fix, I still eventually get a bus error. Hope it helps.|
Added "nop" as suggested by Hugo to address the remaining problem.
(What happened is probably the following: if two labels point to the same location, and one is private_extern, and the other is global, the first one automatically becomes global.)
|2009-01-11 10:21||herbelin||New Issue|
|2009-01-13 11:46||herbelin||Note Added: 0004812|
|2009-01-26 17:05||xleroy||Note Added: 0004822|
|2009-01-26 17:05||xleroy||Assigned To||=> xleroy|
|2009-01-26 17:05||xleroy||Status||new => confirmed|
|2009-01-26 18:09||xleroy||Note Added: 0004823|
|2009-01-26 18:09||xleroy||Status||confirmed => resolved|
|2009-01-26 18:09||xleroy||File Added: ocaml-3.11.0-macosx-dynlink.patch|
|2009-01-27 18:59||herbelin||Note Added: 0004825|
|2009-01-27 18:59||herbelin||Status||resolved => feedback|
|2009-01-27 18:59||herbelin||Resolution||open => reopened|
|2009-03-28 16:21||xleroy||Note Added: 0004866|
|2009-03-28 16:21||xleroy||Status||feedback => resolved|
|2009-03-28 16:21||xleroy||Resolution||reopened => fixed|
|2009-04-19 10:57||xleroy||Fixed in Version||=> 3.11.1+dev|
|2011-05-29 12:14||xleroy||Status||resolved => closed|
|Copyright © 2000 - 2011 MantisBT Group|