Bus error in caml_oldify_local_roots when using native dynamic loading on Mac OS 10.5 #4690

Closed
vicuna opened this issue Jan 11, 2009 · 5 comments

vicuna commented Jan 11, 2009

Original bug ID: 4690
Reporter: herbelin
Assigned to: @xavierleroy
Status: closed (set by @xavierleroy on 2011-05-29T10:14:17Z)
Resolution: fixed
Priority: normal
Severity: crash
Version: 3.11.0+beta
Fixed in version: 3.11.1+dev
Category: ~DO NOT USE (was: OCaml general)

Bug description

On Mac OS 10.5, using the ocaml-based coq system, some calls to dynamically loaded functions result in a bus error in caml_oldify_local_roots (line 191 of roots.c, d is NULL in "if (d->retaddr == retaddr) break;").

This has been reproduced in exactly the same situation on two different installations of Mac OS 10.5, but the problem, even on a given installation, is sensitive to the execution context. For instance, changing the names of files or directories may change the way the problem appears. I was unable to find a simple example (I guess that the program has to run for a while to put the GC in the faulty configuration).

The most reliable way to reproduce the problem is to check out coq svn trunk revision 11773 (svn checkout -r 11773 svn://scm.gforge.inria.fr/svn/coq/trunk) with ocaml 3.11 and camlp5 5.10 installed, then run "configure -local; make". Depending on the installation context, the problem appears while compiling the coq file theories/Classes/RelationClasses.v or theories/Logic/ChoiceFacts.v or theories/ZArith/Zdiv.v, ... (see also http://logical.saclay.inria.fr/coq-bugs/show_bug.cgi?id=2024). My own installation is 10.5.6 and Xcode Tools 3.2.1 on Core 2 Duo. Statistically, the compilation of the coq system is large enough to trigger the problem at least once (I know of nobody who has yet succeeded in fully compiling the recent modularisation of coq using dynamic loading on Mac OS 10.5). No problem at all has been encountered on Mac OS 10.4.

The trace before crashing is the following:

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000000
caml_oldify_local_roots () at roots.c:191
191 if (d->retaddr == retaddr) break;
(gdb) bt full
#0 caml_oldify_local_roots () at roots.c:191
sp = 0xbfffe7f0 "\\005j"
retaddr = 18865600
regs = (value *) 0xbfffe7d0
d = (frame_descr *) 0x0
h = 261048
i = 283
j = <value temporarily unavailable, due to optimizations>
n = 0
ofs = <value temporarily unavailable, due to optimizations>
p = <value temporarily unavailable, due to optimizations>
glob = <value temporarily unavailable, due to optimizations>
root = <value temporarily unavailable, due to optimizations>
lr = <value temporarily unavailable, due to optimizations>
lnk = (link *) 0x0
#1 0x00306234 in caml_empty_minor_heap () at minor_gc.c:229
r = <value temporarily unavailable, due to optimizations>
#2 0x00306385 in caml_minor_collection () at minor_gc.c:272
prev_alloc_words = 0
#3 0x003041ad in caml_garbage_collection () at signals_asm.c:68
No locals.
#4 0x00313aeb in caml_call_gc ()
No symbol table info available.
Cannot access memory at address 0x5


vicuna commented Jan 13, 2009

Comment author: herbelin

After further investigation, the problem can be traced back to Mac OS 10.5's dlopen, which binds the redundant generic functions of the caml shared startup code to locations outside the dynamic module (hence presumably in the main executable). Accordingly, dlopen binds the return addresses listed in the frame table to locations outside the dynamic module. For its part, the GC expects these return addresses to be located within the dynamically loaded module. Therefore, the GC fails to find in the frame table the return addresses it is looking for.
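The expectation described above (every return address in a module's frame table lies inside that module's own code) can be checked with a small diagnostic. The following is only a sketch under stated assumptions: `struct module_info`, its fields, and `count_foreign_retaddrs` are hypothetical names invented for illustration and are not part of the OCaml runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one loaded module: the bounds of its code
   and the return addresses listed in its frame table. */
struct module_info {
    uintptr_t code_begin;      /* first byte of the module's code */
    uintptr_t code_end;        /* one past the last byte of code */
    const uintptr_t *retaddrs; /* return addresses from the frame table */
    size_t num_retaddrs;
};

/* Count the frame-table return addresses that fall outside the half-open
   range [code_begin, code_end) of the module's own code. */
static size_t count_foreign_retaddrs(const struct module_info *m)
{
    assert(m != NULL && m->code_begin <= m->code_end);
    size_t foreign = 0;
    for (size_t i = 0; i < m->num_retaddrs; i++) {
        uintptr_t a = m->retaddrs[i];
        if (a < m->code_begin || a >= m->code_end)
            foreign++;
    }
    return foreign;
}
```

A non-zero count for a freshly loaded module would point to the kind of misbinding described in this comment, before the GC ever walks the stack.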

Surprisingly, the last return address of the frame table (which corresponds to the smaller label in the code for the generic functions) is bound to yet another location.

A fix that worked for us was, first, to replace the .globl directives declaring the generic function symbols in the startup file with .private_extern directives and, second, to add a dummy instruction before the declaration of the first generic function so that the last return address of the frame table is bound locally too.


vicuna commented Jan 26, 2009

Comment author: @xavierleroy

Thanks for the detailed bug report and repro case. For the record, here is an analysis of what happens.

Consider a DLL that defines a function "f", which contains a GC point identified by the label "L100", therefore referenced from the table of frame descriptors for the DLL:

            .text
            .globl f
    f:      ...
    L100:   ...
            ...
            .data
    frametable: ...
            .long L100
            ...

Now, it happens that the main program also defines function "f". ("f" is one of the caml_curry/caml_tuplify/etc combinators.) References to "f" within the DLL can therefore be resolved either to the "f" of the DLL or that of the main program. Which one is chosen depends on the context of the reference to "f" and also on the OS. But in itself it doesn't matter since the two "f" have exactly the same code.

The real cause of the bug is that MacOS 10.5 incorrectly resolves the reference to "L100" found in the frametable of the DLL: rather than resolve it to a pointer within the DLL's "f" function, it resolves it as a pointer within the main program's "f" function. Therefore, the GC point within the DLL's "f" is not correctly described, causing the observed crash in caml_oldify_local_roots.
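To illustrate why a misbound frametable entry ends in a NULL dereference, here is a minimal sketch of an open-addressing lookup keyed by return address. It is a simplified model, not the runtime's code: `frame_descr_sketch`, `insert_descr`, `find_descr`, the table size, and the hash function are all assumptions made for the example. The trace above suggests that the runtime's loop assumes the key is always present; this sketch returns NULL on a miss instead, to make the failure visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 8u                 /* must be a power of two */
#define TABLE_MASK (TABLE_SIZE - 1u)

/* Simplified stand-in for a frame descriptor: only the lookup key. */
struct frame_descr_sketch {
    uintptr_t retaddr;
};

/* Open-addressing table; NULL marks an empty slot. */
static struct frame_descr_sketch *table[TABLE_SIZE];
static unsigned num_entries;

static unsigned hash_retaddr(uintptr_t retaddr)
{
    return (unsigned)((retaddr >> 3) & TABLE_MASK);
}

static void insert_descr(struct frame_descr_sketch *d)
{
    /* Keep at least one empty slot so that a failed lookup terminates. */
    assert(num_entries + 1u < TABLE_SIZE);
    unsigned h = hash_retaddr(d->retaddr);
    while (table[h] != NULL)
        h = (h + 1u) & TABLE_MASK;
    table[h] = d;
    num_entries++;
}

/* Returns NULL on a miss. A loop that assumes a hit would instead read
   d->retaddr through the empty slot, as in the reported crash. */
static struct frame_descr_sketch *find_descr(uintptr_t retaddr)
{
    unsigned h = hash_retaddr(retaddr);
    for (;;) {
        struct frame_descr_sketch *d = table[h];
        if (d == NULL)
            return NULL;
        if (d->retaddr == retaddr)
            return d;
        h = (h + 1u) & TABLE_MASK;
    }
}
```

If the descriptor for "L100" is registered with an address inside the main program's "f" while the stack holds a return address inside the DLL's "f", `find_descr` is asked for a key that was never inserted and reaches an empty slot.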

I'm pretty sure this is a bug in MacOS 10.5: both 10.4 and ELF-based systems (Linux) correctly treat the reference to "L100" as local to the DLL, which it is.

A workaround is in preparation; to be continued.


vicuna commented Jan 26, 2009

Comment author: @xavierleroy

Implemented a fix (based on .private_extern, as suggested) in the 3.11 release branch. With this fix, compilation of Coq 11773 succeeds. Tested only on 32-bit x86. To facilitate testing, a patch against 3.11.0 is attached.


vicuna commented Jan 27, 2009

Comment author: herbelin

I tried the patch (Core 2 Duo, MacOS 10.5.6, Xcode 3.2.1) and I still fail to compile Coq. My understanding is that there is a second bug on top of the first one: the labels in the scope of the first generic function are treated differently from the other labels and get a wrong location even when the generic functions are declared private. A fix that worked for me is to add a dummy nop instruction between the _caml_shared_startup__code_begin label and the label of the first generic function. Without this extra fix, I still eventually get a bus error. Hope it helps.


vicuna commented Mar 28, 2009

Comment author: @xavierleroy

Added "nop" as suggested by Hugo to address the remaining problem.
(What happened is probably the following: if two labels point to the same location, and one is private_extern, and the other is global, the first one automatically becomes global.)
