native binaries crash in top-level exception handler #5700
Comments
Comment author: @lefessan I couldn't reproduce the problem on my Linux computer (I have no Mac OS X computer available). Could you just test this in the same directory: echo "include Hashtbl" > test.ml On my computer, "caml_init_frame_descriptors" is first called from the "String.contains" included in the "randomized_default" initialization of Hashtbl. Narrowing down would probably include compiling "libasmrund.a" in trunk/asmrun, and linking the test program above (if it fails too) with it to be able to have better debugging information. |
Comment author: @avsm This only happens on 10.8 x86_64 for me, and not reproducible on any other OS for me. The test-case above doesn't crash, and it can only be triggered when OCAMLRUNPARAM=b. I've got ocamlbuild in the 4.00 tree crashing with this shell script to build it after a 'make world.opt': [code] #!/bin/sh -ex Note that if the DYLD_INSERT_LIBRARIES is uncommented (to use the debug MacOS X malloc), then the program completes fine. The malloc checks don't make a difference. I'll try with libasmrund.a now and see if that also repros |
Comment author: @avsm Still happens with libasmrund.a, and here's the more helpful backtrace: (gdb) run -clean OCaml runtime: debug mode Initial minor heap size: 2048k bytes Program received signal EXC_BAD_ACCESS, Could not access memory. |
Comment author: @avsm I tell a lie; your test case does indeed trigger the error, but not with the debug library in this case: $ ocamlopt -g -o test test.ml && OCAMLRUNPARAM= ./test OCaml runtime: debug mode Initial minor heap size: 2048k bytes OCaml runtime: debug mode Initial minor heap size: 2048k bytes |
Comment author: @lefessan Have you got other versions of OCaml running this example without crashing? 3.12.1, 4.0 beta1? Any trunk revision? |
Comment author: @avsm I've now uninstalled all previous Homebrew and done a fresh install of 3.12.1, and that crashes on this 10.8 machine also in the same way, so this is not a 4.00 regression. Now however, I cannot get the 'include Hashtbl' to crash on either 3.12.1 or 4.00.0 whereas my above traces do show it segfaulting in the past. The ocaml-cstruct repository example (which uses oasis) does continue to segfault on both 3.12 and 4.0. If this is memory corruption, then it could be address-space randomisation causing the differences in behaviour between runs. I'm leaving a second Mac Mini upgrading to 10.8 so that I can eliminate this one machine as a cause. I noticed that there was another similar report on the Caml list about this same problem: https://sympa.inria.fr/sympa/arc/caml-list/2012-07/msg00142.html which indicates that it's not just my machine though. |
Comment author: @lefessan For 3.12.1, there is no randomization in Hashtbls, so the initialization code won't raise an exception. The problem probably appears later. Maybe simply raising an exception would trigger the bug: let _ = raise Not_found and is the simplest reproducible case for all versions. Anyway, what is weird is that the bug appears inside "malloc", and the only reason I can see would be the corruption of the header/trailer of some previously allocated block with a previous malloc. Why this problem arises only on Mac OS X is another question... |
Comment author: @avsm The other odd thing is that the various malloc guard variables (which add guard pages per-allocation and scribble over fresh memory, and generally try to detect heap corruption) do not detect any corruption of the malloc structures. Unfortunately, the pre-10.8 MacOS X method of disabling address-space randomisation (DYLD_NO_PIE) appears to have been removed in this version. I'll continue to try and find a reproducible small case (I can still repro it with the cstruct compilation, but not with a smaller test case anymore) |
Comment author: @avsm Another upgraded Mac (from 10.7->10.8 and freshly installed OCaml toolchain) exhibits the same behaviour. It's reproducible by just running 'ocamlopt.opt' 3.12.1 or 4.00.0 with OCAMLRUNPARAM=b $ uname -a (gdb) run Program received signal EXC_BAD_ACCESS, Could not access memory. I've tried a few smaller test programs to see if I can spot the heap corruption before malloc, but nothing triggers it yet. Using ocamlopt.opt as the failing testcase, the program crashes after startup.c/caml_start_program is called. If I initialise the frame tables early in startup.c by adding:
to startup.c just before "res = caml_start_program", then the problem vanishes. |
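One reading of this observation, consistent with the stack-alignment explanation later in the thread, is that initialising the frame table early moves the malloc() call off the exception-raising path. The C sketch below is only a toy model of that lazy-versus-eager initialisation; every name in it (frame_table, init_frame_table, stash_backtrace_model, reset_model) is hypothetical and none of it is the actual runtime code.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-ins; these are not the real OCaml runtime symbols. */
static void *frame_table = NULL;      /* NULL means "not initialised yet" */
static int mallocs_during_raise = 0;  /* allocations made while raising   */
static int raising = 0;

/* Allocate the table once; count the allocation if it happens mid-raise. */
static void init_frame_table(void) {
    if (frame_table != NULL) return;  /* already initialised: no malloc */
    frame_table = malloc(64);
    assert(frame_table != NULL);
    if (raising) mallocs_during_raise++;
}

/* Model of recording a backtrace when an exception is raised. */
static void stash_backtrace_model(void) {
    raising = 1;
    init_frame_table();               /* lazy path: first raise pays for malloc */
    raising = 0;
}

/* Reset the model so the lazy and eager scenarios can be compared. */
static void reset_model(void) {
    free(frame_table);
    frame_table = NULL;
    mallocs_during_raise = 0;
}
```

In this model, calling stash_backtrace_model() on a fresh state performs one allocation while raising, whereas calling init_frame_table() first leaves none on that path. That would hide a problem in the raise path without fixing it.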
Comment author: @ygrek As this first appeared in 4.00 and is connected with debugging - a blind guess - maybe CFI is an issue? (e.g. Mac OS X profiles mallocs and samples the stack at each allocation and, due to incorrect(?) CFI information, fails badly?). What happens if one compiles OCaml without CFI enabled (after configure set ASM_CFI_SUPPORTED=no in Makefile.config)? PS Not connected, but I have seen tcmalloc segfaulting on Linux while probing the stack for profiling. |
Comment author: @xavierleroy I have a possible explanation, but it's a wild shot. MacOS X is very touchy about the stack pointer being 16-aligned, or more precisely about C functions being entered with rsp mod 16 = 8. That's because their C compiler sometimes emits SSE2 128-bit load and store instructions that demand 16-alignment of their target addresses. Re-reading the code of caml_raise_exception in asmrun/amd64.S, I see that it violates this alignment constraint: caml_stash_backtrace is entered with rsp mod 16 = 0. Most of the time, caml_stash_backtrace doesn't call any C library function, so no bad things happen. If, however, the frame table was not initialized before, caml_stash_backtrace calls caml_init_frame_descriptors, which does a lot of work, including calling malloc(). And maybe the malloc() in 10.8 happens to use those strictly-aligned SSE2 instructions. Bottom line: could you please try to apply the patch (attached to this PR and included below for e-mail convenience) to asmrun/amd64.S and let us know if the problem is still here? What the patch does is simply to fix caml_raise_exception so that it maintains proper stack alignment. Index: asmrun/amd64.S--- asmrun/amd64.S (revision 12802)
|
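The alignment arithmetic in the comment above can be sketched in a few lines of C. This is only a numeric model, assuming the usual x86-64 behaviour that a call instruction pushes an 8-byte return address; the function names are hypothetical and the padding step is an illustration, not the content of the attached patch.

```c
#include <assert.h>
#include <stdint.h>

/* `call` on x86-64 pushes an 8-byte return address, lowering rsp by 8. */
static uint64_t rsp_at_entry(uint64_t rsp_before_call) {
    return rsp_before_call - 8;
}

/* The constraint described above: rsp mod 16 == 8 on entry to a C function. */
static int entry_is_aligned(uint64_t rsp_before_call) {
    return rsp_at_entry(rsp_before_call) % 16 == 8;
}

/* Bytes to subtract from rsp before the call so the callee is aligned.
   Assumes rsp is already a multiple of 8, as it normally is in 64-bit code. */
static uint64_t padding_before_call(uint64_t rsp) {
    assert(rsp % 8 == 0);
    return rsp % 16;   /* 0 if already 16-aligned, otherwise 8 */
}
```

With rsp 16-aligned before the call, the callee sees rsp mod 16 = 8, which is what the C compiler expects. If rsp mod 16 is 8 before the call, the callee is entered with rsp mod 16 = 0, the situation described for caml_stash_backtrace, and lowering rsp by a further 8 bytes first restores the expected state.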
Comment author: serp Patch is OK. On Mac OS 10.8 "ocaml 4.00.0" with "OCAMLRUNPARAM=b" - compiled successfully. Thanks. |
Comment author: @xavierleroy Patch to amd64.S committed in 4.00 bugfix branch (r12815) and on trunk (r12816). I'm not 100% sure it fixes avsm's original issue, but optimistically assume that it does. Please reopen this PR if the problem persists. |
Comment author: @damiendoligez I have uploaded a version of Xavier's patch that applies cleanly to 4.00.0, and I confirm that it fixes at least avsm's "ocamlopt.opt" repro case. It seems that you will need to "make clean" after applying this patch (instead of rebuilding right away). |
Comment author: @avsm Sorry about the (vacation-induced) delayed response. The patch does indeed eliminate the segfault on 4.00.0 for me also, and the fix is confirmed in gdb: $ env OCAMLRUNPARAM=b gdb ./ocamlopt-4.00.0.opt.broken With the patch, rsp is 16-byte aligned after the CALL instruction to caml_stash_backtrace: (gdb) break caml_stash_backtrace I backported this to 3.12.1 to help with migrating our repositories, and it gets further but still crashes shortly afterwards from an early caml_c_call: Reason: 13 at address: 0x0000000000000000 rsp is also misaligned here in caml_c_call: I couldn't spot any differences between the 3.12.1 and 4.00.0 caml_c_call implementations, so I just wanted to check that it isn't just working by a lucky alignment. |
Comment author: @xavierleroy Thanks again for the precious feedback. I don't have MacOS 10.8 installed, so I just instrumented amd64.S to check stack alignment before every call to C functions, and lo and behold, there is another call to caml_stash_backtrace with a misaligned SP... Attached to this PR (alignment-caml-raise-exception-2.diff) and included below for e-mail convenience is a second patch, to be applied on top of the previous one, which should complete the fix. Let me know how it goes. (For 3.12.1, just insert "subq $8, %rsp" before "call GCALL(caml_stash_backtrace)" in asmrun/amd64.S, function caml_raise_exception.) Index: amd64.S--- amd64.S (revision 12816)
|
Comment author: @xavierleroy Second patch committed in 4.00 bugfix branch (r12817) and in trunk (r12818). |
Comment author: @avsm Perfect! A quick spin sees everything working, and I'll try it more when I'm back next week. For anyone else who needs a quick fix, I've uploaded combined patches against 3.12.1 and 4.00.0 to this ticket, and submitted pull requests to Homebrew: 3.12.1: Homebrew/legacy-homebrew#13913 |
Comment author: Richard Jones FYI I have hit the same issue in slightly different circumstances. It's basically an example of this:
(gdb) bt
=> 0x0818e054 <+324>: movapd %xmm1,0x20(%esp)
I will see if I can make a variant of the amd64 patch to work on 32 bit and see if that fixes the bug. |
Comment author: Richard Jones There is a long thread/argument here which basically says we need to use 16 byte stack alignment in order to interoperate with gcc: |
Comment author: Richard Jones FWIW I decided to work around (instead of fix) this issue by compiling OCaml with: CFLAGS=-mpreferred-stack-boundary=2 ./configure |
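For readers unfamiliar with the flag: GCC documents -mpreferred-stack-boundary=num as keeping the stack aligned to 2 raised to num bytes, so 2 means 4 bytes and the default of 4 means 16 bytes. The movapd instruction in the trace above requires a 16-byte-aligned memory operand. The C sketch below only illustrates that arithmetic; the helper names and the sample esp values are hypothetical, not taken from the report.

```c
#include <assert.h>
#include <stdint.h>

/* Stack boundary in bytes for -mpreferred-stack-boundary=num (2^num). */
static unsigned stack_boundary_bytes(unsigned num) {
    assert(num < 32);
    return 1u << num;
}

/* movapd faults unless its memory operand is 16-byte aligned.
   Models an operand of the form disp(%esp). */
static int movapd_operand_ok(uint32_t esp, uint32_t disp) {
    return (esp + disp) % 16 == 0;
}
```

For example, stack_boundary_bytes(2) is 4 and stack_boundary_bytes(4) is 16. With a displacement of 0x20, the operand is aligned only when esp itself is a multiple of 16, so code compiled on the assumption of a 16-byte-aligned stack can fault when entered from a caller that only guarantees 4-byte alignment.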
Original bug ID: 5700
Reporter: @avsm
Status: closed (set by @xavierleroy on 2015-12-11T18:25:33Z)
Resolution: fixed
Priority: high
Severity: crash
OS: MacOS X
OS Version: 10.8
Version: 4.00.0+beta2/+rc1
Target version: 4.00.1+dev
Fixed in version: 4.00.1+dev
Category: back end (clambda to assembly)
Monitored by: serp @ygrek "Richard Jones" @dbuenzli
Bug description
With MacOS X 10.8 and latest XCode, native code binaries seem to crash if invoked from a subshell, with OCAMLRUNPARAM set to b.
gdb ocamlbuild
GNU gdb 6.3.50-20050815 (Apple version gdb-1820) (Sat Jun 16 02:40:11 UTC 2012)
Copyright 2004 Free Software Foundation, Inc.
GDB is free software, covered by the GNU General Public License, and you are
welcome to change it and/or distribute copies of it under certain conditions.
Type "show copying" to see the conditions.
There is absolutely no warranty for GDB. Type "show warranty" for details.
This GDB was configured as "x86_64-apple-darwin"...Reading symbols for shared libraries .. done
(gdb) run -clean
Starting program: /Users/avsm/.opam/4.00.0+rc1/bin/ocamlbuild -clean
Reading symbols for shared libraries +............................. done
Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: 13 at address: 0x0000000000000000
0x00007fff89012f88 in large_malloc ()
(gdb) bt
#0 0x00007fff89012f88 in large_malloc ()
#1 0x00007fff8901974f in szone_malloc_should_clear ()
#2 0x00007fff8900b183 in malloc_zone_malloc ()
#3 0x00007fff8900bbd7 in malloc ()
#4 0x000000010007b7d4 in caml_stat_alloc ()
#5 0x0000000100077ed5 in caml_init_frame_descriptors ()
#6 0x000000010008db66 in caml_stash_backtrace ()
#7 0x000000010008e01d in caml_raise_exn ()
Previous frame inner to this frame (gdb could not unwind past this frame)
(gdb) The program is running. Exit anyway? (y or n) y
Steps to reproduce
I can reproduce this reliably by:
$ git clone http://github.com/mirage/ocaml-cstruct
$ cd ocaml-cstruct/unix
$ make (or make clean for a simpler example).
It doesn't seem to happen directly from a shell, nor with a trivial Makefile that invokes ocamlbuild -clean directly. Narrowing it down now...