View Issue Details |
| ID | Project | Category | View Status | Date Submitted | Last Update |
| 0005700 | OCaml | OCaml backend (code generation) | public | 2012-07-26 12:35 | 2012-08-03 13:06 |
| Reporter | avsm |
| Assigned To | |
| Priority | high | Severity | crash | Reproducibility | sometimes |
| Status | resolved | Resolution | fixed |
| Platform | OS | MacOS X | OS Version | 10.8 |
| Product Version | 4.00.0+beta2/+rc1 |
| Target Version | 4.00.1+dev | Fixed in Version | 4.00.1+dev |
| Summary | 0005700: native binaries crash in top-level exception handler |
| Description | With MacOS X 10.8 and the latest Xcode, native-code binaries seem to crash if invoked from a subshell with OCAMLRUNPARAM set to b. |

    gdb ocamlbuild
    GNU gdb 6.3.50-20050815 (Apple version gdb-1820) (Sat Jun 16 02:40:11 UTC 2012)
    Copyright 2004 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "x86_64-apple-darwin"...Reading symbols for shared libraries .. done

    (gdb) run -clean
    Starting program: /Users/avsm/.opam/4.00.0+rc1/bin/ocamlbuild -clean
    Reading symbols for shared libraries +............................. done

    Program received signal EXC_BAD_ACCESS, Could not access memory.
    Reason: 13 at address: 0x0000000000000000
    0x00007fff89012f88 in large_malloc ()
    (gdb) bt
    #0  0x00007fff89012f88 in large_malloc ()
    #1  0x00007fff8901974f in szone_malloc_should_clear ()
    #2  0x00007fff8900b183 in malloc_zone_malloc ()
    #3  0x00007fff8900bbd7 in malloc ()
    #4  0x000000010007b7d4 in caml_stat_alloc ()
    #5  0x0000000100077ed5 in caml_init_frame_descriptors ()
    #6  0x000000010008db66 in caml_stash_backtrace ()
    #7  0x000000010008e01d in caml_raise_exn ()
    Previous frame inner to this frame (gdb could not unwind past this frame)
    (gdb) The program is running.  Exit anyway? (y or n) y
| Steps To Reproduce | I can reproduce this reliably by: |

    $ git clone http://github.com/mirage/ocaml-cstruct
    $ cd ocaml-cstruct/unix
    $ make

(or make clean for a simpler example). It doesn't seem to happen directly from a shell, nor with a trivial Makefile that invokes ocamlbuild -clean directly. Narrowing it down now...
| Tags | No tags attached. |
| Attached Files | |
Notes |
|
|
(0007814) lefessan (developer) 2012-07-26 18:55 edited on: 2012-07-26 18:56 |
I couldn't reproduce the problem on my Linux computer (I have no Mac OS X computer available). Could you just test this in the same directory:

    echo "include Hashtbl" > test.ml
    ocamlopt -g -o test test.ml
    ./test

On my computer, "caml_init_frame_descriptors" is first called from the "String.contains" included in the "randomized_default" initialization of Hashtbl.

Narrowing down would probably include compiling "libasmrund.a" in trunk/asmrun, and linking the test program above (if it fails too) with it to be able to have better debugging information.
|
(0007815) avsm (reporter) 2012-07-26 21:01 |
This only happens on 10.8 x86_64 for me, and is not reproducible on any other OS. The test case above doesn't crash, and the crash can only be triggered when OCAMLRUNPARAM=b. I've got ocamlbuild in the 4.00 tree crashing with this shell script to build it after a 'make world.opt':

[code]
#!/bin/sh -ex
cd asmrun && make libasmrun.a && cp libasmrun.a ../stdlib
cd ../_build
../ocamlcompopt.sh -verbose -nostdlib unix.cmxa -g -I stdlib -I ../otherlibs/unix ocamlbuild/ocamlbuild_executor.cmx ocamlbuild/ocamlbuild_pack.cmx ocamlbuild/ocamlbuild_unix_plugin.cmx ocamlbuild/ocamlbuild.cmx -o ocamlbuild/ocamlbuild.native
export OCAMLRUNPARAM=b
#export DYLD_INSERT_LIBRARIES=/usr/lib/libgmalloc.dylib
export MallocGuardEdges=1
export MallocCheckHeapStart=1
export MallocCheckHeapEach=1
export MallocScribble=1
./ocamlbuild/ocamlbuild.native -clean
[/code]

Note that if the DYLD_INSERT_LIBRARIES line is uncommented (to use the debug MacOS X malloc), then the program completes fine. The malloc checks don't make a difference. I'll try with libasmrund.a now and see if that also repros.
|
(0007816) avsm (reporter) 2012-07-26 21:06 |
Still happens with libasmrund.a, and here's the more helpful backtrace:

    (gdb) run -clean
    Starting program: /Users/avsm/src/git/bmeurer/ocaml/_build/ocamlbuild/ocamlbuild.native -clean
    bash(69204) malloc: enabling scribbling to detect mods to free blocks
    arch(69204) malloc: enabling scribbling to detect mods to free blocks
    Reading symbols for shared libraries +............................. done
    ocamlbuild.native(69204) malloc: enabling scribbling to detect mods to free blocks
    ### OCaml runtime: debug mode ###
    Initial minor heap size: 2048k bytes
    Initial major heap size: 992k bytes
    Initial space overhead: 80%
    Initial max overhead: 500%
    Initial heap increment: 992k bytes
    Initial allocation policy: 0

    Program received signal EXC_BAD_ACCESS, Could not access memory.
    Reason: 13 at address: 0x0000000000000000
    0x00007fff89012f88 in large_malloc ()
    (gdb) bt
    #0  0x00007fff89012f88 in large_malloc ()
    #1  0x00007fff8901974f in szone_malloc_should_clear ()
    #2  0x00007fff8900b183 in malloc_zone_malloc ()
    #3  0x00007fff8900bbd7 in malloc ()
    #4  0x0000000100081918 in caml_stat_alloc (sz=131072) at memory.d.c:529
    #5  0x0000000100078ad3 in caml_init_frame_descriptors () at roots.d.c:101
    #6  0x00000001000a0f19 in caml_stash_backtrace (exn=4301260168, pc=4295272412, sp=0x7fff5fbff5d0 "?\005`", trapsp=0x7fff5fbff5e0 " ??_?") at backtrace.d.c:73
    #7  0x00000001000a18fd in caml_raise_exn ()
    Previous frame inner to this frame (gdb could not unwind past this frame)
|
(0007817) avsm (reporter) 2012-07-26 21:26 |
I tell a lie; your test case does indeed trigger the error, but not with the debug library in this case:

    $ ocamlopt -g -o test test.ml && OCAMLRUNPARAM= ./test
    $ ocamlopt -g -o test test.ml && OCAMLRUNPARAM=b ./test
    Segmentation fault: 11
    $ ocamlopt -g -runtime-variant d -o test test.ml && OCAMLRUNPARAM=b ./test
    ### OCaml runtime: debug mode ###
    Initial minor heap size: 2048k bytes
    Initial major heap size: 992k bytes
    Initial space overhead: 80%
    Initial max overhead: 500%
    Initial heap increment: 992k bytes
    Initial allocation policy: 0
    $ ocamlopt -g -runtime-variant d -o test test.ml && OCAMLRUNPARAM= ./test
    ### OCaml runtime: debug mode ###
    Initial minor heap size: 2048k bytes
    Initial major heap size: 992k bytes
    Initial space overhead: 80%
    Initial max overhead: 500%
    Initial heap increment: 992k bytes
    Initial allocation policy: 0
    $
|
(0007818) lefessan (developer) 2012-07-27 00:04 |
Do you have other versions of OCaml that run this example without crashing? 3.12.1, 4.00 beta1? Any trunk revision? |
|
(0007820) avsm (reporter) 2012-07-27 10:47 edited on: 2012-07-27 10:48 |
I've now uninstalled all previous Homebrew and done a fresh install of 3.12.1, and that crashes on this 10.8 machine also in the same way, so this is *not* a 4.00 regression.

Now however, I cannot get the 'include Hashtbl' to crash on either 3.12.1 or 4.00.0, whereas my above traces do show it segfaulting in the past. The ocaml-cstruct repository example (which uses oasis) does continue to segfault on both 3.12 and 4.0.

If this is memory corruption, then it could be address-space randomisation causing the differences in behaviour between runs. I'm leaving a second Mac Mini upgrading to 10.8 so that I can eliminate this one machine as a cause.

I noticed that there was another similar report on the Caml list about this same problem: https://sympa.inria.fr/sympa/arc/caml-list/2012-07/msg00142.html which indicates that it's not just my machine though.
|
(0007824) lefessan (developer) 2012-07-27 13:06 |
For 3.12.1, there is no randomization in Hashtbls, so the initialization code won't raise an exception. The problem probably appears later. Maybe simply raising an exception would trigger the bug:

    let _ = raise Not_found

That would be the simplest reproducible case for all versions.

Anyway, what is weird is that the bug appears inside "malloc", and the only reason I can see would be the corruption of the header/trailer of some previously allocated block with a previous malloc. Why this problem arises only on Mac OS X is another question...
|
(0007825) avsm (reporter) 2012-07-27 13:11 |
The other odd thing is that the various malloc guard variables (which add guard pages per-allocation and scribble over fresh memory, and generally try to detect heap corruption) do not detect any corruption of the malloc structures. Unfortunately, the pre-10.8 MacOS X method of disabling address-space randomisation (DYLD_NO_PIE) appears to have been removed in this version. I'll continue to try and find a reproducible small case (I can still repro it with the cstruct compilation, but not with a smaller test case anymore) |
|
(0007835) avsm (reporter) 2012-07-30 17:04 |
Another upgraded Mac (from 10.7->10.8 and freshly installed OCaml toolchain) exhibits the same behaviour. It's reproducible by just running 'ocamlopt.opt' 3.12.1 or 4.00.0 with OCAMLRUNPARAM=b

    $ uname -a
    Darwin cubik.local 12.0.0 Darwin Kernel Version 12.0.0: Sun Jun 24 23:00:16 PDT 2012; root:xnu-2050.7.9~1/RELEASE_X86_64 x86_64
    $ gcc -v
    Using built-in specs.
    Target: i686-apple-darwin11
    Configured with: /private/var/tmp/llvmgcc42/llvmgcc42-2336.11~28/src/configure --disable-checking --enable-werror --prefix=/Applications/Xcode.app/Contents/Developer/usr/llvm-gcc-4.2 --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-prefix=llvm- --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin11 --enable-llvm=/private/var/tmp/llvmgcc42/llvmgcc42-2336.11~28/dst-llvmCore/Developer/usr/local --program-prefix=i686-apple-darwin11- --host=x86_64-apple-darwin11 --target=i686-apple-darwin11 --with-gxx-include-dir=/usr/include/c++/4.2.1
    Thread model: posix
    gcc version 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2336.11.00)
    $ ocamlopt -v
    The Objective Caml native-code compiler, version 3.12.1
    Standard library directory: /usr/local/lib/ocaml
    $ ocamlopt.opt
    Segmentation fault: 11
    cubik:x avsm$ gdb ocamlopt.opt
    GNU gdb 6.3.50-20050815 (Apple version gdb-1820) (Sat Jun 16 02:40:11 UTC 2012)
    Copyright 2004 Free Software Foundation, Inc.
    GDB is free software, covered by the GNU General Public License, and you are
    welcome to change it and/or distribute copies of it under certain conditions.
    Type "show copying" to see the conditions.
    There is absolutely no warranty for GDB.  Type "show warranty" for details.
    This GDB was configured as "x86_64-apple-darwin"...Reading symbols for shared libraries .. done

    (gdb) run
    Starting program: /usr/local/bin/ocamlopt.opt
    Reading symbols for shared libraries +............................. done

    Program received signal EXC_BAD_ACCESS, Could not access memory.
    Reason: 13 at address: 0x0000000000000000
    0x00007fff89d2ff88 in large_malloc ()
    (gdb) bt
    #0  0x00007fff89d2ff88 in large_malloc ()
    #1  0x00007fff89d3674f in szone_malloc_should_clear ()
    #2  0x00007fff89d28183 in malloc_zone_malloc ()
    #3  0x00007fff89d28bd7 in malloc ()
    #4  0x00000001001523c4 in caml_stat_alloc ()
    #5  0x000000010014eb05 in caml_init_frame_descriptors ()
    #6  0x0000000100162fd6 in caml_stash_backtrace ()
    #7  0x00000001001634fc in .L111 ()
    #8  0x000000010014e7cd in caml_raise_constant ()
    #9  0x000000010014e7f0 in caml_raise_not_found ()
    #10 0x000000010015d313 in caml_sys_getenv ()
    #11 0x00000001001633ac in caml_c_call ()
    Previous frame inner to this frame (gdb could not unwind past this frame)

I've tried a few smaller test programs to see if I can spot the heap corruption before malloc, but nothing triggers it yet.

Using ocamlopt.opt as the failing testcase, the program crashes after startup.c/caml_start_program is called. If I initialise the frame tables early in startup.c by adding:

    #include "roots.h"

    if (caml_frame_descriptors == NULL) {
      caml_init_frame_descriptors();
    }

to startup.c just before "res = caml_start_program", then the problem vanishes.
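The workaround in the note above moves a lazy, one-time initialisation to program startup. A minimal sketch of that pattern in C follows. All names (`table`, `init_table`, `stash`, `init_ran_inside_stash`) are hypothetical stand-ins and not the runtime's code; the allocation size is the `sz=131072` seen in the debug backtrace. It only illustrates which call path performs the allocation, and does not by itself explain the crash.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for the runtime's frame table; not OCaml's code. */
static void *table = NULL;
static bool in_stash = false;              /* true while stash() runs */
static bool init_ran_inside_stash = false; /* did malloc happen in stash()? */

/* One-time setup that allocates, like the malloc seen in the backtraces. */
static void init_table(void) {
    table = malloc(131072);
    assert(table != NULL);
    if (in_stash) init_ran_inside_stash = true;
}

/* Lazy path: the first caller pays for the initialisation. */
static void stash(void) {
    in_stash = true;
    if (table == NULL) init_table();
    in_stash = false;
}

/* Reset the model so both scenarios can run in one process. */
static void reset(void) {
    free(table);
    table = NULL;
    init_ran_inside_stash = false;
}

/* Scenario A: nothing is initialised before the first stash() call. */
static bool lazy_scenario(void) {
    reset();
    stash();
    return init_ran_inside_stash;
}

/* Scenario B: initialise at startup, as the workaround does. */
static bool eager_scenario(void) {
    reset();
    if (table == NULL) init_table();
    stash();
    return init_ran_inside_stash;
}
```

In this model `lazy_scenario()` reports that the allocation ran inside `stash()`, while `eager_scenario()` reports that it did not, which matches the observation that early initialisation keeps the raise path away from malloc.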
|
(0007869) ygrek (reporter) 2012-08-02 11:21 |
As this first appeared in 4.00 and is connected with debugging, here is a blind guess: maybe CFI is an issue? (E.g. Mac OS profiles mallocs and samples the stack at each allocation, and fails badly due to incorrect(?) CFI information.) What happens if one compiles OCaml without CFI enabled (after configure, set ASM_CFI_SUPPORTED=no in Makefile.config)? PS: Not connected, but I have seen tcmalloc segfaulting on Linux while probing the stack for profiling. |
|
(0007871) xleroy (administrator) 2012-08-02 13:26 |
I have a possible explanation, but it's a wild shot.

MacOS X is very touchy about the stack pointer being 16-aligned, or more precisely about C functions being entered with rsp mod 16 = 8. That's because their C compiler sometimes emits SSE2 128-bit load and store instructions that demand 16-alignment of their target addresses.

Re-reading the code of caml_raise_exception in asmrun/amd64.S, I see that it violates this alignment constraint: caml_stash_backtrace is entered with rsp mod 16 = 0. Most of the time, caml_stash_backtrace doesn't call any C library function, so no bad things happen. If, however, the frame table was not initialized before, caml_stash_backtrace calls caml_init_frame_descriptors, which does a lot of work, including calling malloc(). And maybe the malloc() in 10.8 happens to use those strictly-aligned SSE2 instructions.

Bottom line: could you please try to apply the patch (attached to this PR and included below for e-mail convenience) to asmrun/amd64.S and let us know if the problem is still here? What the patch does is simply to fix caml_raise_exception so that it maintains proper stack alignment.

    Index: asmrun/amd64.S
    ===================================================================
    --- asmrun/amd64.S (revision 12802)
    +++ asmrun/amd64.S (working copy)
    @@ -483,9 +483,10 @@
     LBL(110):
             movq    %rax, %r12            /* Save exception bucket */
             movq    %rax, C_ARG_1         /* arg 1: exception bucket */
    -        movq    0(%rsp), C_ARG_2      /* arg 2: pc of raise */
    -        leaq    8(%rsp), C_ARG_3      /* arg 3: sp of raise */
    +        popq    C_ARG_2               /* arg 2: pc of raise */
    +        movq    %rsp, C_ARG_3         /* arg 3: sp at raise */
             movq    %r14, C_ARG_4         /* arg 4: sp of handler */
    +        /* PR#5700: thanks to popq above, stack is now 16-aligned */
             PREPARE_FOR_C_CALL            /* no need to cleanup after */
             call    GCALL(caml_stash_backtrace)
             movq    %r12, %rax            /* Recover exception bucket */
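The alignment argument above can be checked with a little arithmetic. The sketch below models only the stack pointer, in C, under two assumptions that are mine rather than the note's: the raise routine is itself entered with rsp mod 16 = 8, and PREPARE_FOR_C_CALL does not move rsp on this platform. The function names and the sample address used in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Convention described in the note: a C function should be entered with
   rsp mod 16 == 8 (16-aligned before `call`, minus the 8-byte return
   address that `call` pushes). */
static bool entry_ok(uint64_t rsp_at_entry) {
    return rsp_at_entry % 16 == 8;
}

/* `call` pushes an 8-byte return address. */
static uint64_t after_call(uint64_t rsp) { return rsp - 8; }

/* `popq` removes 8 bytes from the stack. */
static uint64_t after_popq(uint64_t rsp) { return rsp + 8; }

/* Before the patch: the return address is read with movq 0(%rsp) and
   stays on the stack, so rsp is unchanged when `call` executes. */
static uint64_t stash_entry_rsp_unpatched(uint64_t rsp_at_raise_entry) {
    assert(entry_ok(rsp_at_raise_entry)); /* assumption from the lead-in */
    return after_call(rsp_at_raise_entry);
}

/* After the patch: popq removes the return address before `call`. */
static uint64_t stash_entry_rsp_patched(uint64_t rsp_at_raise_entry) {
    assert(entry_ok(rsp_at_raise_entry)); /* assumption from the lead-in */
    return after_call(after_popq(rsp_at_raise_entry));
}
```

With a made-up starting value such as `0x7fff5fbff9a8`, the unpatched path enters the callee with rsp mod 16 = 0, and the patched path with rsp mod 16 = 8.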
|
(0007872) serp (reporter) 2012-08-02 14:05 |
Patch is OK. On Mac OS 10.8, "ocaml 4.00.0" with "OCAMLRUNPARAM=b" compiled successfully. Thanks. |
|
(0007873) xleroy (administrator) 2012-08-02 14:50 |
Patch to amd64.S committed in 4.00 bugfix branch (r12815) and on trunk (r12816). I'm not 100% sure it fixes avsm's original issue, but I optimistically assume that it does. Please reopen this PR if the problem persists. |
|
(0007876) doligez (manager) 2012-08-02 20:31 |
I have uploaded a version of Xavier's patch that applies cleanly to 4.00.0, and I confirm that it fixes at least avsm's "ocamlopt.opt" repro case. It seems that you will need to "make clean" after applying this patch (instead of rebuilding right away). |
|
(0007878) avsm (reporter) 2012-08-02 23:58 |
Sorry about the (vacation-induced) delayed response. The patch does indeed eliminate the segfault on 4.00.0 for me also, and the fix is confirmed in gdb:

    $ env OCAMLRUNPARAM=b gdb ./ocamlopt-4.00.0.opt.broken
    (gdb) break caml_stash_backtrace
    Breakpoint 1 at 0x1001a7288: file backtrace.d.c, line 65.
    (gdb) run
    Breakpoint 1, caml_stash_backtrace (exn=4313839736, pc=4296384572, sp=0x7fff5fbff9b0 "??\037\001\001", trapsp=0x7fff5fbff9c0 "") at backtrace.d.c:65
    (gdb) print $rsp
    $1 = (void *) 0x7fff5fbff958
    (gdb) cont
    Continuing.
    Program received signal EXC_BAD_ACCESS, Could not access memory.

With the patch, rsp is 16-byte aligned after the CALL instruction to caml_stash_backtrace:

    (gdb) break caml_stash_backtrace
    Breakpoint 1 at 0x100194ebd
    (gdb) run
    Starting program: /Users/avsm/src/git/bmeurer/ocaml/ocamlopt-4.00.fixed
    Reading symbols for shared libraries +............................. done
    Breakpoint 1, 0x0000000100194ebd in caml_stash_backtrace ()
    (gdb) print $rsp
    $1 = (void *) 0x7fff5fbff950

I backported this to 3.12.1 to help with migrating our repositories, and it gets further but still crashes shortly afterwards from an early caml_c_call:

    Reason: 13 at address: 0x0000000000000000
    0x00007fff8c3e6f88 in large_malloc ()
    (gdb) bt
    #0  0x00007fff8c3e6f88 in large_malloc ()
    #1  0x00007fff8c3ed74f in szone_malloc_should_clear ()
    #2  0x00007fff8c3df183 in malloc_zone_malloc ()
    #3  0x00007fff8c3dfbd7 in malloc ()
    #4  0x00000001001523c4 in caml_stat_alloc ()
    #5  0x000000010014eb05 in caml_init_frame_descriptors ()
    #6  0x0000000100162fd6 in caml_stash_backtrace ()
    #7  0x00000001001634f8 in .L111 ()
    #8  0x000000010014e7cd in caml_raise_constant ()
    #9  0x000000010014e7f0 in caml_raise_not_found ()
    #10 0x000000010015d313 in caml_sys_getenv ()
    #11 0x00000001001633ac in caml_c_call ()

rsp is also misaligned here in caml_c_call:

    Breakpoint 1, 0x0000000100163380 in caml_c_call ()
    (gdb) print $rsp
    $1 = (void *) 0x7fff5fbff9b8

I couldn't spot any differences between the 3.12.1 and 4.00.0 caml_c_call implementations, so I just wanted to check that it isn't just working by a lucky alignment.
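For readers who want to check the three `print $rsp` values from this note, here is a small C sketch that computes each one modulo 16. The table labels are my own summary of the note, and the helper names are hypothetical. One caveat: where gdb stops relative to the function prologue can differ between builds with and without debug information, so the residues are only comparable in the way the note itself compares them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Residue of a stack pointer modulo 16; 0 means 16-byte aligned. */
static unsigned rsp_mod16(uint64_t rsp) {
    return (unsigned)(rsp & 0xF);
}

/* The three values printed by gdb in the note above. */
struct observation {
    const char *where;
    uint64_t rsp;
};

static const struct observation observations[] = {
    { "caml_stash_backtrace, unpatched 4.00.0", 0x7fff5fbff958ULL },
    { "caml_stash_backtrace, patched 4.00.0",   0x7fff5fbff950ULL },
    { "caml_c_call, 3.12.1 with the backport",  0x7fff5fbff9b8ULL },
};

/* How many of the recorded values are not 16-byte aligned. */
static size_t count_misaligned(void) {
    size_t n = 0;
    for (size_t i = 0; i < sizeof observations / sizeof observations[0]; i++) {
        if (rsp_mod16(observations[i].rsp) != 0) n++;
    }
    return n;
}

/* The note's claim for the patched build, expressed as a check. */
static void check_patched_is_aligned(void) {
    assert(rsp_mod16(observations[1].rsp) == 0);
}
```

The unpatched value and the caml_c_call value both leave a residue of 8, and the patched value leaves 0, so `count_misaligned()` returns 2.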
|
(0007880) xleroy (administrator) 2012-08-03 09:37 |
Thanks again for the precious feedback. I don't have MacOS 10.8 installed, so I just instrumented amd64.S to check stack alignment before every call to C functions, and lo and behold, there is another call to caml_stash_backtrace with a misaligned SP...

Attached to this PR (alignment-caml-raise-exception-2.diff) and included below for e-mail convenience is a second patch, to be applied on top of the previous one, which should complete the fix. Let me know how it goes.

(For 3.12.1, just insert "subq $8, %rsp" before "call GCALL(caml_stash_backtrace)" in asmrun/amd64.S, function caml_raise_exception.)

    Index: amd64.S
    ===================================================================
    --- amd64.S (revision 12816)
    +++ amd64.S (working copy)
    @@ -510,6 +510,7 @@
             LOAD_VAR(caml_last_return_address,C_ARG_2)   /* arg 2: pc of raise */
             LOAD_VAR(caml_bottom_of_stack,C_ARG_3)       /* arg 3: sp of raise */
             LOAD_VAR(caml_exception_pointer,C_ARG_4)     /* arg 4: sp of handler */
    +        subq    $8, %rsp              /* PR#5700: maintain stack alignment */
             PREPARE_FOR_C_CALL            /* no need to cleanup after */
             call    GCALL(caml_stash_backtrace)
             movq    %r12, %rax            /* Recover exception bucket */
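The `subq $8, %rsp` in the second patch is a fixed amount of padding. As a rough illustration of where the 8 comes from, the C sketch below computes the padding for an arbitrary stack pointer. The real patch uses a constant because the alignment at that point in amd64.S is known statically; the function names and the sample addresses in the usage note are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes to subtract from rsp so that it is 16-byte aligned right before
   a `call` instruction (the stack grows downwards). */
static uint64_t pad_before_call(uint64_t rsp) {
    return rsp % 16;
}

/* rsp as the callee sees it: padding applied, then the 8-byte return
   address pushed by `call`. */
static uint64_t callee_entry_rsp(uint64_t rsp) {
    uint64_t entry = rsp - pad_before_call(rsp) - 8;
    assert(entry % 16 == 8); /* the invariant the patch restores */
    return entry;
}
```

For a stack pointer with rsp mod 16 = 8, such as the made-up value `0x7fff5fbff9a8`, `pad_before_call` returns 8, matching the `subq $8, %rsp` in the diff. For an already aligned value it returns 0.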
|
(0007881) xleroy (administrator) 2012-08-03 09:43 |
Second patch committed in 4.00 bugfix branch (r12817) and in trunk (r12818). |
|
(0007885) avsm (reporter) 2012-08-03 13:06 |
Perfect! A quick spin sees everything working, and I'll try it more when I'm back next week. For anyone else who needs a quick fix, I've uploaded combined patches against 3.12.1 and 4.00.0 to this ticket, and submitted pull requests to Homebrew:

3.12.1: https://github.com/mxcl/homebrew/pull/13913
4.00.0: in my tree in http://github.com/avsm/homebrew (ocaml4-upgrade branch) while I test it more
Issue History |
|||
| Date Modified | Username | Field | Change |
| 2012-07-26 12:35 | avsm | New Issue | |
| 2012-07-26 18:55 | lefessan | Note Added: 0007814 | |
| 2012-07-26 18:56 | lefessan | Note Edited: 0007814 | View Revisions |
| 2012-07-26 21:01 | avsm | Note Added: 0007815 | |
| 2012-07-26 21:06 | avsm | Note Added: 0007816 | |
| 2012-07-26 21:26 | avsm | Note Added: 0007817 | |
| 2012-07-27 00:04 | lefessan | Note Added: 0007818 | |
| 2012-07-27 10:47 | avsm | Note Added: 0007820 | |
| 2012-07-27 10:48 | avsm | Note Edited: 0007820 | View Revisions |
| 2012-07-27 13:06 | lefessan | Note Added: 0007824 | |
| 2012-07-27 13:06 | lefessan | Assigned To | => lefessan |
| 2012-07-27 13:06 | lefessan | Status | new => acknowledged |
| 2012-07-27 13:07 | lefessan | Assigned To | lefessan => |
| 2012-07-27 13:07 | lefessan | Target Version | => 4.01.0+dev |
| 2012-07-27 13:11 | avsm | Note Added: 0007825 | |
| 2012-07-30 17:04 | avsm | Note Added: 0007835 | |
| 2012-07-31 13:36 | doligez | Target Version | 4.01.0+dev => 4.00.1+dev |
| 2012-08-02 09:50 | doligez | Priority | normal => high |
| 2012-08-02 11:21 | ygrek | Note Added: 0007869 | |
| 2012-08-02 13:18 | xleroy | File Added: alignment-caml-raise-exception.diff | |
| 2012-08-02 13:26 | xleroy | Note Added: 0007871 | |
| 2012-08-02 13:26 | xleroy | Status | acknowledged => feedback |
| 2012-08-02 14:05 | serp | Note Added: 0007872 | |
| 2012-08-02 14:50 | xleroy | Note Added: 0007873 | |
| 2012-08-02 14:50 | xleroy | Status | feedback => resolved |
| 2012-08-02 14:50 | xleroy | Resolution | open => fixed |
| 2012-08-02 14:50 | xleroy | Fixed in Version | => 4.00.1+dev |
| 2012-08-02 20:28 | doligez | File Added: alignment-caml-raise-exception-4.00.0.patch | |
| 2012-08-02 20:31 | doligez | Note Added: 0007876 | |
| 2012-08-02 23:58 | avsm | Note Added: 0007878 | |
| 2012-08-03 00:02 | avsm | File Added: alignment-caml-raise-exception-3.12.1.patch | |
| 2012-08-03 09:30 | xleroy | File Added: alignment-caml-raise-exception-2.diff | |
| 2012-08-03 09:37 | xleroy | Note Added: 0007880 | |
| 2012-08-03 09:43 | xleroy | Note Added: 0007881 | |
| 2012-08-03 12:33 | avsm | File Added: alignment-caml-raise-exception-4.00.0-combined-2.patch | |
| 2012-08-03 12:34 | avsm | File Added: alignment-caml-raise-exception-3.12.1-combined-2.patch | |
| 2012-08-03 13:06 | avsm | Note Added: 0007885 | |