|Anonymous | Login | Signup for a new account||2019-02-24 05:43 CET|
|Main | My View | View Issues | Change Log | Roadmap|
|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0007120||OCaml||back end (clambda to assembly)||public||2016-01-09 08:26||2017-08-27 17:48|
|Target Version||4.03.0+dev / +beta1||Fixed in Version||4.03.0+dev / +beta1|
|Summary||0007120: Extra .cfi_adjust_cfa_offset directive between ret and .cfi_endproc leads to incorrect unwind behavior on OS X|
|Description||ocamlopt (at least sometimes, perhaps always) generates an extra .cfi_adjust_cfa_offset directive between ret and .cfi_endproc at the end of an assembly routine. (.cfi_xxx directives generate DWARF stack unwinding information. This information enables debuggers such as LLDB to produce a stack backtrace even for code that does not adhere to the classical stack frame layout with base pointers. It is also used by the runtime system of certain programming languages, including C++ and Objective-C, to walk and unwind the stack for exception handling.)|
This causes the .cfi_adjust_cfa_offset directives to not be pairwise balanced, i.e. this extra directive does not have a counterpart that cancels it out.
The resulting unwind info causes some programs interpreting this info (at least Apple's libunwind 35.3 on OS X 10.10.1, perhaps other consumers on other systems as well) to take the directive into account for subsequent assembly routines, causing the CFA offsets to accumulate and be incorrect for these subsequent routines.
The consequences are truncated call stacks in LLDB. But more importantly, this can lead to crashes if OCaml code calls into code that performs stack walks to find exception handlers. For example, Cocoa (the OS X GUI library) triggers such stack walks. See also bug 7118.
|Steps To Reproduce||Unzip the attached archive. Then:|
break set -n camlMylib__bar_1223 # The number is non-deterministic. Check mylib.s for the right value.
bt # Observe that the backtrace is not correct: at this point there are really at least 10 stack frames: dylib`start -> main -> caml_main -> caml_start_program -> caml_program -> camlDriver__entry -> caml_apply11 -> camlMylib__foo1 -> caml_apply10 -> camlMylib__bar
target modules show-unwind -n camlMylib__foo1_1199
target modules show-unwind -n camlMylib__foo2_1211
target modules show-unwind -n camlMylib__bar_1223
The unwind info shown for foo1 is correct. Notice, however, that the unwind info shown for foo2 is different from that shown for foo1, even though these two OCaml functions have identical bodies and identical assembly routines. The info for foo2 is incorrect. The extra .cfi_adjust_cfa_offset value specified at the end of the assembly routine for foo1 has been accumulated onto the CFA offsets for foo2.
|Tags||No tags attached.|
|Attached Files||unwind-bug-repro.zip [^] (928 bytes) 2016-01-09 08:26|
I have a fix for this issue. See https://github.com/btj/ocaml/commit/4183ce52251948f986b8a06cf5a9405eeecd0224 [^] .
I now get correct stack traces, except that they still stop at caml_start_program. This is an assembly routine defined in asmrun/amd64.S. I am investigating.
I will try to fix this problem, then add a regression test, and then create a pull request.
I now have clean, full backtraces in LLDB. I submitted a pull request (https://github.com/ocaml/ocaml/pull/408 [^]).
Some useful things I learned:
- You can disassemble object files on OS X using otool -t -v myobj.o
- You can see unwind info using dwarfdump --eh-frame myobj.o
I have now added an automated regression test to the abovementioned pull request. It runs only on OS X. It uses libunwind to walk the stack. It succeeds if the stack walk reaches main.
It took me some investigating to come up with the regression test. I had trouble getting the stack walk to walk the C functions.
It turned out that there was no unwind information (neither compact nor DWARF) for the C functions in the executable, even though the object files for the C modules did have both DWARF and compact unwind info. So, the linker dropped both.
The problem is that by default, ocamlopt passes the -Wl,-no_compact_unwind flag to clang, which causes clang to pass -no_compact_unwind to ld. That explains why ld did not copy the compact unwind info from the object files into the executable. It also doesn't seem to copy the DWARF info. Presumably, that's because there's a separate check that says: if there's compact unwind info, don't copy the DWARF info. (That makes sense: the whole point of the compact info is to reduce the size of the binary (as well as to speed up run-time lookups).)
It turns out you can force ld to copy the DWARF info using the -keep_dwarf_unwind flag. This solves the problem.
Note: it's important that ocamlopt passes -no_compact_unwind to ld. That's because without that flag, ld (at least on my machine, OS X 10.10.1, ld64-241.9) will infer compact unwind info from the DWARF info. Unfortunately, it seems to do so incorrectly: without the -no_compact_unwind flag, ld generates incorrect compact unwind info for the OCaml functions.
(BTW: The regression test is written in C and uses libunwind explicitly. I also wrote a test case in C++ with a custom main function, which throws a C++ exception through the OCaml code. The test case succeeds if the exception is successfully caught in main. Compiling this test case requires passing -cc clang++ to ocamlopt when building the executable. But when using -cc, ocamlopt does not generate the -Wl,-no_compact_unwind option! So, when using -cc clang++, explicitly pass -cclib -Wl,-no_compact_unwind,-keep_dwarf_unwind and the test case will work!)
Philosophical note: I don't know what the original motivation was for having ocamlopt generate DWARF unwinding info. Presumably, the goal was to facilitate debugging, *not* to enable run-time stack unwinding.
Of course, the stability and correctness requirements for the debugging scenario are much weaker than for the run-time stack unwinding scenario. It is, therefore, unfortunate that this info is being used inadvertently by the OS X unwinder at run time. However, there seems to be a solution: if the goal really is only to facilitate debugging, then the frame info should be put into the .debug_frame section of the object file, not the .eh_frame section! (Note: I didn't test that this does the right thing.)
Unfortunately, it seems that the LLVM assembler (invoked by passing a .s file to clang) always emits the information specified by the .cfi_xxx directives into the .eh_frame section, never into the .debug_frame section. I would say that's a bug in the LLVM assembler. (The offending code seems to be in class ASMPrinter::needsCFIMoves (http://llvm.org/doxygen/AsmPrinter_8cpp_source.html#l00799 [^]).) Participating in the run-time unwinding story should always be an opt-in, not a default.
I have filed an LLVM bug (https://llvm.org/bugs/show_bug.cgi?id=26191 [^]).
I have also filed a bug about the LLVM assembler not flagging unbalanced .cfi_adjust_cfa_offset directives as an error (https://llvm.org/bugs/show_bug.cgi?id=26098 [^]).
Update on .eh_frame versus .debug_frame: It turns out that there is an assembler directive called '.cfi_sections'. You can get the assembler to emit CFI information into the .debug_frame section of the object file instead of the .eh_frame section by putting '.cfi_sections .debug_frame' at the top of the assembly file. I tested it; it works on OS X with the clang assembler.
However, the OS X linker (ld64) does not seem to copy the .debug_frame section into either the executable or the .dSYM bundle. In fact, the string "debug_frame" does not appear at all in ld64's source code. So this seems to be a dead end.
In any case, even if ld64 supported .debug_frame, it should copy it into the .dSYM bundle that holds the debugging information, not the executable itself. Therefore, if the reason why ocamlopt emits CFI information is to ensure that processes running OCaml code have proper stack traces, then this would be achieved only if the .dSYM bundle for the executable was present. (And actually, I'm not even sure that debuggers or other backtrace producers such as OS X's crash diagnostic report generator would use the .debug_frame section if present. Conversely, I'm not sure libunwind would *not* use it.) So, if the goal is to support proper stack traces even for "stripped" executables (i.e., ones without a .dSYM), then emitting an .eh_frame section seems to be the only option. (Actually, the other option is to use frame pointers.)
For OCaml users who don't care about native backtraces and who don't want to be exposed to possible remaining or newly introduced CFI bugs, options include passing the -cclib -Wl,-no_keep_dwarf_unwind option to ocamlopt when generating the binary (didn't test it yet), and/or passing the -fno-unwind-tables option to clang when compiling OCaml-to-C glue code modules (tested).
For users who do want native backtraces, it seems advisable to test the CFI information. One way to do so would be to include a call to a perform_stack_walk() function (like the one in the regression test case in my pull request, see above) into each OCaml-to-C glue code function in debug builds. perform_stack_walk() would use libunwind to walk the stack, and fail if the walk did not reach the main function. I might try and contribute something like this to lablgtk.
Actually, the -fno-unwind-tables approach for OCaml-to-C glue code seems to be optimal: it stops libunwind stack walks, but it doesn't prevent proper backtraces from being generated by OS X on a crash, or by lldb, provided the OCaml-to-C glue code has "standard" stack frames (e.g. using a frame pointer). (Didn't test yet.)
Also, I realized that I can figure out the original intention of emitting the .cfi directives by looking into the OCaml repo history. It turns out the CFI information was originally added on Feb 21, 2012 (https://github.com/ocaml/ocaml/commit/2eecf2d4c0933f64bcfa3e620f031497166db338 [^]) in response to PR#5487 (http://caml.inria.fr/mantis/view.php?id=5487 [^]). The goal was to get decent backtraces in gdb on Linux/x86 and Linux/amd64.
|For the record: OCaml's ./configure script has a -no-cfi option.|
This has been fixed by Bart Jacob's PR
which has been merged in 4.03:
edited on: 2016-03-08 14:50
There is a problematic bug reported against the Coq project, which regularly crashes CoqIDE on MacOSX, and I think it is the same issue (an OSX trace shows a crash in libunwind after trying to install an exception handler from Obj-C):
Thanks a lot for the very detailed information in this bug report, I think it will be very helpful for the Coq team.
|2016-01-09 08:26||bartjacobs||New Issue|
|2016-01-09 08:26||bartjacobs||File Added: unwind-bug-repro.zip|
|2016-01-09 10:01||bartjacobs||Note Added: 0015244|
|2016-01-09 13:14||bartjacobs||Note Added: 0015245|
|2016-01-09 14:15||gasche||Relationship added||related to 0007118|
|2016-01-18 11:59||bartjacobs||Note Added: 0015261|
|2016-01-18 13:10||bartjacobs||Note Added: 0015262|
|2016-01-25 15:49||bartjacobs||Note Added: 0015269|
|2016-01-28 16:18||bartjacobs||Note Added: 0015285|
|2016-01-28 17:15||bartjacobs||Note Added: 0015286|
|2016-02-03 12:15||doligez||Status||new => acknowledged|
|2016-02-03 12:15||doligez||Target Version||=> 4.03.0+dev / +beta1|
|2016-03-08 14:36||gasche||Note Added: 0015453|
|2016-03-08 14:36||gasche||Status||acknowledged => resolved|
|2016-03-08 14:36||gasche||Fixed in Version||=> 4.03.0+dev / +beta1|
|2016-03-08 14:36||gasche||Resolution||open => fixed|
|2016-03-08 14:36||gasche||Assigned To||=> gasche|
|2016-03-08 14:49||gasche||Note Added: 0015454|
|2016-03-08 14:49||gasche||Note Edited: 0015454||View Revisions|
|2016-03-08 14:49||gasche||Note Edited: 0015454||View Revisions|
|2016-03-08 14:50||gasche||Note Edited: 0015454||View Revisions|
|2017-02-23 16:35||doligez||Category||OCaml backend (code generation) => Back end (clambda to assembly)|
|2017-02-23 16:44||doligez||Category||Back end (clambda to assembly) => back end (clambda to assembly)|
|Copyright © 2000 - 2011 MantisBT Group|