Segfaults in bytecode callbacks #5166

vicuna · 2010-10-17T00:28:05Z

Original bug ID: 5166
Reporter: yziquel
Assigned to: @damiendoligez
Status: closed (set by @xavierleroy on 2015-12-11T18:28:13Z)
Resolution: unable to duplicate
Priority: normal
Severity: major
Version: 3.12.0
Category: runtime system and C interface
Tags: suggest_closing
Monitored by: phillyg yziquel @ygrek @glondu

Bug description

I've been experimenting a few things with OCaml callbacks from C++ back into OCaml. And been experiencing segfaults.

I've been storing functions in an OCaml array, then registering it with caml_register_generational_global_roots. These functions are callbacks that are called by a C++ instance. They are typically called with caml_callback3.

In native code, everything seems fine, callbacks work rather seamlessly (although I'm still waiting for the next segfault behind the corner). In bytecode, however, I get segfaults. The issue is the following: I've got a (value *) ml_error pointing to an OCaml value in my array. This value (*ml_error) points to some closure (**ml_error). The first word of this closure points to some OCaml bytecode (***ml_error).

However, when passing the value (*ml_error) to caml_callback3, I get segfaults. Following with gdb caml_callback3 into caml_callbackN_exn and then into caml_interprete, I found that the 'jmpq *%rax' instruction that dispatches opcodes into the bytecode interpreter doesn't interpret correctly the closure: After the opcodes ACC 6 and APPLY, %rax gets set to (**ml_error) and not to (***ml_error), then the bytecode interpreter jumps to a wrong address and segfaults.

Things work fine in native code. So I suspect that this is because the callbacks API would need more work for OCaml bytecode in order to be able to pass closures directly to caml_callbackN_exn.

I have in fact been experiencing this behaviour after wrapping

let callback1 = function x y z -> ...

into

let callback2 = function x y z -> callback1 x y (do_something z)

It worked with callback1, and started segfaulting when wrapping callback1 into callback2 and registering callback2 instead of callback1. So I am wondering if the extra layer of indirection comes from this wrapping, and if that is responsible for the segfault.

Moreover, I am wondering something about the LOCAL_CALLBACK_BYTECODE macro. By default, it does not seem set. However, I do not see the point of not making it the default. When debugging, seeing a global variable like this makes me worry about all kinds of crazy threading issues. Why not make it the default? I do not think that speed is really an issue for people who wish to use bytecode callbacks, so what is the point there?

vicuna · 2010-10-18T08:19:31Z

Comment author: yziquel

I have been trying to reproduce the error in a small code snippet, but have been unable to do so. Moreover, naming the arguments in the callback in the callback made the segfault disappear. I'll try to reproduce the error in a small code snippet. In the meantime, please feel free to ask privately for the crashing code.

vicuna · 2010-11-24T18:19:33Z

Comment author: phillyg

Hi, are you working on Linux? If so, I might be able to help you debug if you're still able to reproduce this same segfault on your machine. Please email me at philip@pgbovine.net if you'd like more details.

vicuna · 2011-05-20T12:51:12Z

Comment author: @damiendoligez

I'm worried about this:

I've got a (value *) ml_error pointing to an OCaml value in my array.

If your array is an OCaml array, it is supposed to contain values, not pointer to values.

I'll probably ask for the crashing code after the 3.12.1 release.

vicuna · 2011-05-22T21:24:40Z

Comment author: yziquel

Nope. I've got a C++ class with members such as:

private:
value callbacks_gc_root;
value * ml_error;

The constructor is a C++ initialisation list

public:
myclass (value callbacks_array):
callbacks_gc_root (callbacks_array),
ml_error ((reinterpret_cast<value *> (callbacks_array)) + OFFSET)
{
caml_register_generational_global_root(&callbacks_gc_root);
}

and an error member function:

void error (const int id, const int code, const std::string msg) {
caml_leave_blocking_section();
CAMLparam0();
CAMLlocal2(ml_id, ml_code);
ml_id = caml_copy_int32(id);
ml_code = caml_copy_int32(code);
value ml_msg = Val_owned_objstd::string
(new (std::nothrow) std::string(msg));
caml_callback3(*ml_error, ml_id, ml_code, ml_msg);
caml_local_roots = caml__frame;
caml_enter_blocking_section();
return;
}

Val_owned_obj<> is some custom templatised wrapper for C++ objects. Not the issue here.

ml_error is value * that is pointing to an OCaml value in the array. The array doesn't contain a value *, but a value. In fact, ml_error simply is callbacks_array + OFFSET using pointer arithmetics. So ml_error points into the array. The array is an array of closures.

type closure
external closure : ('a -> 'b) -> closure = "%identity"

extern "C" CAMLprim value myclass_ctor (value ml_callbacks) {
CAMLparam1(ml_callbacks);
CAMLreturn (Val_owned_obj
(new (std::nothrow) myclass (ml_callbacks)));
}

external myclass_ctor : closure array -> pointer = "myclass_ctor"
let myclass_ctor callbacks =
let module C = (val callbacks : CALLBACKS) in
myclass_ctor [|
...
closure C.error;
...
|]

Please ask for the full code when you need it. I have also followed step by step the execution of the bytecode interpreter to the point where it crashes and will gladly hand that to you.

Native code works fine (or rather worked, as this codebase has been dropped).

vicuna · 2012-02-09T18:56:11Z

Comment author: @damiendoligez

ml_error is value * that is pointing to an OCaml value in the array.

I don't see how it could cause your problem, but you are not allowed to point in the middle of an OCaml array. If the GC moves it, you're doomed.

Do you still have the code?

vicuna · 2012-02-09T20:26:58Z

Comment author: yziquel

I still have the code, and somewhere on my hard drive, I still have details of my investigation following step by step the bytecode interprete (caml_interprete() and what follows). However, it's not really in any presentable state, and I've somehow lost interest in it. If you have an interested in these incomplete and perhaps out of date code samples and trace of the bytecode interpreter, I'm willing to dig all this up.

I do not believe that this a GC issue for the following reason: systematic failure in bytecode mode, and systemic success in native code. Moreover, to my understanding, using caml_register_generational_global_root(&callbacks_gc_root) should fix the array properly in memory.

As far as I've been able to debug the bytecode, it seems to me that it's a problem with the calling convention for bytecode callbacks. If I remember well, the bytecode pushes things in some accumulator, and at one point, it pops it into rax, and you have then have a jump to the address pointed to by rax (or to an offset). This jump is the moment when the bytecode interpreter shifts interpretation to the next bytecode instruction.

The issue was that there was garbage in %rax.

But that is described in the initial bug report.

May I send the code to your (i.e. Damien's) address and not put it on the bug tracker?

vicuna · 2012-03-12T15:27:16Z

Comment author: @damiendoligez

GC or not, this still looks like a bug in OCaml. Please send me the code and I'll see if I can ferret it out.

vicuna · 2013-08-01T19:44:39Z

Comment author: @yallop

I think this report might involve a misunderstanding about roots.

It seems like the mental model you (yziquel) have is as follows: in general it's unsafe to keep pointers into OCaml objects, since the GC can move them around, but registering a root "fixes" the corresponding value in memory so that it won't be moved by the GC. According to this model, your code is correct: you register callbacks_gc_root (an alias for callbacks_array) as a root, so callbacks_array shouldn't be moved by the collector, and ml_error (calculated as an offset from callbacks_array) should reliably point to a member of the immovable array.

The problem is that roots don't work that way. When you register a root the message to the GC is not "please don't move this OCaml value"; rather, it's "please update this pointer whenever the OCaml value is moved". (Registering a root has other effects: it prevents the value being collected, for one thing, but that's not really what concerns us here.) The GC's responsibility to update roots explains why you have to pass the address of the value rather than the plain value to the caml_register_generational_global_root function.

The easiest way to fix the C code, then, is to change ml_error from a 'value *' to a plain 'value', and register ml_error itself as a root. Since roots are updated when the corresponding values move, you can then be sure that ml_error always points to the correct location.

vicuna · 2013-10-10T15:51:11Z

Comment author: @alainfrisch

yziquel: can you confirm yallop's interpretation?

vicuna closed this as completed Dec 11, 2015

vicuna added the stdlib label Mar 14, 2019

vicuna assigned damiendoligez Mar 14, 2019

vicuna added the bug label Mar 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Segfaults in bytecode callbacks #5166

Segfaults in bytecode callbacks #5166

vicuna commented Oct 17, 2010

vicuna commented Oct 18, 2010

vicuna commented Nov 24, 2010

vicuna commented May 20, 2011

vicuna commented May 22, 2011

vicuna commented Feb 9, 2012

vicuna commented Feb 9, 2012

vicuna commented Mar 12, 2012

vicuna commented Aug 1, 2013

vicuna commented Oct 10, 2013

Segfaults in bytecode callbacks #5166

Segfaults in bytecode callbacks #5166

Comments

vicuna commented Oct 17, 2010

Bug description

vicuna commented Oct 18, 2010

vicuna commented Nov 24, 2010

vicuna commented May 20, 2011

vicuna commented May 22, 2011

vicuna commented Feb 9, 2012

vicuna commented Feb 9, 2012

vicuna commented Mar 12, 2012

vicuna commented Aug 1, 2013

vicuna commented Oct 10, 2013