marshal and C structures crash
Date: -- (:)
From: Andres Varon <avaron@g...>
Subject: marshal and C structures crash
Hello Everyone,

I would like to ask a question regarding a bug I have been observing  
in one program, which I have been unable to fix:

The program in question is a large phylogenetic analysis application
(bioinformatics) written in OCaml and C. It's almost
ready for public beta testing _except_ for this particular bug.
The bulk of the code is in OCaml (~70,000 LOC), with a small fraction
of core functions in C (obviously it's hard to post the code in
question). It runs in both sequential and parallel versions (the
latter using MPI), and makes heavy use of polymorphic variants,
functors, and object-oriented features, wherever each best fits our
requirements.

The parallel version had been broken for a while, although it used to
run without a problem. A few weeks ago, when I updated the code for
parallel runs (using a master-slave distributed model), I started to
observe slaves segfaulting after a while. I narrowed the problem down
to a marshal-related issue that I can reproduce in the sequential
version by doing the following:

1. Load some data into the program and marshal to a file what I would
have sent to a slave.
2. Run the program in a loop that unmarshals the data from the file
and repeats a short script. The loop usually ends with a crash after
a few iterations.

The data structure being marshaled is pure OCaml (Sets and Maps of
other OCaml structures), so all C structures (wrapped with a
custom tag) are produced locally. The segfault happens if the
computations are concentrated in either one of our only two C custom
types, which were programmed independently by two of us (and perform
extremely different computations).

If I skip the unmarshal step and run the same loop by just
reading the data from the input files, the program works flawlessly,
and valgrind, the watchpoints I have set in gdb, and the many
assertions in our C and OCaml code all report nothing. I also
check every array access on our C side to ensure that each
read and write occurs within bounds.

However, if the data comes from the marshaled channel, after a few
iterations the program segfaults, and the reason appears to be
(according to valgrind, and all my attempts to detect a failure as
early as possible) that some custom value is freed while still alive
on the OCaml side (what I catch is a double free, or that the
contents of a DNA sequence are invalid because it has been freed
already). Note, again, that I am completely unable to reproduce the
issue (not even a single warning or assertion failure) unless I
unmarshal the data to start with. Moreover, the error occurs with two
data structures that were programmed independently by two
experienced OCaml programmers. I believe that OCaml is duplicating
the custom value, so that I get two OCaml values pointing at the
same C structure. Is that possible? Although one of the C types uses
a pool of arrays to speed up some computations, the other one has only
one pointer, going from the OCaml custom value to the C structure, and
from there to a couple of arrays; that's it. Also note that every
type is treated as an immutable data structure, and we provide no
in-place modifications in our OCaml interface.
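
The suspected failure mode can be modelled in plain C, without the
OCaml runtime. The following is only a sketch under stated
assumptions: the `seq` layout, the `handle` wrapper, the magic
constants, and the functions `seq_alloc`, `seq_finalize_checked` and
`demo_shared_pointer` are all hypothetical and are not taken from the
program described above. It shows what happens when two wrappers end
up holding the same C pointer and a finalizer runs once per wrapper,
and how a liveness marker lets the second call be reported instead of
becoming a double free.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical liveness markers; the values are arbitrary. */
#define SEQ_ALIVE 0x5E9A11FEu
#define SEQ_DEAD  0xDEADBEEFu

/* Hypothetical stand-in for one of the C structures (not the real layout). */
typedef struct seq {
    unsigned int magic;    /* SEQ_ALIVE while valid, SEQ_DEAD once finalized */
    size_t len;
    unsigned char *data;
} seq;

/* Models the payload of a custom block: a single pointer to the C structure. */
typedef struct handle {
    seq *s;
} handle;

/* Allocate a zero-filled sequence and mark it alive; NULL on failure. */
seq *seq_alloc(size_t len)
{
    seq *s = malloc(sizeof *s);
    if (s == NULL) return NULL;
    s->data = calloc(len > 0 ? len : 1, 1);
    if (s->data == NULL) { free(s); return NULL; }
    s->len = len;
    s->magic = SEQ_ALIVE;
    return s;
}

/* Models a finalizer with a debug check. Returns 0 on the first
 * finalization and -1 if the structure was already finalized.
 * The struct itself is deliberately NOT freed here, so that the marker
 * stays readable for later checks; only the array is released. */
int seq_finalize_checked(handle *h)
{
    assert(h != NULL && h->s != NULL);
    seq *s = h->s;
    if (s->magic != SEQ_ALIVE) {
        fprintf(stderr, "seq %p finalized twice (magic=0x%X)\n",
                (void *)s, s->magic);
        return -1;
    }
    s->magic = SEQ_DEAD;
    free(s->data);
    s->data = NULL;
    return 0;
}

/* Two handles, one C structure: returns 1 if the second finalization
 * was detected, 0 if it was not, -1 if allocation failed. */
int demo_shared_pointer(void)
{
    seq *s = seq_alloc(16);
    if (s == NULL) return -1;
    handle a = { s };
    handle b = a;                       /* bitwise copy of the wrapper */
    int first = seq_finalize_checked(&a);
    int second = seq_finalize_checked(&b);
    free(s);                            /* release the quarantined struct */
    return (first == 0 && second == -1) ? 1 : 0;
}
```

Calling `demo_shared_pointer()` returns 1 and writes one diagnostic to
stderr. A similar marker check at the top of a real finalizer such as
seq_CAML_free would be one more place to catch the problem early, at
the cost of keeping the finalized structs around while debugging.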

Of course, I have been hunting for a bug in my C functions, but I
can't find anything that could cause the double free (the only way to
call seq_CAML_free is from the garbage collector!) or an out-of-bounds
write. Is there anything special about marshaling that could be
causing this? Perhaps some particular pattern in the way OCaml
allocates memory during the unmarshaling step? Any ideas about what
the problem could be, or where I should look?

As you see, I'm lost; I just don't see where else I can place a check
in our code.

For those of you who reached this line of my email, thanks for the
effort! I will listen to any ideas that pop up in your minds.

best,

Andrés Varón
American Museum of Natural History