[Caml-list] In need of serious help regarding threading
From: Xavier Leroy <Xavier.Leroy@i...>
Subject: Re: [Caml-list] In need of serious help regarding threading
> It appears (with the help of a friend good with a debugger) that
> there's something going wrong with my code, and how it's interacting
> with the GC.

Please don't get offended by what I'm going to say, but I have the
feeling that you're attacking extremely hard problems without adequate
debugging tools and without enough understanding of the OCaml runtime
system.

I'll try to provide some explanations nonetheless, but please don't
bombard this list with too many cries for help.

What the debugging session shows is a problem with return address
determination during the stack scanning performed by the GC.  To find
heap pointers contained in the stack, the GC scans it one frame at a
time, using compiler-generated frame descriptors to locate the
pointers.  The frame descriptors are keyed to the return address in
the Caml code through a hash table (variable frame_descriptors, hash table
lookup at lines 135-141 and 249-255 in file asmrun/roots.c).

Your run appears to be looping in the hash table lookup, indicating
that (1) the return address being looked up (variable retaddr) is not
in the table (this should never happen in normal operation), and (2)
your environment lets you dereference the NULL pointer without
crashing (bad idea!).
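
To make the failure mode concrete, here is a minimal, self-contained sketch
of an open-addressing table keyed on return addresses.  It is not the code
from asmrun/roots.c: the type, the table size, the hash function and the
names insert_descr and find_descr are all assumptions made for illustration.
Unlike a lookup that probes until it finds a match, this one stops at an
empty (NULL) slot and after one full pass over the table, so a missing
return address is reported instead of causing a NULL dereference or an
endless loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified frame descriptor.  A real descriptor would also
   record the frame size and the positions of the live heap pointers. */
typedef struct {
  uintptr_t retaddr;
} frame_descr;

#define TABLE_SIZE 8u                 /* must be a power of two */
#define TABLE_MASK (TABLE_SIZE - 1u)

/* NULL marks an empty slot. */
frame_descr *frame_table[TABLE_SIZE];

/* Illustrative hash only: drop the low bits, then mask. */
static unsigned hash_retaddr(uintptr_t retaddr) {
  return (unsigned)(retaddr >> 3) & TABLE_MASK;
}

/* Insert with linear probing.  Assumes the table never becomes full. */
void insert_descr(frame_descr *d) {
  unsigned h = hash_retaddr(d->retaddr);
  while (frame_table[h] != NULL) h = (h + 1u) & TABLE_MASK;
  frame_table[h] = d;
}

/* Defensive lookup: returns NULL when retaddr is not in the table. */
frame_descr *find_descr(uintptr_t retaddr) {
  unsigned h = hash_retaddr(retaddr);
  for (unsigned probes = 0; probes < TABLE_SIZE; probes++) {
    frame_descr *d = frame_table[h];
    if (d == NULL) return NULL;            /* empty slot: key is absent */
    if (d->retaddr == retaddr) return d;   /* found */
    h = (h + 1u) & TABLE_MASK;             /* collision: try next slot */
  }
  return NULL;                             /* table full, key absent */
}
```

A NULL result from find_descr would be the place to print the offending
return address and abort, rather than continuing the stack walk.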

A good way to debug this is to print the value of the "retaddr" local
variable at lines 134 and 249 in asmrun/roots.c and correlate it with your
disassembly.  It should always refer to code addresses immediately
following a "call camlModule__function" or a "call caml_call_gc"
instruction.  While you're at it, also print the "sp" variable: it
should stay within the stack of a thread.  The problem is likely to
come from wrong values of the bottom_of_stack and last_return_address
starting points for the stack walk.
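
The kind of tracing suggested above could look like the following sketch.
The helper check_frame and its stack_low/stack_high parameters are
hypothetical; the bounds of each thread's stack would have to come from
whatever your thread implementation records.  The helper logs both values
and flags a stack pointer that has left the current thread's stack.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Log one step of the stack walk.  Returns 1 if sp lies within
   [stack_low, stack_high), 0 otherwise.  All names are illustrative. */
int check_frame(const void *sp, uintptr_t retaddr,
                const void *stack_low, const void *stack_high) {
  uintptr_t p  = (uintptr_t)sp;
  uintptr_t lo = (uintptr_t)stack_low;
  uintptr_t hi = (uintptr_t)stack_high;
  int ok = (p >= lo && p < hi);
  fprintf(stderr, "sp=0x%" PRIxPTR " retaddr=0x%" PRIxPTR "%s\n",
          p, retaddr, ok ? "" : "  (sp outside thread stack!)");
  return ok;
}
```

Each logged retaddr can then be compared by hand against the disassembly.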

Your second test (Gc.full_major() in the main thread) further suggests
that the problem does not occur if the main thread is the one calling
the GC.  Try to put Gc.full_major() in another thread to see what
happens.  That could narrow the problem to the saving and restoring of
caml_bottom_of_stack and caml_last_return_address globals during
context switches.
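
As an illustration of what "saving and restoring" means here, the sketch
below keeps a per-thread copy of the two starting points and swaps them at
every context switch.  It is a simplified model, not the runtime's code:
the globals gc_bottom_of_stack and gc_last_return_address are stand-ins for
the runtime's globals, and thread_ctx and context_switch are hypothetical
names.

```c
#include <stdint.h>

/* Stand-ins for the globals from which the GC starts its stack walk. */
char *gc_bottom_of_stack;
uintptr_t gc_last_return_address;

/* Per-thread copy of those starting points (hypothetical layout). */
typedef struct {
  char *bottom_of_stack;
  uintptr_t last_return_address;
} thread_ctx;

/* Called when a thread is descheduled: remember where its Caml stack
   walk must start. */
void save_ctx(thread_ctx *t) {
  t->bottom_of_stack = gc_bottom_of_stack;
  t->last_return_address = gc_last_return_address;
}

/* Called when a thread is rescheduled: make the globals describe it. */
void restore_ctx(const thread_ctx *t) {
  gc_bottom_of_stack = t->bottom_of_stack;
  gc_last_return_address = t->last_return_address;
}

/* Both halves must run on every switch.  If either is skipped, the GC
   walks one thread's stack using another thread's starting points. */
void context_switch(thread_ctx *from, const thread_ctx *to) {
  save_ctx(from);
  restore_ctx(to);
}
```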

Finally, notice that your stacks are tiny (4096 words???).  Unless
they are protected by guard pages, expect a lot of trouble when they
overflow (they will).
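
For reference, one common way to get a guard page is sketched below.  It
assumes a POSIX-like system with mmap and mprotect and a stack that grows
toward lower addresses, which may not match your environment; the function
name alloc_guarded_stack is hypothetical.  The lowest page of the mapping
is made inaccessible, so an overflow faults immediately instead of silently
corrupting neighbouring memory.

```c
#define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS with strict -std= on glibc */
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map a stack of at least usable_bytes with one inaccessible page below
   it.  Returns the lowest usable address, or NULL on failure.  The total
   mapping size is stored in *total_out; the mapping itself starts one
   page below the returned pointer (needed for munmap). */
void *alloc_guarded_stack(size_t usable_bytes, size_t *total_out) {
  size_t page = (size_t)sysconf(_SC_PAGESIZE);
  size_t usable = (usable_bytes + page - 1) / page * page;  /* round up */
  size_t total = usable + page;

  void *base = mmap(NULL, total, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (base == MAP_FAILED) return NULL;

  /* Guard page at the low end: any access to it raises a fault. */
  if (mprotect(base, page, PROT_NONE) != 0) {
    munmap(base, total);
    return NULL;
  }
  *total_out = total;
  return (char *)base + page;
}
```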

- Xavier Leroy