Mantis Bug Tracker

ID: 0003019
Project: OCaml
Category: OCaml documentation
View Status: public
Date Submitted: 2004-07-30 14:01
Last Update: 2014-07-16 20:23
Reporter: administrator
Assigned To:
Priority: normal
Severity: text
Reproducibility: always
Status: acknowledged
Resolution: open
Platform:
OS:
OS Version:
Product Version:
Target Version: 4.03.0+dev
Fixed in Version:
Summary: 0003019: POSIX-threads & segfaults

Description:

Hi,

I am currently developing a distributed file system for a customer
and have run into a serious problem with POSIX-threads (native code)
that leads to segfaults. There were also reports about instability with
VM-threads, but I haven't yet managed to reproduce that. I'm a bit at a
loss now where this problem really comes from, but I suspect that there
may be something wrong with the GC.

Here are some stack backtraces of a core dump (OCaml 3.08.0+1 / Linux 2.6.7):

The thread which raised this problem:

  (gdb) bt
  #0 0x08086484 in caml_do_local_roots ()
  #1 0x0807f786 in caml_thread_scan_roots ()
  #2 0x08086353 in caml_oldify_local_roots ()
  #3 0x08087c2d in caml_empty_minor_heap ()
  #4 0x08087d08 in caml_minor_collection ()
  #5 0x08088773 in caml_alloc_string ()
  #6 0x080857e8 in alloc_inet_addr ()
  #7 0x08085a1e in alloc_sockaddr ()
  #8 0x08085755 in unix_accept ()
  #9 0x0805421b in camlMs_common_impl__start_1514 ()
  #10 0x0000000b in ?? ()
  #11 0xbffff9a8 in ?? ()
  #12 0x080541f0 in camlMs_common_impl__start_1514 ()
  #13 0x0000000b in ?? ()
  #14 0x00000001 in ?? ()
  #15 0x00000015 in ?? ()
  #16 0x401d22ac in ?? ()
  #17 0x0804ca38 in camlServer__entry ()
  #18 0x080c39fc in ?? ()
  #19 0x08095ef8 in camlServer__11 ()
  #20 0x0804b9f9 in caml_startup__code_begin ()
  #21 0x08092bfa in caml_start_program ()
  #22 0x00000000 in ?? ()
  #23 0xbffff9f8 in ?? ()
  #24 0xbffffa20 in ?? ()
  #25 0xbffffa94 in ?? ()
  #26 0x080ab200 in caml_termination_hook ()
  #27 0x08085d62 in caml_main ()
  Previous frame inner to this frame (corrupt stack?)

The above thread doesn't do anything else but wait for network connections
and start a thread for each of those.

The other threads:

(gdb) info threads
  6 process 21185 0x401343c7 in select () from /lib/tls/libc.so.6
  5 process 21188 0x40031266 in __lll_mutex_lock_wait ()
   from /lib/tls/libpthread.so.0
  4 process 21189 0x40031266 in __lll_mutex_lock_wait ()
   from /lib/tls/libpthread.so.0
  3 process 21190 0x401343c7 in select () from /lib/tls/libc.so.6
  2 process 21204 0x40134667 in sync () from /lib/tls/libc.so.6
* 1 process 21184 0x08086484 in caml_do_local_roots ()

Thread #2 was just handling a write transaction and spent time in
the "sync" system call while thread #1 crashed.

Thread #3 waits with Thread.delay and periodically unlocks a mutex.
Nothing special here.

Thread #4 just locks on a mutex to wait for shutdown requests.

Thread #5 locks on the mutex which is periodically released by thread #3.

Thread #6 is the "tick" thread.

The initial part of the disassembled code of "caml_do_local_roots"
looks as follows:

  (gdb) disassemble
  Dump of assembler code for function caml_do_local_roots:
  0x0808644a <caml_do_local_roots+0>: push %ebp
  0x0808644b <caml_do_local_roots+1>: mov %esp,%ebp
  0x0808644d <caml_do_local_roots+3>: push %edi
  0x0808644e <caml_do_local_roots+4>: push %esi
  0x0808644f <caml_do_local_roots+5>: push %ebx
  0x08086450 <caml_do_local_roots+6>: sub $0x1c,%esp
  0x08086453 <caml_do_local_roots+9>: mov 0xc(%ebp),%eax
  0x08086456 <caml_do_local_roots+12>: mov %eax,0xfffffff0(%ebp)
  0x08086459 <caml_do_local_roots+15>: mov 0x10(%ebp),%ebx
  0x0808645c <caml_do_local_roots+18>: mov 0x14(%ebp),%edx
  0x0808645f <caml_do_local_roots+21>: mov %edx,0xffffffec(%ebp)
  0x08086462 <caml_do_local_roots+24>: test %eax,%eax
  0x08086464 <caml_do_local_roots+26>: je 0x80864f6 <caml_do_local_roots+172>
  0x0808646a <caml_do_local_roots+32>: mov %ebx,%eax
  0x0808646c <caml_do_local_roots+34>: shr $0x3,%eax
  0x0808646f <caml_do_local_roots+37>: and 0x80ab328,%eax
  0x08086475 <caml_do_local_roots+43>: mov 0x80ab318,%ecx
  0x0808647b <caml_do_local_roots+49>: mov 0x80ab328,%edx
  0x08086481 <caml_do_local_roots+55>: mov (%ecx,%eax,4),%edi
  0x08086484 <caml_do_local_roots+58>: cmp %ebx,(%edi)
  0x08086486 <caml_do_local_roots+60>: je 0x808648d <caml_do_local_roots+67>
  [snip]

So the segfault happens due to the cmp-opcode. Inspecting the registers
it turns out that %edi is NULL :-(

  (gdb) info registers
  eax 0x1 1
  ecx 0x80d9ef0 135110384
  edx 0x1fff 8191
  ebx 0x1 1
  esp 0xbffff768 0xbffff768
  ebp 0xbffff790 0xbffff790
  esi 0x0 0
  edi 0x0 0
  eip 0x8086484 0x8086484
  eflags 0x10202 66050
  cs 0x73 115
  ss 0x7b 123
  ds 0x7b 123
  es 0x7b 123
  fs 0x0 0
  gs 0x33 51
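
Read together with the disassembly, these registers are consistent with a
masked hash-table probe: the key in %ebx is shifted right by 3, masked with
the word at 0x80ab328 (0x1fff in %edx), used to index a table of pointers at
0x80ab318, and the selected entry is dereferenced without a NULL check. A key
of 1 is not a plausible code address, so the probe would end on an empty slot.
The C sketch below models that pattern under stated assumptions: all names
(frame_descr, hash_retaddr, insert, lookup) are hypothetical, the table is
tiny, and a NULL guard is added so the sketch reports the miss instead of
crashing. It is a reading of the disassembly, not the runtime's source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical descriptor: the first word is the key, matching
   "cmp %ebx,(%edi)" in the disassembly. */
typedef struct {
    uintptr_t retaddr;
} frame_descr;

#define TABLE_SIZE 8u                    /* power of two, so TABLE_SIZE - 1 is a mask */
static frame_descr *table[TABLE_SIZE];   /* empty slots are NULL */
static const uintptr_t mask = TABLE_SIZE - 1u;

/* Mirrors "shr $0x3" followed by "and <mask>". */
static size_t hash_retaddr(uintptr_t addr) {
    return (size_t)((addr >> 3) & mask);
}

/* Linear-probing insert; the caller must leave at least one slot empty. */
static void insert(frame_descr *d) {
    size_t h;
    assert(d != NULL);
    h = hash_retaddr(d->retaddr);
    while (table[h] != NULL) h = (h + 1u) & mask;
    table[h] = d;
}

/* Guarded probe: returns NULL on a miss. Without the NULL test, a key that
   is not in the table ends in a read through a NULL pointer. */
static frame_descr *lookup(uintptr_t retaddr) {
    size_t h = hash_retaddr(retaddr);
    for (;;) {
        frame_descr *d = table[h];
        if (d == NULL) return NULL;
        if (d->retaddr == retaddr) return d;
        h = (h + 1u) & mask;
    }
}
```

With a couple of descriptors inserted, lookup(1) returns NULL; that is the
case in which the unguarded code would fault on the compare.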

Some more details about the stack frame:

  (gdb) info frame
  Stack level 0, frame at 0xbffff798:
   eip = 0x8086484 in caml_do_local_roots; saved eip 0x807f786
   called by frame at 0xbffff7c8
   Arglist at 0xbffff790, args:
   Locals at 0xbffff790, Previous frame's sp is 0xbffff798
   Saved registers:
    ebx at 0xbffff784, ebp at 0xbffff790, esi at 0xbffff788, edi at 0xbffff78c,
    eip at 0xbffff794

If you have an intuition of what might be going wrong, I'd find it easier
to isolate the problem. What should I be looking out for to help you
track down the bug?

Best regards,
Markus

--
Markus Mottl http://www.oefai.at/~markus markus@oefai.at

Tags: No tags attached.

Notes
(0003086)
administrator (administrator)
2004-08-06 15:05

Dear Markus,

It's hard to diagnose GC problems just from a core dump and the
limited info you provided. However, assuming the stack backtrace
below is correct, I am surprised that it goes straight from
camlMs_common_impl__start_1514 to unix_accept without going through
camlUnix__accept_XXXX and caml_c_call.

> (gdb) bt
> #0 0x08086484 in caml_do_local_roots ()
> #1 0x0807f786 in caml_thread_scan_roots ()
> #2 0x08086353 in caml_oldify_local_roots ()
> #3 0x08087c2d in caml_empty_minor_heap ()
> #4 0x08087d08 in caml_minor_collection ()
> #5 0x08088773 in caml_alloc_string ()
> #6 0x080857e8 in alloc_inet_addr ()
> #7 0x08085a1e in alloc_sockaddr ()
> #8 0x08085755 in unix_accept ()
> #9 0x0805421b in camlMs_common_impl__start_1514 ()

You do call Unix.accept in Ms_common_impl and not a re-declaration of
the "unix_accept" external that would claim it to be "noalloc", right?

Best wishes,

- Xavier Leroy

(0003087)
administrator (administrator)
2004-08-06 15:05

Need more info.
(0003088)
administrator (administrator)
2004-08-06 16:05

Dear Xavier,

On Fri, 06 Aug 2004, Xavier Leroy wrote:
> It's hard to diagnose GC problems just from a core dump and the
> limited info you provided. However, assuming the stack backtrace
> below is correct, I am surprised that it goes straight from
> camlMs_common_impl__start_1514 to unix_accept without going through
> camlUnix__accept_XXXX and caml_c_call.
>
> > (gdb) bt
> > #0 0x08086484 in caml_do_local_roots ()
> > #1 0x0807f786 in caml_thread_scan_roots ()
> > #2 0x08086353 in caml_oldify_local_roots ()
> > #3 0x08087c2d in caml_empty_minor_heap ()
> > #4 0x08087d08 in caml_minor_collection ()
> > #5 0x08088773 in caml_alloc_string ()
> > #6 0x080857e8 in alloc_inet_addr ()
> > #7 0x08085a1e in alloc_sockaddr ()
> > #8 0x08085755 in unix_accept ()
> > #9 0x0805421b in camlMs_common_impl__start_1514 ()
>
> You do call Unix.accept in Ms_common_impl and not a re-declaration of
> the "unix_accept" external that would claim it to be "noalloc", right?

No, I am not using any other implementation of "unix_accept". But I
have some other (unrelated) functions that were declared "noalloc",
which used "caml_{enter,leave}_blocking_section" - and that's bad,
because I also have signal handlers. The segfaults have stopped since
I removed these declarations.
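
Why combining "noalloc" with a blocking section is risky can be illustrated
with a toy model. It assumes (the report does not spell this out) that a
regular external call goes through a wrapper that records where a later stack
scan should start, while a "noalloc" external is called directly and skips
that step. Every name here (last_return_address, gc_scan_start,
blocking_stub, call_regular, call_noalloc) is hypothetical; the sketch models
the bookkeeping only and does not use the OCaml runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one global records where a stack scan should start,
   as a c-call wrapper would store it before entering C code. */
static uintptr_t last_return_address = 0;   /* 0 = never recorded */

typedef uintptr_t (*stub_fn)(void);

/* Stands in for a collection that runs while the stub is blocked, for
   example one triggered by a signal handler or by another thread.
   It reports the address the scan would start from. */
static uintptr_t gc_scan_start(void) {
    return last_return_address;
}

/* A stub that "blocks"; while it is blocked, a collection may run. */
static uintptr_t blocking_stub(void) {
    return gc_scan_start();
}

/* Regular external: the wrapper records the caller's return address first. */
static uintptr_t call_regular(stub_fn f, uintptr_t caller_retaddr) {
    assert(f != NULL);
    last_return_address = caller_retaddr;
    return f();
}

/* "noalloc" external: called directly, so nothing is recorded and any
   collection during the call sees whatever was stored last. */
static uintptr_t call_noalloc(stub_fn f, uintptr_t caller_retaddr) {
    assert(f != NULL);
    (void)caller_retaddr;
    return f();
}
```

In this model a collection during call_noalloc starts from the address left
behind by the previous regular call, which is stale for the current stack.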

There are still some other, fortunately rare, problems (freezing/unresponsive
servers) in my project now that seem thread-related, but it's very
difficult to judge whether this is a bug in my code, OCaml, GLIBC or
even the kernel (I only recently found a kernel bug - one can never
trust any part of the system...).

I consider this particular problem fixed right now. Btw., could you
please add documentation about the declarations "noalloc" + "float",
and about "caml_{enter,leave}_blocking_section" to the manual?

Best regards,
Markus

--
Markus Mottl http://www.oefai.at/~markus markus@oefai.at

(0006842)
doligez (administrator)
2012-01-30 15:33

caml_{enter,leave}_blocking_section are documented as of 3.12.1.
We still need to document "noalloc" and "float".

- Issue History
Date Modified    | Username      | Field               | Change
2005-11-18 10:14 | administrator | New Issue           |
2012-01-30 15:33 | doligez       | Note Added: 0006842 |
2012-07-11 17:33 | doligez       | Target Version      | => 4.01.0+dev
2012-07-11 17:33 | doligez       | Description Updated |
2012-07-31 13:37 | doligez       | Target Version      | 4.01.0+dev => 4.00.1+dev
2012-09-11 09:27 | doligez       | Target Version      | 4.00.1+dev => 4.01.0+dev
2013-08-19 19:19 | doligez       | Severity            | minor => text
2013-08-19 19:19 | doligez       | Resolution          | unable to reproduce => open
2013-08-19 19:19 | doligez       | Category            | OCaml general => OCaml documentation
2013-08-19 19:19 | doligez       | Target Version      | 4.01.0+dev => 4.01.1+dev
2014-05-25 20:20 | doligez       | Target Version      | 4.01.1+dev => 4.02.0+dev
2014-07-16 20:23 | doligez       | Target Version      | 4.02.0+dev => 4.03.0+dev

