| ID | Project | Category | View Status | Date Submitted | Last Update | ||||||
| 0003019 | OCaml | OCaml general | public | 2004-07-30 14:01 | 2012-09-11 09:27 | ||||||
| Reporter | administrator | ||||||||||
| Assigned To | |||||||||||
| Priority | normal | Severity | minor | Reproducibility | always | ||||||
| Status | acknowledged | Resolution | unable to reproduce | ||||||||
| Platform | OS | OS Version | |||||||||
| Product Version | |||||||||||
| Target Version | 4.01.0+dev | Fixed in Version | |||||||||
| Summary | 0003019: POSIX-threads & segfaults | ||||||||||
| Description | Hi,

I am currently developing a distributed file system for a customer and have run into a serious problem with POSIX-threads (native code) that leads to segfaults. There were also reports about instability with VM-threads, but I haven't yet managed to reproduce that.

I'm a bit at a loss now where this problem really comes from, but I suspect that there may be something wrong with the GC. Here are some stack backtraces of a core dump (OCaml 3.08.0+1 / Linux 2.6.7):

The thread which raised this problem:

    (gdb) bt
    #0  0x08086484 in caml_do_local_roots ()
    #1  0x0807f786 in caml_thread_scan_roots ()
    #2  0x08086353 in caml_oldify_local_roots ()
    #3  0x08087c2d in caml_empty_minor_heap ()
    #4  0x08087d08 in caml_minor_collection ()
    #5  0x08088773 in caml_alloc_string ()
    #6  0x080857e8 in alloc_inet_addr ()
    #7  0x08085a1e in alloc_sockaddr ()
    #8  0x08085755 in unix_accept ()
    #9  0x0805421b in camlMs_common_impl__start_1514 ()
    #10 0x0000000b in ?? ()
    #11 0xbffff9a8 in ?? ()
    #12 0x080541f0 in camlMs_common_impl__start_1514 ()
    #13 0x0000000b in ?? ()
    #14 0x00000001 in ?? ()
    #15 0x00000015 in ?? ()
    #16 0x401d22ac in ?? ()
    #17 0x0804ca38 in camlServer__entry ()
    #18 0x080c39fc in ?? ()
    #19 0x08095ef8 in camlServer__11 ()
    #20 0x0804b9f9 in caml_startup__code_begin ()
    #21 0x08092bfa in caml_start_program ()
    #22 0x00000000 in ?? ()
    #23 0xbffff9f8 in ?? ()
    #24 0xbffffa20 in ?? ()
    #25 0xbffffa94 in ?? ()
    #26 0x080ab200 in caml_termination_hook ()
    #27 0x08085d62 in caml_main ()
    Previous frame inner to this frame (corrupt stack?)

The above thread doesn't do anything else but wait for network connections and start a thread for each of those.

The other threads:

    (gdb) info threads
      6 process 21185 0x401343c7 in select () from /lib/tls/libc.so.6
      5 process 21188 0x40031266 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
      4 process 21189 0x40031266 in __lll_mutex_lock_wait () from /lib/tls/libpthread.so.0
      3 process 21190 0x401343c7 in select () from /lib/tls/libc.so.6
      2 process 21204 0x40134667 in sync () from /lib/tls/libc.so.6
    * 1 process 21184 0x08086484 in caml_do_local_roots ()

Thread #2 was just handling a write transaction and spent time in the "sync" system call while thread #1 crashed.
Thread #3 waits with Thread.delay and periodically unlocks a mutex. Nothing special here.
Thread #4 just locks on a mutex to wait for shutdown requests.
Thread #5 locks on the mutex which is periodically released by thread #3.
Thread #6 is the "tick" thread.

The initial part of the disassembled code of "caml_do_local_roots" looks as follows:

    (gdb) disassemble
    Dump of assembler code for function caml_do_local_roots:
    0x0808644a <caml_do_local_roots+0>:  push %ebp
    0x0808644b <caml_do_local_roots+1>:  mov %esp,%ebp
    0x0808644d <caml_do_local_roots+3>:  push %edi
    0x0808644e <caml_do_local_roots+4>:  push %esi
    0x0808644f <caml_do_local_roots+5>:  push %ebx
    0x08086450 <caml_do_local_roots+6>:  sub $0x1c,%esp
    0x08086453 <caml_do_local_roots+9>:  mov 0xc(%ebp),%eax
    0x08086456 <caml_do_local_roots+12>: mov %eax,0xfffffff0(%ebp)
    0x08086459 <caml_do_local_roots+15>: mov 0x10(%ebp),%ebx
    0x0808645c <caml_do_local_roots+18>: mov 0x14(%ebp),%edx
    0x0808645f <caml_do_local_roots+21>: mov %edx,0xffffffec(%ebp)
    0x08086462 <caml_do_local_roots+24>: test %eax,%eax
    0x08086464 <caml_do_local_roots+26>: je 0x80864f6 <caml_do_local_roots+172>
    0x0808646a <caml_do_local_roots+32>: mov %ebx,%eax
    0x0808646c <caml_do_local_roots+34>: shr $0x3,%eax
    0x0808646f <caml_do_local_roots+37>: and 0x80ab328,%eax
    0x08086475 <caml_do_local_roots+43>: mov 0x80ab318,%ecx
    0x0808647b <caml_do_local_roots+49>: mov 0x80ab328,%edx
    0x08086481 <caml_do_local_roots+55>: mov (%ecx,%eax,4),%edi
    0x08086484 <caml_do_local_roots+58>: cmp %ebx,(%edi)
    0x08086486 <caml_do_local_roots+60>: je 0x808648d <caml_do_local_roots+67>
    [snip]

So the segfault happens due to the cmp-opcode. Inspecting the registers it turns out that %edi is NULL :-(

    (gdb) info registers
    eax    0x1        1
    ecx    0x80d9ef0  135110384
    edx    0x1fff     8191
    ebx    0x1        1
    esp    0xbffff768 0xbffff768
    ebp    0xbffff790 0xbffff790
    esi    0x0        0
    edi    0x0        0
    eip    0x8086484  0x8086484
    eflags 0x10202    66050
    cs     0x73       115
    ss     0x7b       123
    ds     0x7b       123
    es     0x7b       123
    fs     0x0        0
    gs     0x33       51

Some more details about the stack frame:

    (gdb) info frame
    Stack level 0, frame at 0xbffff798:
     eip = 0x8086484 in caml_do_local_roots; saved eip 0x807f786
     called by frame at 0xbffff7c8
     Arglist at 0xbffff790, args:
     Locals at 0xbffff790, Previous frame's sp is 0xbffff798
     Saved registers:
      ebx at 0xbffff784, ebp at 0xbffff790, esi at 0xbffff788, edi at 0xbffff78c, eip at 0xbffff794

If you have an intuition of what might be going wrong, I'd find it easier to isolate the problem. What should I be looking out for to help you track down the bug?

Best regards,
Markus

--
Markus Mottl http://www.oefai.at/~markus markus@oefai.at | ||||||||||
| Tags | No tags attached. | ||||||||||
| Attached Files | |||||||||||
Notes |
|
|
(0003086) administrator (administrator) 2004-08-06 15:05 |
Dear Markus,

It's hard to diagnose GC problems just from a core dump and the limited info you provided. However, assuming the stack backtrace below is correct, I am surprised that it goes straight from camlMs_common_impl__start_1514 to unix_accept without going through camlUnix__accept_XXXX and caml_c_call.

> (gdb) bt
> #0 0x08086484 in caml_do_local_roots ()
> #1 0x0807f786 in caml_thread_scan_roots ()
> #2 0x08086353 in caml_oldify_local_roots ()
> #3 0x08087c2d in caml_empty_minor_heap ()
> #4 0x08087d08 in caml_minor_collection ()
> #5 0x08088773 in caml_alloc_string ()
> #6 0x080857e8 in alloc_inet_addr ()
> #7 0x08085a1e in alloc_sockaddr ()
> #8 0x08085755 in unix_accept ()
> #9 0x0805421b in camlMs_common_impl__start_1514 ()

You do call Unix.accept in Ms_common_impl and not a re-declaration of the "unix_accept" external that would claim it to be "noalloc", right?

Best wishes,

- Xavier Leroy |
|
(0003087) administrator (administrator) 2004-08-06 15:05 |
Need more info. |
|
(0003088) administrator (administrator) 2004-08-06 16:05 |
Dear Xavier,

On Fri, 06 Aug 2004, Xavier Leroy wrote:
> It's hard to diagnose GC problems just from a core dump and the
> limited info you provided. However, assuming the stack backtrace
> below is correct, I am surprised that it goes straight from
> camlMs_common_impl__start_1514 to unix_accept without going through
> camlUnix__accept_XXXX and caml_c_call.
>
> > (gdb) bt
> > #0 0x08086484 in caml_do_local_roots ()
> > #1 0x0807f786 in caml_thread_scan_roots ()
> > #2 0x08086353 in caml_oldify_local_roots ()
> > #3 0x08087c2d in caml_empty_minor_heap ()
> > #4 0x08087d08 in caml_minor_collection ()
> > #5 0x08088773 in caml_alloc_string ()
> > #6 0x080857e8 in alloc_inet_addr ()
> > #7 0x08085a1e in alloc_sockaddr ()
> > #8 0x08085755 in unix_accept ()
> > #9 0x0805421b in camlMs_common_impl__start_1514 ()
>
> You do call Unix.accept in Ms_common_impl and not a re-declaration of
> the "unix_accept" external that would claim it to be "noalloc", right?

No, I am not using any other implementation of "unix_accept". But I had some other (unrelated) functions that were declared "noalloc" and that used "caml_{enter,leave}_blocking_section" - and that's bad, because I also have signal handlers. The segfaults have stopped since I removed these declarations.

There are still some other, fortunately rare problems (freezing/unresponsive servers) in my project that seem thread-related, but it's very difficult to judge whether this is a bug in my code, OCaml, GLIBC or even the kernel (I only recently found a kernel bug - one can never trust any part of the system...).

I consider this particular problem fixed right now. Btw., could you please add documentation about the declarations "noalloc" + "float", and about "caml_{enter,leave}_blocking_section" to the manual?

Best regards,
Markus

--
Markus Mottl http://www.oefai.at/~markus markus@oefai.at |
|
(0006842) doligez (manager) 2012-01-30 15:33 |
caml_{enter,leave}_blocking_section are documented as of 3.12.1. We still need to document "noalloc" and "float". |
Issue History |
|||
| Date Modified | Username | Field | Change |
| 2005-11-18 10:14 | administrator | New Issue | |
| 2012-01-30 15:33 | doligez | Note Added: 0006842 | |
| 2012-07-11 17:33 | doligez | Target Version | => 4.01.0+dev |
| 2012-07-11 17:33 | doligez | Description Updated | View Revisions |
| 2012-07-31 13:37 | doligez | Target Version | 4.01.0+dev => 4.00.1+dev |
| 2012-09-11 09:27 | doligez | Target Version | 4.00.1+dev => 4.01.0+dev |