Document "noalloc" annotation on primitive declarations #3019

Closed
vicuna opened this issue Jul 30, 2004 · 5 comments

vicuna commented Jul 30, 2004

Original bug ID: 3019
Reporter: administrator
Assigned to: @alainfrisch
Status: closed (set by @xavierleroy on 2017-02-16T14:18:20Z)
Resolution: fixed
Priority: normal
Severity: text
Target version: 4.03.0+dev / +beta1
Fixed in version: 4.03.0+dev / +beta1
Category: documentation

Bug description

Hi,

I am currently developing a distributed file system for a customer
and have run into a serious problem with POSIX-threads (native code)
that leads to segfaults. There were also reports about instability with
VM-threads, but I haven't yet managed to reproduce that. I'm a bit at a
loss now where this problem really comes from, but I suspect that there
may be something wrong with the GC.

Here are some stack backtraces of a core dump (OCaml 3.08.0+1 / Linux 2.6.7):

The thread which raised this problem:

(gdb) bt
#0 0x08086484 in caml_do_local_roots ()
#1 0x0807f786 in caml_thread_scan_roots ()
#2 0x08086353 in caml_oldify_local_roots ()
#3 0x08087c2d in caml_empty_minor_heap ()
#4 0x08087d08 in caml_minor_collection ()
#5 0x08088773 in caml_alloc_string ()
#6 0x080857e8 in alloc_inet_addr ()
#7 0x08085a1e in alloc_sockaddr ()
#8 0x08085755 in unix_accept ()
#9 0x0805421b in camlMs_common_impl__start_1514 ()
#10 0x0000000b in ?? ()
#11 0xbffff9a8 in ?? ()
#12 0x080541f0 in camlMs_common_impl__start_1514 ()
#13 0x0000000b in ?? ()
#14 0x00000001 in ?? ()
#15 0x00000015 in ?? ()
#16 0x401d22ac in ?? ()
#17 0x0804ca38 in camlServer__entry ()
#18 0x080c39fc in ?? ()
#19 0x08095ef8 in camlServer__11 ()
#20 0x0804b9f9 in caml_startup__code_begin ()
#21 0x08092bfa in caml_start_program ()
#22 0x00000000 in ?? ()
#23 0xbffff9f8 in ?? ()
#24 0xbffffa20 in ?? ()
#25 0xbffffa94 in ?? ()
#26 0x080ab200 in caml_termination_hook ()
#27 0x08085d62 in caml_main ()
Previous frame inner to this frame (corrupt stack?)

The above thread doesn't do anything else but wait for network connections
and start a thread for each of those.

The other threads:

(gdb) info threads
  6 process 21185 0x401343c7 in select () from /lib/tls/libc.so.6
  5 process 21188 0x40031266 in __lll_mutex_lock_wait ()
     from /lib/tls/libpthread.so.0
  4 process 21189 0x40031266 in __lll_mutex_lock_wait ()
     from /lib/tls/libpthread.so.0
  3 process 21190 0x401343c7 in select () from /lib/tls/libc.so.6
  2 process 21204 0x40134667 in sync () from /lib/tls/libc.so.6
* 1 process 21184 0x08086484 in caml_do_local_roots ()

Thread #2 was just handling a write transaction and spent time in
the "sync" system call while thread #1 crashed.

Thread #3 waits with Thread.delay and periodically unlocks a mutex.
Nothing special here.

Thread #4 just locks on a mutex to wait for shutdown requests.

Thread #5 locks on the mutex which is periodically released by thread #3.

Thread #6 is the "tick" thread.

The initial part of the disassembled code of "caml_do_local_roots"
looks as follows:

(gdb) disassemble
Dump of assembler code for function caml_do_local_roots:
0x0808644a <caml_do_local_roots+0>: push %ebp
0x0808644b <caml_do_local_roots+1>: mov %esp,%ebp
0x0808644d <caml_do_local_roots+3>: push %edi
0x0808644e <caml_do_local_roots+4>: push %esi
0x0808644f <caml_do_local_roots+5>: push %ebx
0x08086450 <caml_do_local_roots+6>: sub $0x1c,%esp
0x08086453 <caml_do_local_roots+9>: mov 0xc(%ebp),%eax
0x08086456 <caml_do_local_roots+12>: mov %eax,0xfffffff0(%ebp)
0x08086459 <caml_do_local_roots+15>: mov 0x10(%ebp),%ebx
0x0808645c <caml_do_local_roots+18>: mov 0x14(%ebp),%edx
0x0808645f <caml_do_local_roots+21>: mov %edx,0xffffffec(%ebp)
0x08086462 <caml_do_local_roots+24>: test %eax,%eax
0x08086464 <caml_do_local_roots+26>: je 0x80864f6 <caml_do_local_roots+172>
0x0808646a <caml_do_local_roots+32>: mov %ebx,%eax
0x0808646c <caml_do_local_roots+34>: shr $0x3,%eax
0x0808646f <caml_do_local_roots+37>: and 0x80ab328,%eax
0x08086475 <caml_do_local_roots+43>: mov 0x80ab318,%ecx
0x0808647b <caml_do_local_roots+49>: mov 0x80ab328,%edx
0x08086481 <caml_do_local_roots+55>: mov (%ecx,%eax,4),%edi
0x08086484 <caml_do_local_roots+58>: cmp %ebx,(%edi)
0x08086486 <caml_do_local_roots+60>: je 0x808648d <caml_do_local_roots+67>
[snip]

So the segfault happens due to the cmp-opcode. Inspecting the registers
it turns out that %edi is NULL :-(

(gdb) info registers
eax 0x1 1
ecx 0x80d9ef0 135110384
edx 0x1fff 8191
ebx 0x1 1
esp 0xbffff768 0xbffff768
ebp 0xbffff790 0xbffff790
esi 0x0 0
edi 0x0 0
eip 0x8086484 0x8086484
eflags 0x10202 66050
cs 0x73 115
ss 0x7b 123
ds 0x7b 123
es 0x7b 123
fs 0x0 0
gs 0x33 51

Some more details about the stack frame:

(gdb) info frame
Stack level 0, frame at 0xbffff798:
eip = 0x8086484 in caml_do_local_roots; saved eip 0x807f786
called by frame at 0xbffff7c8
Arglist at 0xbffff790, args:
Locals at 0xbffff790, Previous frame's sp is 0xbffff798
Saved registers:
ebx at 0xbffff784, ebp at 0xbffff790, esi at 0xbffff788, edi at 0xbffff78c,
eip at 0xbffff794

If you have an intuition of what might be going wrong, I'd find it easier
to isolate the problem. What should I be looking out for to help you
track down the bug?

Best regards,
Markus

--
Markus Mottl http://www.oefai.at/~markus markus@oefai.at

vicuna commented Aug 6, 2004

Comment author: administrator

Dear Markus,

It's hard to diagnose GC problems just from a core dump and the
limited info you provided. However, assuming the stack backtrace
below is correct, I am surprised that it goes straight from
camlMs_common_impl__start_1514 to unix_accept without going through
camlUnix__accept_XXXX and caml_c_call.

(gdb) bt
#0 0x08086484 in caml_do_local_roots ()
#1 0x0807f786 in caml_thread_scan_roots ()
#2 0x08086353 in caml_oldify_local_roots ()
#3 0x08087c2d in caml_empty_minor_heap ()
#4 0x08087d08 in caml_minor_collection ()
#5 0x08088773 in caml_alloc_string ()
#6 0x080857e8 in alloc_inet_addr ()
#7 0x08085a1e in alloc_sockaddr ()
#8 0x08085755 in unix_accept ()
#9 0x0805421b in camlMs_common_impl__start_1514 ()

You do call Unix.accept in Ms_common_impl and not a re-declaration of
the "unix_accept" external that would claim it to be "noalloc", right?

Best wishes,

- Xavier Leroy

vicuna commented Aug 6, 2004

Comment author: administrator

Need more info.

vicuna commented Aug 6, 2004

Comment author: administrator

Dear Xavier,

On Fri, 06 Aug 2004, Xavier Leroy wrote:

> It's hard to diagnose GC problems just from a core dump and the
> limited info you provided. However, assuming the stack backtrace
> below is correct, I am surprised that it goes straight from
> camlMs_common_impl__start_1514 to unix_accept without going through
> camlUnix__accept_XXXX and caml_c_call.
>
> (gdb) bt
> #0 0x08086484 in caml_do_local_roots ()
> #1 0x0807f786 in caml_thread_scan_roots ()
> #2 0x08086353 in caml_oldify_local_roots ()
> #3 0x08087c2d in caml_empty_minor_heap ()
> #4 0x08087d08 in caml_minor_collection ()
> #5 0x08088773 in caml_alloc_string ()
> #6 0x080857e8 in alloc_inet_addr ()
> #7 0x08085a1e in alloc_sockaddr ()
> #8 0x08085755 in unix_accept ()
> #9 0x0805421b in camlMs_common_impl__start_1514 ()
>
> You do call Unix.accept in Ms_common_impl and not a re-declaration of
> the "unix_accept" external that would claim it to be "noalloc", right?

No, I am not using any other implementation of "unix_accept". But I
have some other (unrelated) functions that were declared "noalloc",
which used "caml_{enter,leave}_blocking_section" - and that's bad,
because I also have signal handlers. The segfaults have stopped since
I removed these declarations.

There are still some other, fortunately rare problems (freezing/unresponsive
servers) in my project now that seem thread-related, but it's very
difficult to judge whether this is a bug in my code, OCaml, GLIBC or
even the kernel (I only recently found a kernel bug - one can never
trust any part of the system...).

I consider this particular problem fixed right now. Btw., could you
please add documentation about the declarations "noalloc" + "float",
and about "caml_{enter,leave}_blocking_section" to the manual?

Best regards,
Markus

--
Markus Mottl http://www.oefai.at/~markus markus@oefai.at
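
For readers unfamiliar with the annotation discussed above: marking an external as noalloc promises the compiler that the C stub neither allocates on the OCaml heap, nor raises, nor releases the runtime lock, so native code may call it directly instead of going through caml_c_call. A stub that calls caml_enter_blocking_section breaks that promise, which is consistent with the crash in caml_do_local_roots reported here. The following is a minimal sketch, not code from the reporter's project. It assumes a recent OCaml with the attribute syntax of 4.03 or later; `fill` and `my_blocking_write` are hypothetical names, and "caml_fill_bytes" is, to my knowledge, the runtime primitive that the standard library's Bytes module binds in the same way.

```ocaml
(* Sketch only. [fill] is a hypothetical local name for an existing runtime
   primitive; nothing here is taken from the reporter's code base. *)

(* Safe use: the stub writes into an existing buffer and returns at once.
   It does not allocate on the OCaml heap, raise, or release the runtime
   lock, so the noalloc promise holds. *)
external fill : bytes -> int -> int -> char -> unit = "caml_fill_bytes"
  [@@noalloc]

(* Unsafe use, kept as a comment on purpose: a stub that enters a blocking
   section (or allocates) must not carry the annotation. In the 3.08-era
   syntax the annotation was a trailing string, roughly:

     external my_blocking_write : Unix.file_descr -> string -> int
       = "my_blocking_write_stub" "noalloc"

   Dropping the annotation restores the normal calling sequence through
   caml_c_call, which is the fix described in the message above. *)

let () =
  let buf = Bytes.make 4 'a' in
  (* Overwrite 2 bytes starting at offset 1. *)
  fill buf 1 2 'z';
  print_endline (Bytes.to_string buf)
```

Compiled with either ocamlc or ocamlopt, this should print `azza`.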

vicuna commented Jan 30, 2012

Comment author: @damiendoligez

caml_{enter,leave}_blocking_section are documented as of 3.12.1.
We still need to document "noalloc" and "float".

vicuna commented Dec 9, 2015

Comment author: @alainfrisch

New attributes @unboxed, @noalloc are documented.
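
As a rough illustration of the attribute forms mentioned above, here is a small sketch assuming OCaml 4.03 or later. `fast_sqrt` is a hypothetical local name; the two primitive names are the ones I believe the standard library uses for its own `sqrt` ("caml_sqrt_float" for bytecode, the C library's `sqrt` for native code). The older spelling requested earlier in this thread appended the strings "noalloc" and "float" to the declaration instead.

```ocaml
(* Sketch only: [fast_sqrt] is a hypothetical name for illustration.

   [@@unboxed] lets native code pass and return the float as a raw C double,
   calling the second primitive ("sqrt") directly.
   [@@noalloc] promises that the C function does not allocate on the OCaml
   heap, raise, or release the runtime lock.

   Legacy equivalent before 4.03, shown for comparison only:
     external fast_sqrt : float -> float = "caml_sqrt_float" "sqrt" "float" *)
external fast_sqrt : float -> float = "caml_sqrt_float" "sqrt"
  [@@unboxed] [@@noalloc]

let () = Printf.printf "%.1f\n" (fast_sqrt 9.0)
```

This should print `3.0`. The annotations are only appropriate because `sqrt` is a pure computation on its argument; a stub that blocks or allocates needs a plain declaration.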

vicuna closed this as completed Feb 16, 2017
vicuna added this to the 4.03.0 milestone Mar 14, 2019
vicuna added the bug label Mar 19, 2019