Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crashes in garbage collector #7307

Closed
vicuna opened this issue Jul 27, 2016 · 4 comments
Closed

Crashes in garbage collector #7307

vicuna opened this issue Jul 27, 2016 · 4 comments

Comments

@vicuna
Copy link

vicuna commented Jul 27, 2016

Original bug ID: 7307
Reporter: @stijn-devriendt
Status: resolved (set by @damiendoligez on 2016-09-02T15:35:03Z)
Resolution: fixed
Priority: normal
Severity: crash
Platform: amd64
OS: Ubuntu
OS Version: 14.04
Version: 4.01.0
Fixed in version: 4.03.0
Category: back end (clambda to assembly)
Related to: #6484 #6486

Bug description

In very specific conditions, using the "%caml_string_get64" builtin causes crashes in the garbage collector. See the additional information for some deep-dive info of the compiler output, referenced in the explanation below.

The break_things method in the referenced program constructs an 8-byte string with a specific value (more on that later), then repeatedly calls %caml_string_get64 on it and stores the result in a list, in order to fill up the minor heap with live data.
At some point in time, this function will call the garbage collector. Looking at the disassembly, what happens is that the 8-byte load has already been done, leaving the contents in %rsi while calling the GC.
The GC will then walk the stack in caml_darken_local_roots and find that the frametable expects %rsi to contain a "value".

Following that, it will call Oldify on it. Oldify performs some additional checks, most prominently that value must point to the minor heap (which is why we craft the value from the address of the string via address_of, we also subtract 10 (8 to be before the string header and 2 to obtain an unaligned pointer, easily recognisable in gdb - on a target that doesn't support unaligned loads the crashes might even be easier to debug).

Ultimately this crashes the GC. In my runs of the attached progrem it crashes in caml_oldify_one when allocating through caml_alloc_shr with "Fatal Error: out of memory" (due to being called during minor heap collection).

In our real-world case, we use "%caml_string_get64" to deserialize network and file data. This crashed in 2 locations:

  • with debug run-time we have sporadic crashes in caml_darken where it checks that the block color is not blue (this entirely depends on the data of the value and whatever it points to). A good extra assert in that function would be:
    Assert((v & 1) || ((v & (sizeof(intnat)-1)) == 0)). In fact, this assert could be useful in more places.
  • in normal builds we've typically seen crashes in mark_slice, most likely when it has already promoted the incorrect value to the major heap.

Most of our runs take a while to crash, presumably because the ever-incrementing 64-bit number needs to fall within the minor heap address range.

Steps to reproduce

ocamlopt -o test test.ml
./test

Optionally, place a breakpoint inside caml_oldify_local_roots (or caml_do_roots) where iterating the live_ofs while tracking the retaddr for the proper call and *root for the tainted value.

Additional information

Sample code: (test.ml)
external set64: string -> int -> int64 -> unit = "%caml_string_set64"
external get64: string -> int -> int64 = "%caml_string_get64"

let address_of (x:'a) : int64 =
if Obj.is_block (Obj.repr x) then
Int64.shift_left (Int64.of_int (Obj.magic x)) 1
else
invalid_arg "Can only find address of boxed values"

let rec break_things str lst =
let newval = get64 str 0 in
break_things str (newval :: lst)

let () =
let str = String.create 8 in
let address = address_of str in
let () = Printf.printf "Address = 0x%08LX" address in
let () = set64 str 0 (Int64.sub address 10L) in
break_things str []

Relevant disassembly:
0000000000403460 <camlTest__break_things_1012>:
403460: 55 push rbp
403461: 48 89 e5 mov rbp,rsp
403464: 48 83 ec 00 sub rsp,0x0
403468: 48 89 c7 mov rdi,rax
40346b: 48 c7 c0 07 00 00 00 mov rax,0x7
403472: 48 83 c0 00 add rax,0x0
403476: 48 8b 77 f8 mov rsi,QWORD PTR [rdi-0x8]
40347a: 48 c1 ee 0a shr rsi,0xa
40347e: 48 8d 34 f5 ff ff ff lea rsi,[rsi8-0x1]
403485: ff
403486: 48 0f b6 14 37 movzx rdx,BYTE PTR [rdi+rsi
1]
40348b: 48 29 d6 sub rsi,rdx
40348e: 48 39 c6 cmp rsi,rax
403491: 76 4f jbe 4034e2 <camlTest__break_things_1012+0x82>
403493: 48 8b 37 mov rsi,QWORD PTR [rdi] ##### << Load from caml_string_get64
403496: 49 83 ef 30 sub r15,0x30
40349a: 48 8d 05 b7 30 23 00 lea rax,[rip+0x2330b7] # 636558 <caml_young_limit>
4034a1: 4c 3b 38 cmp r15,QWORD PTR [rax]
4034a4: 72 35 jb 4034db <camlTest__break_things_1012+0x7b> ### <<< call to GC with rsi a plain int64
4034a6: 49 8d 47 08 lea rax,[r15+0x8]
4034aa: 48 c7 40 f8 ff 08 00 mov QWORD PTR [rax-0x8],0x8ff
4034b1: 00
4034b2: 48 8d 15 27 0b 23 00 lea rdx,[rip+0x230b27] # 633fe0 <caml_int64_ops>
4034b9: 48 89 10 mov QWORD PTR [rax],rdx
4034bc: 48 89 70 08 mov QWORD PTR [rax+0x8],rsi
4034c0: 48 8d 70 18 lea rsi,[rax+0x18]
4034c4: 48 c7 46 f8 00 08 00 mov QWORD PTR [rsi-0x8],0x800
4034cb: 00
4034cc: 48 89 06 mov QWORD PTR [rsi],rax
4034cf: 48 89 5e 08 mov QWORD PTR [rsi+0x8],rbx
4034d3: 48 89 f8 mov rax,rdi
4034d6: 48 89 f3 mov rbx,rsi
4034d9: eb 8d jmp 403468 <camlTest__break_things_1012+0x8>
4034db: e8 f0 b0 01 00 call 41e5d0 <caml_call_gc>
4034e0: eb b4 jmp 403496 <camlTest__break_things_1012+0x36>
4034e2: e8 dd b4 01 00 call 41e9c4 <caml_ml_array_bound_error>
4034e7: 66 0f 1f 84 00 00 00 nop WORD PTR [rax+rax*1+0x0]
4034ee: 00 00

caml_frametable:
0x62dd90: 0x00000000004034e0 0x0003000700030010
0x62dda0: 0x0000000000000005
==> retaddr = 0x4034e0, frame_size=0x10, num_live=3, live_ofs={7, 3, 5}
==> translates to registers 3, 1, 2 = rsi, rbx, rdi

@vicuna
Copy link
Author

vicuna commented Jul 27, 2016

Comment author: @xavierleroy

Thanks for such a precise report and analysis.

I can reproduce the problem with OCaml 4.02.3 but not with 4.03.0.

Actually, I'm pretty sure this particular bug was fixed in OCaml 4.03.0 as part of this commit:
ac02f56

Let us know if you see the problem manifesting itself with 4.03.0.

@vicuna
Copy link
Author

vicuna commented Jul 29, 2016

Comment author: @stijn-devriendt

Unfortunately, porting our codebase over to 4.03 is quite a sizeable endeavor. We will likely revisit our anticipated schedule for this move, however.

I did statically verify that the 4.03 compiler generates proper frametables for the provided testcase. As you say, that patch indeed looks like the proper fix for this issue.

Another thing I've been trying to assess from the patch is whether other code constructs (besides bigarray and the caml_string_getX builtins) are affected by this bug. It looks like any code calling into the GC, while non-OCaml-values are in registers or on the stack may be affected, but it's rather hard for me to confirm this claim. If you could shine your light on this, it would be much appreciated.

I'm fine with closing this bug as duplicate of #6486.

@vicuna
Copy link
Author

vicuna commented Jul 29, 2016

Comment author: @stijn-devriendt

Also, for now we'll be using the non-builtin variants of caml_string_getX and our preliminary testing has survived the duration in which we otherwise started seeing crashes.

@vicuna vicuna closed this as completed Sep 2, 2016
@vicuna
Copy link
Author

vicuna commented Sep 2, 2016

Comment author: @damiendoligez

Marking as "resolved" for the moment. Don't hesitate to reopen this PR if you reproduce the problem with 4.03.0 or later.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant