ocamlopt.opt on aarch64 runs out of memory compiling camlp4 #6486

Closed
vicuna opened this issue Jul 14, 2014 · 11 comments

vicuna commented Jul 14, 2014

Original bug ID: 6486
Reporter: Richard Jones
Assigned to: @mshinwell
Status: resolved (set by @mshinwell on 2014-07-18T14:57:02Z)
Resolution: fixed
Priority: high
Severity: crash
Version: 4.02.0+beta1 / +rc1
Target version: 4.02.0+dev
Category: back end (clambda to assembly)
Related to: #6484 #7307

Bug description

Note this is with ocaml 4.02.0 from git (8c1e5cd), on aarch64.

Compiling camlp4 from source gives the error:

  • /usr/bin/ocamlopt.opt -c -g -w a -I camlp4/import -warn-error A-3 -I camlp4/config -I camlp4/boot -o camlp4/boot/Camlp4.cmx camlp4/boot/Camlp4.ml
    Fatal error: out of memory.
    Command exited with code 2.
    Makefile:9: recipe for target 'byte' failed
    make: *** [byte] Error 10

Note that it is highly unlikely that this is really running out of memory,
since the machine has 16 GB of RAM. It's also a real aarch64
machine, not an emulator.

There is no core dump. Is there a way to get a core dump at
the point where Out_of_memory is raised?

vicuna commented Jul 14, 2014

Comment author: @mshinwell

Ensure that the compiler is being built with -g, in the toplevel Makefile.
Then just run it under gdb, break on "exit", and you can get a backtrace just before it quits.

vicuna commented Jul 14, 2014

Comment author: Richard Jones

I gave up trying to bisect this bug as there is no git commit which is "good" on arm64. Back to debugging it the old-fashioned way.

The stack trace from 'exit' is:

#0 0x000003ffb7d68794 in exit () from /lib64/libc.so.6
#1 0x00000000005ed6e8 in caml_fatal_error (
msg=msg@entry=0x5eee78 "Fatal error: out of memory.\n") at misc.c:55
#2 0x00000000005dd24c in caml_alloc_shr (
wosize=wosize@entry=12926428316172288, tag=) at memory.c:414
#3 0x00000000005dc3e4 in caml_oldify_one (v=,
p=) at minor_gc.c:129
#4 0x00000000005da4bc in caml_oldify_local_roots () at roots.c:202
#5 0x00000000005dc5e4 in caml_empty_minor_heap () at minor_gc.c:233
#6 0x00000000005dc71c in caml_minor_collection () at minor_gc.c:276
#7 0x00000000005db5a0 in caml_garbage_collection () at signals_asm.c:70
#8 0x00000000005ecba8 in caml_call_gc ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

That looks like a very big allocation.

(gdb) frame 3
#3 0x00000000005dc3e4 in caml_oldify_one (v=,
p=) at minor_gc.c:129
129 result = caml_alloc_shr (sz, tag);
(gdb) print sz
$1 = 12926428316172288

Looks like some kind of stack frame corruption to me.
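
To put the requested size in perspective, here is a quick sanity check of the arithmetic. It is only a sketch: it assumes 8-byte words (as on aarch64), and the helper names are invented for illustration rather than taken from the OCaml runtime.

```c
#include <stdint.h>

/* Hypothetical helpers, not part of the OCaml runtime. */

/* Bytes needed for a block of `wosize` words, assuming 8-byte words.
   (No overflow check: fine for the value in the backtrace, which is
   below 2^61.) */
static uint64_t words_to_bytes(uint64_t wosize) {
    return wosize * 8u;
}

/* Whole GiB contained in `bytes`. */
static uint64_t bytes_to_gib(uint64_t bytes) {
    return bytes >> 30;
}

/* 1 if a request of `wosize` words is larger than `ram_gib` GiB. */
static int exceeds_ram(uint64_t wosize, uint64_t ram_gib) {
    return bytes_to_gib(words_to_bytes(wosize)) > ram_gib;
}
```

For the value printed by gdb, words_to_bytes(12926428316172288) is 103411426529378304 bytes, about 96 million GiB. No machine could satisfy that, so the size itself must be garbage rather than a genuine request.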

vicuna commented Jul 15, 2014

Comment author: @mshinwell

Oh dear. Looks like corruption of the OCaml heap. I suspect something has clobbered the header of the block [v]. It might be instructive to find out what did that; here is a suggestion.

  1. Find the value of the pointer "v" when it fails, either by inspecting the disassembly and the registers, or by recompiling the runtime without -O and just printing "v" (using the debug runtime should do this for you).
  2. Add a watchpoint on ((uint64_t*)v)[-1]. (When I do this on x86, I find that I need to run the program a little, e.g. up to [caml_start_program], before adding the watchpoint since gdb tends to hang otherwise.)
  3. Use the "commands" command in gdb to print a stack trace and continue every time the location is changed, since it might happen a lot. E.g. if the watchpoint is breakpoint 1:
    (gdb) commands 1
    bt
    c
    end
  4. Run it and see what happens :)
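
For context on step 2: in the OCaml 4.x runtime on 64-bit targets, the word just before a block's first field is its header, which packs the size in words, two GC colour bits, and an 8-bit tag. The sketch below mirrors that layout with made-up names (the runtime's own macros are Wosize_hd and Tag_hd). The 256-word limit for minor-heap blocks is my understanding of the runtime's Max_young_wosize, so treat that constant as an assumption.

```c
#include <stdint.h>

/* Illustrative re-implementation of the 64-bit OCaml 4.x header layout:
   bits 63..10 = size in words, bits 9..8 = GC colour, bits 7..0 = tag.
   Names are hypothetical; the runtime uses Wosize_hd / Tag_hd. */
static uint64_t hd_wosize(uint64_t hd) { return hd >> 10; }
static unsigned hd_tag(uint64_t hd)    { return (unsigned)(hd & 0xFF); }

/* The header lives one word before the first field, which is why the
   watchpoint goes on ((uint64_t *)v)[-1]. */
static uint64_t header_of(const uint64_t *v) { return v[-1]; }

/* Assumed limit: blocks allocated in the minor heap are at most 256 words,
   so a larger size seen by the minor GC points to a clobbered header. */
#define ASSUMED_MAX_YOUNG_WOSIZE 256u

static int young_header_is_plausible(uint64_t hd) {
    return hd_wosize(hd) <= ASSUMED_MAX_YOUNG_WOSIZE;
}
```

Under these assumptions, a header whose size field decodes to 12926428316172288 words fails the plausibility check by a wide margin, which is consistent with the header having been overwritten.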

I'm also wondering if this is actually a duplicate of 6484, although I see that one of these is arm32 and the other arm64.

Also, is it usually the case that backtraces on aarch64 stop at "caml_call_gc"? If so, that appears to be another bug.

Is it possible to have this machine connected to a public IP address?

vicuna commented Jul 15, 2014

Comment author: Richard Jones

git bisect is rather unhelpful:

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
0cba565
9639370
2633ff7
452390e
979fe8b
9c1d005
29b3443
95d98cd
558f40e
We cannot bisect more!

Sorry, no can do re public IP. This machine is under a strict NDA.

vicuna commented Jul 15, 2014

Comment author: @mshinwell

One other thing to try: run the offending command in a loop, with varying minor heap sizes (set via OCAMLRUNPARAM) across a large range--say 256k to 100Mb. Use the debug runtime. With any luck, the behaviour will change, and it might trip an assertion earlier.
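
The sweep suggested above could be driven by a small program like the following. This is a sketch under stated assumptions: the helper names are invented, the command is run through the shell via system(), and the sizes are passed in words because that is the unit of the `s` parameter in OCAMLRUNPARAM.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: prefix `cmd` with an OCAMLRUNPARAM setting that
   requests a minor heap of `words` words. Returns the snprintf result. */
static int build_command(char *buf, size_t len, unsigned long words,
                         const char *cmd) {
    return snprintf(buf, len, "OCAMLRUNPARAM=s=%lu %s", words, cmd);
}

/* Run `cmd` once per minor heap size, doubling from `lo` to `hi` words,
   and print the raw status returned by system() so that any change in
   behaviour stands out. Returns the number of runs attempted. */
static int sweep_minor_heap(const char *cmd, unsigned long lo,
                            unsigned long hi) {
    char line[4096];
    int runs = 0;
    if (lo == 0 || lo > hi) return 0;
    for (unsigned long words = lo;; words *= 2) {
        int n = build_command(line, sizeof line, words, cmd);
        if (n < 0 || (size_t)n >= sizeof line) break; /* command too long */
        printf("s=%lu -> status %d\n", words, system(line));
        runs++;
        if (words > hi / 2) break; /* next doubling would pass `hi` */
    }
    return runs;
}
```

One would pass the failing ocamlopt.opt command line from the report as `cmd`, with `lo` at 262144 words and `hi` as large as is practical, and look for the first size at which the status or the error message changes.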

vicuna commented Jul 15, 2014

Comment author: Richard Jones

I have what I believe is a correct git-bisect result:

558f40e is the first bad commit
commit 558f40e
Author: Xavier Leroy <xavier.leroy@inria.fr>
Date: Sat Apr 26 10:40:22 2014 +0000

New back-end optimization pass: common subexpression elimination (CSE).
(Reuses results of previous computations instead of recomputing them.)
(Cherry-picked from branch backend-optim.)
Tested on amd64/linux and i386/linux.
Other back-ends compile (after assorted updates) but are untested.


git-svn-id: http://caml.inria.fr/svn/ocaml/trunk@14688 f963ae5c-01c2-4b8c-9fe0-0dff7051ff02

:100644 100644 9162031faaa1bcb266ddf5e135d982d0f01e0bd6 1d36a9c892753927df5ea95df0b91fe8f8b88299 M .depend
:100644 100644 3f32d5deea45c3d421802545d5f4d90bff0af227 0e4c420099515d8c9c7a370377c65b3c3c334222 M Changes
:100644 100644 594c650e7732ffab4212b353113af583349c05f8 877df08ef0454c5101b8e1cf9462497f6836557f M Makefile
:040000 040000 10cdd050d2d3620a6ebb0fe39285de5a04755fb4 a9f945ea43f901d00b8d0f620e7bfff719749323 M asmcomp
:040000 040000 14fa06cecbd95f1444ba1a3b847a7a991e22ffc1 0daa36b438f0e8565d0a7d52335da2fd5c4258fe M driver
:040000 040000 03a40a7a2bfba24d9ca388c5a1e5f73c2ec76bad 06a94ebfa8ec8ab0b10f00e09ed057695d6e81ef M tools
:040000 040000 8f8503672a267e43160bc6bd3f2ac82dfcdd48f8 ff3627639c6054cb82862ec94f2b591b74c080e4 M utils
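
For readers unfamiliar with the pass named in that commit message, here is what common subexpression elimination does, shown at source level in C. This is a generic illustration with invented function names, not the OCaml back end's code, which applies the transformation to its own intermediate representation.

```c
/* Before CSE: x + y is computed twice. */
static long before_cse(long x, long y, long z, long w) {
    long a = (x + y) * z;
    long b = (x + y) * w;
    return a - b;
}

/* After CSE: the shared subexpression is computed once and reused.
   The rewrite is only valid if nothing changes x or y (or, for loads,
   the memory being read) between the two uses. */
static long after_cse(long x, long y, long z, long w) {
    long t = x + y;
    long a = t * z;
    long b = t * w;
    return a - b;
}
```

Both functions return the same result for any inputs. A correct CSE pass must preserve that equivalence on every target, which is why a pass tested only on amd64 and i386 can still misbehave elsewhere.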

vicuna commented Jul 15, 2014

Comment author: Richard Jones

I am able to fix this by a very drastic approach: turning off CSE entirely. The patch is actually quite small, although obviously not generally applicable. However, at least I now have an idea of what's going wrong (something in asmcomp/arm64/CSE.ml).

diff --git a/asmcomp/CSEgen.ml b/asmcomp/CSEgen.ml
index 19019e1..260e4fa 100644
--- a/asmcomp/CSEgen.ml
+++ b/asmcomp/CSEgen.ml
@@ -180,7 +180,8 @@ method private keep_checkbounds n =
(* Perform CSE on the given instruction [i] and its successors.
[n] is the value numbering current at the beginning of [i]. *)

-method private cse n i =
+method private cse n i = i
+(*
match i.desc with
| Iend | Ireturn | Iop(Itailcall_ind) | Iop(Itailcall_imm _)
| Iexit _ | Iraise _ ->
@@ -262,6 +263,7 @@ method private cse n i =
{i with desc = Itrywith(self#cse n body,
self#cse empty_numbering handler);
next = self#cse empty_numbering i.next}
+*)

method fundecl f =
{f with fun_body = self#cse empty_numbering f.fun_body}

vicuna commented Jul 15, 2014

Comment author: @mshinwell

Does this fix 6484 as well?

vicuna commented Jul 16, 2014

Comment author: Richard Jones

Yes, disabling CSE fixes #6484 as well.

I tried varying the minor heap size and using the debug runtime, but was not able to hit any assertions.

I wasn't able to monitor the heap address for reasons outlined in email.

vicuna commented Jul 18, 2014

Comment author: @xavierleroy

The fix for 6484 (commit 15012 on version/4.02, 15013 on trunk) is likely to fix this one too. Let us know how it is going.

vicuna commented Jul 18, 2014

Comment author: Richard Jones

Thanks Mark, Xavier. I have confirmed this fixes the camlp4 build on arm64.

I will add this to the Fedora compiler and that will give it more exposure and testing (to both 32 and 64 bit ARM) over the next few weeks.
