Mantis Bug Tracker

ID: 0006486
Project: OCaml
Category: OCaml backend (code generation)
View Status: public
Date Submitted: 2014-07-14 15:33
Last Update: 2014-08-11 09:15
Reporter: Richard Jones
Assigned To: shinwell
Priority: high
Severity: crash
Reproducibility: always
Status: resolved
Resolution: fixed
Platform:
OS:
OS Version:
Product Version: 4.02.0+beta1 / +rc1
Target Version: 4.02.0+dev
Fixed in Version:
Summary: 0006486: ocamlopt.opt on aarch64 runs out of memory compiling camlp4
Description: Note this is with ocaml 4.02.0 from git (8c1e5cdf), on aarch64.

Compiling camlp4 from source gives the error:

+ /usr/bin/ocamlopt.opt -c -g -w a -I camlp4/import -warn-error A-3 -I camlp4/config -I camlp4/boot -o camlp4/boot/Camlp4.cmx camlp4/boot/Camlp4.ml
Fatal error: out of memory.
Command exited with code 2.
Makefile:9: recipe for target 'byte' failed
make: *** [byte] Error 10

Note it is highly unlikely this is really running out of memory,
since the machine has 16 GB of RAM. It's also a real aarch64
machine, not emulation.

There is no core dump. Is there a way to get a core dump at
the point where Out_of_memory is raised?
Tags: No tags attached.
Attached Files:

- Relationships
related to 0006484 (resolved, shinwell): ocamlopt.opt on 32 bit arm segfaults compiling ounit 2.0.0

-  Notes
(0011808)
shinwell (developer)
2014-07-14 15:45

Ensure that the compiler is being built with -g, in the toplevel Makefile.
Then just run it under gdb, break on "exit", and you can get a backtrace just before it quits.
(0011815)
Richard Jones (reporter)
2014-07-14 20:34

I gave up trying to bisect this bug as there is no git commit which is "good" on arm64. Back to debugging it the old-fashioned way.

The stack trace from 'exit' is:

#0 0x000003ffb7d68794 in exit () from /lib64/libc.so.6
#1 0x00000000005ed6e8 in caml_fatal_error (
    msg=msg@entry=0x5eee78 "Fatal error: out of memory.\n") at misc.c:55
#2 0x00000000005dd24c in caml_alloc_shr (
    wosize=wosize@entry=12926428316172288, tag=<optimized out>) at memory.c:414
#3 0x00000000005dc3e4 in caml_oldify_one (v=<optimized out>,
    p=<optimized out>) at minor_gc.c:129
#4 0x00000000005da4bc in caml_oldify_local_roots () at roots.c:202
#5 0x00000000005dc5e4 in caml_empty_minor_heap () at minor_gc.c:233
#6 0x00000000005dc71c in caml_minor_collection () at minor_gc.c:276
#7 0x00000000005db5a0 in caml_garbage_collection () at signals_asm.c:70
#8 0x00000000005ecba8 in caml_call_gc ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

That looks like a very big allocation.

(gdb) frame 3
#3 0x00000000005dc3e4 in caml_oldify_one (v=<optimized out>,
    p=<optimized out>) at minor_gc.c:129
129 result = caml_alloc_shr (sz, tag);
(gdb) print sz
$1 = 12926428316172288

Looks like some kind of stack frame corruption to me.
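
To put the number in perspective, here is a small Python sketch of the arithmetic. It assumes 8-byte words (aarch64 is a 64-bit target); the helper name `request_gib` is made up for illustration.

```python
WORD_BYTES = 8  # assumption: 64-bit target, so one OCaml word is 8 bytes

def request_gib(wosize_words):
    """Size in GiB of a block with this many words (header not counted)."""
    return wosize_words * WORD_BYTES / 2**30

wosize = 12926428316172288      # the value gdb printed for sz
print(hex(wosize))              # 0x2dec84a0000000
print(request_gib(wosize))      # 96309396.0
```

That is roughly 96 million GiB requested on a machine with 16 GB of RAM, so the size itself is garbage rather than a genuine large allocation.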
(0011817)
shinwell (developer)
2014-07-15 10:42

Oh dear. Looks like corruption of the OCaml heap. I suspect something has clobbered the header of the block [v]. It might be instructive to find out what did that; here is a suggestion.
1. Find the value of the pointer "v" when it fails, either by inspecting the disassembly and the registers, or by recompiling the runtime without -O and just printing "v" (using the debug runtime should do this for you).
2. Add a watchpoint on ((uint64_t*)v)[-1]. (When I do this on x86, I find that I need to run the program a little, e.g. up to [caml_start_program], before adding the watchpoint since gdb tends to hang otherwise.)
3. Use the "commands" command in gdb to print a stack trace and continue every time the location is changed, since it might happen a lot. E.g. if the watchpoint is breakpoint 1:
(gdb) commands 1
bt
c
end
4. Run it and see what happens :)
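
For context on why the word at ((uint64_t*)v)[-1] is the one to watch: on a 64-bit OCaml 4.x runtime the block header packs the tag into bits 0-7, the GC colour into bits 8-9, and the size in words into the remaining 54 bits. The following Python sketch only models that layout; the function names are illustrative and are not runtime APIs.

```python
def make_header(wosize, color, tag):
    """Pack a 64-bit block header: wosize | colour (2 bits) | tag (8 bits)."""
    return (wosize << 10) | (color << 8) | tag

def decode_header(hd):
    """Split a header word back into its three fields."""
    return {"wosize": hd >> 10, "color": (hd >> 8) & 0x3, "tag": hd & 0xFF}

# A sane header: a 3-field block with tag 0.
print(decode_header(make_header(3, 0, 0)))

# Any word whose upper 54 bits hold the value seen in gdb decodes to that
# size, whatever the colour and tag bits are (those were not recovered here).
bogus = make_header(12926428316172288, 0, 0)
print(hex(bogus), decode_header(bogus)["wosize"])
```

If something overwrites the header with an arbitrary 64-bit value, the minor GC reads the top 54 bits as the block size, which would explain a request like the one in the backtrace.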

I'm also wondering if this is actually a duplicate of 6484, although I see that one of these is on arm32 and the other on arm64.

Also, is it usually the case that backtraces on aarch64 stop at "caml_call_gc"? If so, that appears to be another bug.

Is it possible to have this machine connected to a public IP address?
(0011820)
Richard Jones (reporter)
2014-07-15 11:07

git bisect is rather unhelpful:

There are only 'skip'ped commits left to test.
The first bad commit could be any of:
0cba565437e617cc5826cad64f4a1212e00fc1ae
9639370d40a4dd5d880b4aba4dd570c8c8b7b343
2633ff77ced16e223678b46d07067e541b233687
452390e0eadaafe92ff9d2c9d008035dfdb878f9
979fe8b8adb7a8d8c824e277b9cb7ca1c1cc9a77
9c1d005ebb21b9eff2804ac4d80450251ffe6b5a
29b34438e08e26ae8f8623eb32bb524386f0532f
95d98cd9782c0577b0c7290f6535b29e7bd4cd41
558f40e3446854913d5ce011441c4b10da03f27e
We cannot bisect more!
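
A result like this typically appears when many commits in the range cannot be tested at all and get skipped. As a hedged sketch (not the script actually used here), a step script for `git bisect run` could separate "does not build" from "builds but miscompiles" using the exit codes git understands: 0 for good, 125 for skip, and any other code from 1 to 127 for bad. The build and test commands are placeholders to be supplied by the caller.

```python
import subprocess

GOOD, BAD, SKIP = 0, 1, 125  # exit codes interpreted by `git bisect run`

def verdict(build_ok, test_ok):
    """Map the outcome of one bisect step to a git bisect exit code."""
    if not build_ok:
        return SKIP          # cannot judge this commit, let git pick another
    return GOOD if test_ok else BAD

def succeeded(cmd):
    """Run a shell command and report whether it exited with status 0."""
    return subprocess.run(cmd, shell=True).returncode == 0

def bisect_step(build_cmd, test_cmd):
    """Build the compiler, then run the reproducer only if the build worked."""
    build_ok = succeeded(build_cmd)
    test_ok = build_ok and succeeded(test_cmd)
    return verdict(build_ok, test_ok)

# Usage (hypothetical file name): end the script with
#   sys.exit(bisect_step(BUILD_CMD, TEST_CMD))
# and start it with `git bisect run python3 bisect_step.py`.
```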

Sorry, no can do re public IP. This machine is under a strict NDA.
(0011821)
shinwell (developer)
2014-07-15 11:36

One other thing to try: run the offending command in a loop, with varying minor heap sizes (set via OCAMLRUNPARAM) across a large range--say 256k to 100Mb. Use the debug runtime. With any luck, the behaviour will change, and it might trip an assertion earlier.
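
A minimal Python sketch of such a loop follows. It assumes the range is meant in words, since the `s` parameter of OCAMLRUNPARAM sets the minor heap size in words, and it doubles the size at each step; both choices are assumptions, as is every function name. The command to run is whatever fails, for example the ocamlopt.opt invocation from the description.

```python
import os
import subprocess

def minor_heap_sizes(lo_words=256 * 1024, hi_words=100 * 1024 * 1024):
    """Yield minor heap sizes in words, doubling from lo_words up to hi_words."""
    size = lo_words
    while size <= hi_words:
        yield size
        size *= 2

def runparam(words):
    """OCAMLRUNPARAM value selecting a minor heap of `words` words."""
    return f"s={words}"

def run_once(cmd, words):
    """Run cmd (a list of arguments) with the given minor heap size.

    Note: this replaces any OCAMLRUNPARAM already set in the environment.
    """
    env = dict(os.environ, OCAMLRUNPARAM=runparam(words))
    return subprocess.run(cmd, env=env).returncode

def sweep(cmd):
    """Return {minor heap size: exit code} over the whole range."""
    return {words: run_once(cmd, words) for words in minor_heap_sizes()}
```

Comparing the exit codes (and any assertion messages from the debug runtime) across sizes shows whether the failure depends on when minor collections happen.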
(0011822)
Richard Jones (reporter)
2014-07-15 12:30

I have what I believe is a correct git-bisect result:

558f40e3446854913d5ce011441c4b10da03f27e is the first bad commit
commit 558f40e3446854913d5ce011441c4b10da03f27e
Author: Xavier Leroy <xavier.leroy@inria.fr>
Date: Sat Apr 26 10:40:22 2014 +0000

    New back-end optimization pass: common subexpression elimination (CSE).
    (Reuses results of previous computations instead of recomputing them.)
    (Cherry-picked from branch backend-optim.)
    Tested on amd64/linux and i386/linux.
    Other back-ends compile (after assorted updates) but are untested.
    
    
    git-svn-id: http://caml.inria.fr/svn/ocaml/trunk@14688 f963ae5c-01c2-4b8c-9fe0-0dff7051ff02

:100644 100644 9162031faaa1bcb266ddf5e135d982d0f01e0bd6 1d36a9c892753927df5ea95df0b91fe8f8b88299 M .depend
:100644 100644 3f32d5deea45c3d421802545d5f4d90bff0af227 0e4c420099515d8c9c7a370377c65b3c3c334222 M Changes
:100644 100644 594c650e7732ffab4212b353113af583349c05f8 877df08ef0454c5101b8e1cf9462497f6836557f M Makefile
:040000 040000 10cdd050d2d3620a6ebb0fe39285de5a04755fb4 a9f945ea43f901d00b8d0f620e7bfff719749323 M asmcomp
:040000 040000 14fa06cecbd95f1444ba1a3b847a7a991e22ffc1 0daa36b438f0e8565d0a7d52335da2fd5c4258fe M driver
:040000 040000 03a40a7a2bfba24d9ca388c5a1e5f73c2ec76bad 06a94ebfa8ec8ab0b10f00e09ed057695d6e81ef M tools
:040000 040000 8f8503672a267e43160bc6bd3f2ac82dfcdd48f8 ff3627639c6054cb82862ec94f2b591b74c080e4 M utils
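
For readers unfamiliar with the pass named in the commit message, here is a toy Python sketch of common subexpression elimination on one straight-line block. It is not the algorithm in asmcomp/CSEgen.ml; the tuple format, the "move" pseudo-operation and the name `local_cse` are invented for illustration, and it assumes every destination is written exactly once.

```python
def local_cse(block):
    """Toy local CSE over a list of (dest, op, arg1, arg2) tuples.

    Assumption: each dest is assigned once, so a remembered result is
    never stale and no invalidation is needed.
    """
    available = {}   # (op, arg1, arg2) -> name already holding that result
    rewritten = []
    for dest, op, arg1, arg2 in block:
        key = (op, arg1, arg2)
        if key in available:
            # Same computation seen earlier: reuse it instead of recomputing.
            rewritten.append((dest, "move", available[key], None))
        else:
            available[key] = dest
            rewritten.append((dest, op, arg1, arg2))
    return rewritten

block = [("t1", "add", "x", "y"),
         ("t2", "mul", "t1", "z"),
         ("t3", "add", "x", "y")]   # recomputes x + y
print(local_cse(block))
```

The pass is only sound when the remembered result is still valid at the point of reuse. A back-end that describes some instruction incorrectly to such a pass would produce silently wrong code, which is consistent with, though not proof of, a target-specific bug appearing with this commit.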
(0011823)
Richard Jones (reporter)
2014-07-15 13:08

I am able to fix this by a very drastic approach: Turning off CSE entirely. The patch is actually quite small although obviously not generally applicable. However at least I have an idea what's going wrong now (something in asmcomp/arm64/CSE.ml).

diff --git a/asmcomp/CSEgen.ml b/asmcomp/CSEgen.ml
index 19019e1..260e4fa 100644
--- a/asmcomp/CSEgen.ml
+++ b/asmcomp/CSEgen.ml
@@ -180,7 +180,8 @@ method private keep_checkbounds n =
 (* Perform CSE on the given instruction [i] and its successors.
    [n] is the value numbering current at the beginning of [i]. *)
 
-method private cse n i =
+method private cse n i = i
+(*
   match i.desc with
   | Iend | Ireturn | Iop(Itailcall_ind) | Iop(Itailcall_imm _)
   | Iexit _ | Iraise _ ->
@@ -262,6 +263,7 @@ method private cse n i =
       {i with desc = Itrywith(self#cse n body,
                               self#cse empty_numbering handler);
               next = self#cse empty_numbering i.next}
+*)
 
 method fundecl f =
   {f with fun_body = self#cse empty_numbering f.fun_body}
(0011824)
shinwell (developer)
2014-07-15 13:33

Does this fix 6484 as well?
(0011840)
Richard Jones (reporter)
2014-07-16 13:30

Yes, disabling CSE fixes 0006484 as well.

I tried varying the minor heap size and using the debug runtime, but was not able to hit any assertions.

I wasn't able to monitor the heap address for reasons outlined in email.
(0011885)
xleroy (administrator)
2014-07-18 16:15

The fix for 6484 (commit 15012 on version/4.02, 15013 on trunk) is likely to fix this one too. Let us know how it is going.
(0011886)
Richard Jones (reporter)
2014-07-18 16:25

Thanks Mark, Xavier. I have confirmed this fixes the camlp4 build on arm64.

I will add this to the Fedora compiler and that will give it more exposure and testing (to both 32 and 64 bit ARM) over the next few weeks.

- Issue History
Date Modified Username Field Change
2014-07-14 15:33 Richard Jones New Issue
2014-07-14 15:45 shinwell Note Added: 0011808
2014-07-14 17:16 shinwell Status new => acknowledged
2014-07-14 20:34 Richard Jones Note Added: 0011815
2014-07-15 10:42 shinwell Note Added: 0011817
2014-07-15 10:42 shinwell Assigned To => shinwell
2014-07-15 10:42 shinwell Status acknowledged => assigned
2014-07-15 11:07 Richard Jones Note Added: 0011820
2014-07-15 11:36 shinwell Note Added: 0011821
2014-07-15 12:30 Richard Jones Note Added: 0011822
2014-07-15 13:08 Richard Jones Note Added: 0011823
2014-07-15 13:33 shinwell Note Added: 0011824
2014-07-16 13:30 Richard Jones Note Added: 0011840
2014-07-16 16:38 doligez Priority normal => high
2014-07-16 16:38 doligez Target Version => 4.02.0+dev
2014-07-18 16:04 xleroy Relationship added related to 0006484
2014-07-18 16:15 xleroy Note Added: 0011885
2014-07-18 16:25 Richard Jones Note Added: 0011886
2014-07-18 16:57 shinwell Status assigned => resolved
2014-07-18 16:57 shinwell Resolution open => fixed

