Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0006507OCamlOCaml backend (code generation)public2014-08-01 16:332014-08-09 00:16
ReporterRichard Jones 
Assigned Toshinwell 
PrioritynormalSeveritycrashReproducibilityalways
StatusresolvedResolutionfixed 
Platformarm64OSOS Version
Product Version4.02.0+beta1 / +rc1 
Target Version4.02.0+devFixed in Version4.02.0+dev 
Summary0006507: arm64 crash when compiling coq
DescriptionAnother one, I'm afraid ...

$ gdb --args bin/coqtop.opt -boot -compile theories/FSets/FSetList
GNU gdb (GDB) Fedora 7.7.90.20140711-11.fc21
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html> [^]
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>. [^]
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>. [^]
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from bin/coqtop.opt...done.
(gdb) run
Starting program: /home/rjones/d/fedora/coq/master/coq-8.4pl4/bin/coqtop.opt -boot -compile theories/FSets/FSetList

Program received signal SIGSEGV, Segmentation fault.
caml_oldify_local_roots () at roots.c:190
190 if (d->retaddr == retaddr) break;
(gdb) bt
#0 caml_oldify_local_roots () at roots.c:190
#1 0x00000000009610dc in caml_empty_minor_heap () at minor_gc.c:233
#2 0x0000000000961214 in caml_minor_collection () at minor_gc.c:276
0000003 0x000000000095f0b0 in caml_garbage_collection () at signals_asm.c:70
0000004 0x0000000000971f38 in caml_call_gc ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) print d
$1 = (frame_descr *) 0x0
(gdb) print/x retaddr
$3 = 0x20002978d0
(gdb) print caml_frame_descriptors
$4 = (frame_descr **) 0xe0a370
(gdb) print h
$5 = 77594

Note this is not a general failure in coqtop.opt. The same program
runs probably hundreds of times during the build without failing, and
if I use `make -k' then it keeps running.

It only fails -- 100% reproducibly -- on this one input file.

Any suggestions how to debug this further?
Steps To ReproduceCompile coq on arm64.
TagsNo tags attached.
Attached Files

- Relationships

-  Notes
(0011966)
Richard Jones (reporter)
2014-08-01 23:39

I was able to reproduce this with OCaml 4.02 branch from today (git commit 10e45753).
(0011971)
shinwell (developer)
2014-08-04 18:19
edited on: 2014-08-04 18:48

Richard, in which function is the return address you quote above? (0x20002978d0)

(Looking at your backtrace, I'm going to guess that it isn't in your executable -- is it in libc?)

(0011973)
Richard Jones (reporter)
2014-08-05 23:32

Different file in the Coq source, same failure:

$ gdb --args ./bin/coqtop.opt -boot -compile plugins/ring/LegacyArithRing
(gdb) run
Starting program: /home/rjones/d/coq-8.4pl4/bin/coqtop.opt -boot -compile plugins/ring/LegacyArithRing

Program received signal SIGSEGV, Segmentation fault.
0x000000000096177c in caml_oldify_local_roots () at roots.c:190
190 if (d->retaddr == retaddr) break;
(gdb) bt
#0 0x000000000096177c in caml_oldify_local_roots () at roots.c:190
#1 0x00000000009667a4 in caml_empty_minor_heap () at minor_gc.c:233
#2 0x0000000000966a18 in caml_minor_collection () at minor_gc.c:276
0000003 0x0000000000962efc in caml_garbage_collection () at signals_asm.c:70
0000004 0x000000000098156c in caml_call_gc ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) print d
$1 = (frame_descr *) 0x0
(gdb) print retaddr
$2 = 548543086592
(gdb) print/x retaddr
$3 = 0x7fb7b74000
(gdb) disassemble *0x7fb7b74000
No function contains specified address.

Looking at /proc/PID/maps of the process, I can see that the pointer
appears to lie inside anonymous malloc'd memory:

7fb7770000-7fb7d76000 rw-p 00000000 00:00 0
(0011974)
Richard Jones (reporter)
2014-08-05 23:36

(gdb) hexdump (0x7fb7b74000-0x100) 0x200
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000100 00 00 00 00 00 00 00 00 00 5f ad b6 7f 00 00 00 |........._......|
00000110 48 98 f6 af 7f 00 00 00 c8 40 b7 b7 7f 00 00 00 |H........@......|
00000120 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 |................|
00000130 00 08 00 00 00 00 00 00 10 41 b7 b7 7f 00 00 00 |.........A......|
00000140 d0 7e f6 af 7f 00 00 00 00 08 00 00 00 00 00 00 |.~..............|
00000150 38 40 b7 b7 7f 00 00 00 38 41 b7 b7 7f 00 00 00 |8@......8A......|
00000160 00 08 00 00 00 00 00 00 b8 4d b7 b7 7f 00 00 00 |.........M......|
00000170 b8 7e f6 af 7f 00 00 00 00 08 00 00 00 00 00 00 |.~..............|
00000180 68 40 b7 b7 7f 00 00 00 50 40 b7 b7 7f 00 00 00 |h@......P@......|
00000190 00 08 00 00 00 00 00 00 e8 4d b7 b7 7f 00 00 00 |.........M......|
000001a0 a0 7e f6 af 7f 00 00 00 00 08 00 00 00 00 00 00 |.~..............|
000001b0 98 40 b7 b7 7f 00 00 00 80 40 b7 b7 7f 00 00 00 |.@.......@......|
000001c0 f7 0c 00 00 00 00 00 00 d8 77 f4 af 7f 00 00 00 |.........w......|
000001d0 03 00 00 00 00 00 00 00 b0 40 b7 b7 7f 00 00 00 |.........@......|
000001e0 f7 10 00 00 00 00 00 00 f0 4b f4 af 7f 00 00 00 |.........K......|
000001f0 03 00 00 00 00 00 00 00 48 96 f6 af 7f 00 00 00 |........H.......|
00000200
(0011975)
shinwell (developer)
2014-08-06 10:32

Richard, could you please apply this patch to your 4.02 trunk compiler, and rebuild? Then please paste the lines starting "roots:" that arise from executing coqtop.opt on one of the two offending input files.

Index: asmrun/roots.c
===================================================================
--- asmrun/roots.c (revision 15041)
+++ asmrun/roots.c (working copy)
@@ -181,16 +181,20 @@
   sp = caml_bottom_of_stack;
   retaddr = caml_last_return_address;
   regs = caml_gc_regs;
+ fprintf(stderr, "roots: starting new cycle\n");
   if (sp != NULL) {
     while (1) {
       /* Find the descriptor corresponding to the return address */
+ fprintf(stderr, "roots: sp=%p retaddr=%p\n", sp, (void*) retaddr);
       h = Hash_retaddr(retaddr);
       while(1) {
         d = caml_frame_descriptors[h];
         if (d->retaddr == retaddr) break;
         h = (h+1) & caml_frame_descriptors_mask;
       }
+ fprintf(stderr, "roots: d=%p ", d);
       if (d->frame_size != 0xFFFF) {
+ fprintf(stderr, "frame_size=%ul num_live=%d\n", d->frame_size, d->num_live);
         /* Scan the roots in this frame */
         for (p = d->live_ofs, n = d->num_live; n > 0; n--, p++) {
           ofs = *p;
@@ -218,6 +222,7 @@
         /* This marks the top of a stack chunk for an ML callback.
            Skip C portion of stack and continue with next ML stack chunk. */
         struct caml_context * next_context = Callback_link(sp);
+ fprintf(stderr, "skipping C portion of stack\n");
         sp = next_context->bottom_of_stack;
         retaddr = next_context->last_retaddr;
         regs = next_context->gc_regs;
@@ -226,6 +231,7 @@
       }
     }
   }
+ fprintf(stderr, "roots: finished cycle\n");
   /* Local C roots */
   for (lr = caml_local_roots; lr != NULL; lr = lr->next) {
     for (i = 0; i < lr->ntables; i++){
(0011976)
shinwell (developer)
2014-08-06 10:38
edited on: 2014-08-06 10:39

Sorry, and also: having got those lines of output, please decode the "retaddr" addresses. A quick way to decode such an address should be:

gdb /my/executable -ex "inf sym 0x12345678" -ex "quit"

(0011996)
shinwell (developer)
2014-08-09 00:12
edited on: 2014-08-09 00:13

Fixed in trunk and 4.02.

[caml_bottom_of_stack] was 16 bytes too low only when the GC was entered via the assembly [caml_alloc] veneers. This occurred as a result of the stack pointer being recorded after the two spills in those veneers. The upshot was that the return address of the previous frame was being read from the wrong stack slot.

Ugh.

(Richard Jones kindly compiled Coq on a real aarch64 machine, and ran the tests, with an earlier version of this patch that I am confident is semantically identical to the one committed.)

(0011998)
shinwell (developer)
2014-08-09 00:16

Oh, and, I wonder if we can optimize those caml_alloc sequences by not saving x29 prior to Lcaml_call_gc. We can think about that later, after 4.02.

- Issue History
Date Modified Username Field Change
2014-08-01 16:33 Richard Jones New Issue
2014-08-01 23:39 Richard Jones Note Added: 0011966
2014-08-04 12:57 shinwell Assigned To => shinwell
2014-08-04 12:57 shinwell Status new => assigned
2014-08-04 18:19 shinwell Note Added: 0011971
2014-08-04 18:48 shinwell Note Edited: 0011971 View Revisions
2014-08-05 23:32 Richard Jones Note Added: 0011973
2014-08-05 23:36 Richard Jones Note Added: 0011974
2014-08-06 10:32 shinwell Note Added: 0011975
2014-08-06 10:38 shinwell Note Added: 0011976
2014-08-06 10:39 shinwell Note Edited: 0011976 View Revisions
2014-08-07 15:54 shinwell Target Version => 4.02.0+dev
2014-08-09 00:12 shinwell Note Added: 0011996
2014-08-09 00:12 shinwell Status assigned => resolved
2014-08-09 00:12 shinwell Fixed in Version => 4.02.0+dev
2014-08-09 00:12 shinwell Resolution open => fixed
2014-08-09 00:13 shinwell Note Edited: 0011996 View Revisions
2014-08-09 00:13 shinwell Note Edited: 0011996 View Revisions
2014-08-09 00:16 shinwell Note Added: 0011998


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker