Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arm64 crash when compiling coq #6507

Closed
vicuna opened this issue Aug 1, 2014 · 8 comments
Closed

arm64 crash when compiling coq #6507

vicuna opened this issue Aug 1, 2014 · 8 comments
Assignees
Milestone

Comments

@vicuna
Copy link

vicuna commented Aug 1, 2014

Original bug ID: 6507
Reporter: Richard Jones
Assigned to: @mshinwell
Status: closed (set by @xavierleroy on 2016-12-07T10:34:26Z)
Resolution: fixed
Priority: normal
Severity: crash
Platform: arm64
Version: 4.02.0+beta1 / +rc1
Target version: 4.02.0+dev
Fixed in version: 4.02.0+dev
Category: back end (clambda to assembly)
Monitored by: @gasche @mshinwell

Bug description

Another one, I'm afraid ...

$ gdb --args bin/coqtop.opt -boot -compile theories/FSets/FSetList
GNU gdb (GDB) Fedora 7.7.90.20140711-11.fc21
Copyright (C) 2014 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "aarch64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from bin/coqtop.opt...done.
(gdb) run
Starting program: /home/rjones/d/fedora/coq/master/coq-8.4pl4/bin/coqtop.opt -boot -compile theories/FSets/FSetList

Program received signal SIGSEGV, Segmentation fault.
caml_oldify_local_roots () at roots.c:190
190 if (d->retaddr == retaddr) break;
(gdb) bt
#0 caml_oldify_local_roots () at roots.c:190
#1 0x00000000009610dc in caml_empty_minor_heap () at minor_gc.c:233
#2 0x0000000000961214 in caml_minor_collection () at minor_gc.c:276
#3 0x000000000095f0b0 in caml_garbage_collection () at signals_asm.c:70
#4 0x0000000000971f38 in caml_call_gc ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) print d
$1 = (frame_descr *) 0x0
(gdb) print/x retaddr
$3 = 0x20002978d0
(gdb) print caml_frame_descriptors
$4 = (frame_descr **) 0xe0a370
(gdb) print h
$5 = 77594

Note this is not a general failure in coqtop.opt. The same program
runs probably hundreds of times during the build without failing, and
if I use `make -k' then it keeps running.

It only fails -- 100% reproducibly -- on this one input file.

Any suggestions how to debug this further?

Steps to reproduce

Compile coq on arm64.

@vicuna
Copy link
Author

vicuna commented Aug 1, 2014

Comment author: Richard Jones

I was able to reproduce this with OCaml 4.02 branch from today (git commit 10e4575).

@vicuna
Copy link
Author

vicuna commented Aug 4, 2014

Comment author: @mshinwell

Richard, in which function is the return address you quote above? (0x20002978d0)

(Looking at your backtrace, I'm going to guess that it isn't in your executable -- is it in libc?)

@vicuna
Copy link
Author

vicuna commented Aug 5, 2014

Comment author: Richard Jones

Different file in the Coq source, same failure:

$ gdb --args ./bin/coqtop.opt -boot -compile plugins/ring/LegacyArithRing
(gdb) run
Starting program: /home/rjones/d/coq-8.4pl4/bin/coqtop.opt -boot -compile plugins/ring/LegacyArithRing

Program received signal SIGSEGV, Segmentation fault.
0x000000000096177c in caml_oldify_local_roots () at roots.c:190
190 if (d->retaddr == retaddr) break;
(gdb) bt
#0 0x000000000096177c in caml_oldify_local_roots () at roots.c:190
#1 0x00000000009667a4 in caml_empty_minor_heap () at minor_gc.c:233
#2 0x0000000000966a18 in caml_minor_collection () at minor_gc.c:276
#3 0x0000000000962efc in caml_garbage_collection () at signals_asm.c:70
#4 0x000000000098156c in caml_call_gc ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)
(gdb) print d
$1 = (frame_descr *) 0x0
(gdb) print retaddr
$2 = 548543086592
(gdb) print/x retaddr
$3 = 0x7fb7b74000
(gdb) disassemble *0x7fb7b74000
No function contains specified address.

Looking at /proc/PID/maps of the process, I can see that the pointer
appears to lie inside anonymous malloc'd memory:

7fb7770000-7fb7d76000 rw-p 00000000 00:00 0

@vicuna
Copy link
Author

vicuna commented Aug 5, 2014

Comment author: Richard Jones

(gdb) hexdump (0x7fb7b74000-0x100) 0x200
00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00000100 00 00 00 00 00 00 00 00 00 5f ad b6 7f 00 00 00 |........._......|
00000110 48 98 f6 af 7f 00 00 00 c8 40 b7 b7 7f 00 00 00 |H........@......|
00000120 01 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 |................|
00000130 00 08 00 00 00 00 00 00 10 41 b7 b7 7f 00 00 00 |.........A......|
00000140 d0 7e f6 af 7f 00 00 00 00 08 00 00 00 00 00 00 |...............|
00000150 38 40 b7 b7 7f 00 00 00 38 41 b7 b7 7f 00 00 00 |8@......8A......|
00000160 00 08 00 00 00 00 00 00 b8 4d b7 b7 7f 00 00 00 |.........M......|
00000170 b8 7e f6 af 7f 00 00 00 00 08 00 00 00 00 00 00 |.
..............|
00000180 68 40 b7 b7 7f 00 00 00 50 40 b7 b7 7f 00 00 00 |h@......P@......|
00000190 00 08 00 00 00 00 00 00 e8 4d b7 b7 7f 00 00 00 |.........M......|
000001a0 a0 7e f6 af 7f 00 00 00 00 08 00 00 00 00 00 00 |.~..............|
000001b0 98 40 b7 b7 7f 00 00 00 80 40 b7 b7 7f 00 00 00 |.@.......@......|
000001c0 f7 0c 00 00 00 00 00 00 d8 77 f4 af 7f 00 00 00 |.........w......|
000001d0 03 00 00 00 00 00 00 00 b0 40 b7 b7 7f 00 00 00 |.........@......|
000001e0 f7 10 00 00 00 00 00 00 f0 4b f4 af 7f 00 00 00 |.........K......|
000001f0 03 00 00 00 00 00 00 00 48 96 f6 af 7f 00 00 00 |........H.......|
00000200

@vicuna
Copy link
Author

vicuna commented Aug 6, 2014

Comment author: @mshinwell

Richard, could you please apply this patch to your 4.02 trunk compiler, and rebuild? Then please paste the lines starting "roots:" that arise from executing coqtop.opt on one of the two offending input files.

Index: asmrun/roots.c

--- asmrun/roots.c (revision 15041)
+++ asmrun/roots.c (working copy)
@@ -181,16 +181,20 @@
sp = caml_bottom_of_stack;
retaddr = caml_last_return_address;
regs = caml_gc_regs;

  • fprintf(stderr, "roots: starting new cycle\n");
    if (sp != NULL) {
    while (1) {
    /* Find the descriptor corresponding to the return address */
  •  fprintf(stderr, "roots: sp=%p retaddr=%p\n", sp, (void*) retaddr);
     h = Hash_retaddr(retaddr);
     while(1) {
       d = caml_frame_descriptors[h];
       if (d->retaddr == retaddr) break;
       h = (h+1) & caml_frame_descriptors_mask;
     }
    
  •  fprintf(stderr, "roots: d=%p ", d);
     if (d->frame_size != 0xFFFF) {
    
  •    fprintf(stderr, "frame_size=%ul num_live=%d\n", d->frame_size, d->num_live);
       /* Scan the roots in this frame */
       for (p = d->live_ofs, n = d->num_live; n > 0; n--, p++) {
         ofs = *p;
    

@@ -218,6 +222,7 @@
/* This marks the top of a stack chunk for an ML callback.
Skip C portion of stack and continue with next ML stack chunk. */
struct caml_context * next_context = Callback_link(sp);

  •    fprintf(stderr, "skipping C portion of stack\n");
       sp = next_context->bottom_of_stack;
       retaddr = next_context->last_retaddr;
       regs = next_context->gc_regs;
    

@@ -226,6 +231,7 @@
}
}
}

  • fprintf(stderr, "roots: finished cycle\n");
    /* Local C roots */
    for (lr = caml_local_roots; lr != NULL; lr = lr->next) {
    for (i = 0; i < lr->ntables; i++){

@vicuna
Copy link
Author

vicuna commented Aug 6, 2014

Comment author: @mshinwell

Sorry, and also: having got those lines of output, please decode the "retaddr" addresses. A quick way to decode such an address should be:

gdb /my/executable -ex "inf sym 0x12345678" -ex "quit"

@vicuna
Copy link
Author

vicuna commented Aug 8, 2014

Comment author: @mshinwell

Fixed in trunk and 4.02.

[caml_bottom_of_stack] was 16 bytes too low only when the GC was entered via the assembly [caml_alloc] veneers. This occurred as a result of the stack pointer being recorded after the two spills in those veneers. The upshot was that the return address of the previous frame was being read from the wrong stack slot.

Ugh.

(Richard Jones kindly compiled Coq on a real aarch64 machine, and ran the tests, with an earlier version of this patch that I am confident is semantically identical to the one committed.)

@vicuna
Copy link
Author

vicuna commented Aug 8, 2014

Comment author: @mshinwell

Oh, and, I wonder if we can optimize those caml_alloc sequences by not saving x29 prior to Lcaml_call_gc. We can think about that later, after 4.02.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants