MantisBT - OCaml
View Issue Details
0007452OCamlback end (clambda to assembly)public2017-01-06 03:392017-06-09 19:02
enguerrand 
shinwell 
normalcrashsometimes
closedno change required 
LinuxDebian
4.03.0 
later 
0007452: Segfaults or wrong code execution on Intel Skylake / Kaby Lake CPUs with hyperthreading enabled
While switching a 4.02.3 codebase to 4.03 recently, we stumbled upon some random crashes from the compiler, and more rarely, occurrences of bad assembly code being generated (which as failed to compile), or instruction being trapped at runtime while the compiler is running.

Those problems occurs on an OCaml source file generated using the Extprot library.

The problem doesn't seems to happen all the time.
Most of the time, the file will compile successfully, and if enough retries are given, the compiler will then crash, example of returns from dmesg after a few crashes:

[22241.838551] ocamlopt.opt[48175]: segfault at ffffffffffde7768 ip 000055f75e412e3c sp 00007ffc3ee31de0 error 7 in ocamlopt.opt[55f75e0b6000+613000]
[22985.879907] ocamlopt.opt[48221]: segfault at af8 ip 00005564455169bd sp 00007ffc9f36b130 error 4 in ocamlopt.opt[556445006000+613000]
[23936.341126] ocamlopt.opt[48306]: segfault at 5837 ip 00005641554a16c8 sp 00007ffe1278f8e0 error 4 in ocamlopt.opt[56415514a000+613000]
[25395.780978] ocamlopt.opt[48445]: segfault at ffffffffffde7608 ip 0000557e25ea5cf4 sp 00007ffc2eac79d0 error 5 in ocamlopt.opt[557e25b49000+613000]

Backtraces obtained for those crashes give us informations which doesn't seems to show always the same thing. Example backtraces can be found in the attached archive.

The compiler will more rarely generated an assembly file that as won't be able to compile:

/tmp/camlasmc92578.s: Assembler messages:
/tmp/camlasmc92578.s:1005308: Error: operand type mismatch for `add'

Where the line 1005308 is: add $2300, $5199

Or:

/tmp/camlasm601e1c.s: Assembler messages:
/tmp/camlasm601e1c.s:820172: Error: operand type mismatch for `or'

Where the line 820172 is: orq $139950828249720, %rax

We haven't noticed as of now any misbehaviour in a successfully compiled and running instance of this file, but the issue is still very new for us so we will be watching it closely.
The problem doesn't seems to happen all the time, at least it doesn't crash at every build. We sometimes don't witness the crash before 30 minutes of retries.

Steps to reproduce:

OCaml 4.03 and 4.04 has been witnessed as triggering the problem.
Sample file is attached as the test case used to reproduce the problem: Extprot library must be installed in order to compile the file, since it was generated using Extprot. (we use the latest version from Opam)

Test case can be found in the attachment (test.ml)

To reproduce:
Just compile this file, preferably in a loop, with this command:

while ocamlfind opt -c -g -bin-annot -ccopt -g -ccopt -O2 -ccopt -Wextra -ccopt '-Wstrict-overflow=5' -thread -w +a-4-40..42-44-45-48-58 -w -27-32 -package extprot test.ml -o test.cmx; do echo "ok"; done

- If the crash doesn't occur for some time, after it occured again at least once, the probability of the compiler crashing seems to be increasing

- Crash was witnessed running ocamlopt and ocamlopt.opt
No tags attached.
? report.tar.xz (86,172) 2017-01-06 03:39
https://caml.inria.fr/mantis/file_download.php?file_id=1658&type=bug
? opt.s (97,082) 2017-01-09 12:35
https://caml.inria.fr/mantis/file_download.php?file_id=1660&type=bug
? unopt.s (100,286) 2017-01-09 12:36
https://caml.inria.fr/mantis/file_download.php?file_id=1661&type=bug
patch opt.patch (1,260) 2017-01-09 12:37
https://caml.inria.fr/mantis/file_download.php?file_id=1662&type=bug
? major_gc_clang3.9_O1.s (51,661) 2017-01-09 12:38
https://caml.inria.fr/mantis/file_download.php?file_id=1663&type=bug
? major_gc_clang3.9_O2.s (58,487) 2017-01-09 12:39
https://caml.inria.fr/mantis/file_download.php?file_id=1664&type=bug
diff test.diff (1,443) 2017-01-10 05:07
https://caml.inria.fr/mantis/file_download.php?file_id=1666&type=bug
patch major_gc_without_cast_o2.patch (31,524) 2017-01-11 10:10
https://caml.inria.fr/mantis/file_download.php?file_id=1668&type=bug
? major_gc_with_intnat_cast_o2.s (38,175) 2017-01-11 10:11
https://caml.inria.fr/mantis/file_download.php?file_id=1669&type=bug
Issue History
2017-01-06 03:39enguerrandNew Issue
2017-01-06 03:39enguerrandFile Added: report.tar.xz
2017-01-06 04:38jorisNote Added: 0017080
2017-01-06 10:46shinwellNote Added: 0017083
2017-01-06 10:46shinwellAssigned To => shinwell
2017-01-06 10:46shinwellStatusnew => assigned
2017-01-06 11:00enguerrandNote Added: 0017084
2017-01-06 11:13shinwellNote Added: 0017085
2017-01-06 11:40enguerrandNote Added: 0017086
2017-01-06 12:03shinwellNote Added: 0017087
2017-01-06 12:16shinwellNote Added: 0017088
2017-01-06 12:22shinwellNote Added: 0017089
2017-01-06 12:28shinwellNote Added: 0017090
2017-01-06 12:30jorisNote Added: 0017091
2017-01-06 12:56jorisNote Added: 0017092
2017-01-06 13:01shinwellNote Added: 0017093
2017-01-06 13:02shinwellNote Added: 0017094
2017-01-06 13:10shinwellNote Added: 0017095
2017-01-06 13:17enguerrandNote Added: 0017096
2017-01-06 13:33shinwellNote Added: 0017097
2017-01-06 14:23shinwellNote Added: 0017098
2017-01-06 14:29frischNote Added: 0017099
2017-01-06 16:15enguerrandNote Added: 0017100
2017-01-06 16:55ArmaelNote Added: 0017101
2017-01-06 16:56ArmaelNote Edited: 0017101bug_revision_view_page.php?bugnote_id=17101#r2832
2017-01-06 17:02ArmaelNote Added: 0017102
2017-01-06 17:31ArmaelNote Added: 0017103
2017-01-06 17:33ArmaelNote Edited: 0017103bug_revision_view_page.php?bugnote_id=17103#r2834
2017-01-06 17:54enguerrandNote Added: 0017104
2017-01-07 11:45jorisNote Added: 0017107
2017-01-07 11:46jorisNote Edited: 0017107bug_revision_view_page.php?bugnote_id=17107#r2836
2017-01-07 11:46jorisNote Edited: 0017107bug_revision_view_page.php?bugnote_id=17107#r2837
2017-01-07 14:36enguerrandNote Added: 0017108
2017-01-07 15:36xleroyNote Added: 0017109
2017-01-08 04:52jorisNote Added: 0017110
2017-01-08 07:25jorisNote Added: 0017111
2017-01-08 10:04xleroyNote Added: 0017112
2017-01-08 10:56cullmannNote Added: 0017114
2017-01-08 12:44jorisNote Added: 0017115
2017-01-08 12:45jorisNote Edited: 0017115bug_revision_view_page.php?bugnote_id=17115#r2839
2017-01-08 12:54jorisNote Edited: 0017115bug_revision_view_page.php?bugnote_id=17115#r2840
2017-01-08 13:59cullmannNote Added: 0017116
2017-01-08 15:14jorisNote Added: 0017117
2017-01-08 15:21cullmannNote Added: 0017118
2017-01-09 08:01shinwellNote Added: 0017121
2017-01-09 09:40shinwellNote Added: 0017122
2017-01-09 09:42shinwellNote Added: 0017123
2017-01-09 10:05shinwellNote Added: 0017124
2017-01-09 10:12enguerrandNote Added: 0017125
2017-01-09 10:59jorisNote Added: 0017126
2017-01-09 11:02cullmannNote Added: 0017127
2017-01-09 11:16shinwellNote Added: 0017128
2017-01-09 12:02jorisNote Added: 0017129
2017-01-09 12:03jorisNote Edited: 0017129bug_revision_view_page.php?bugnote_id=17129#r2842
2017-01-09 12:18jorisNote Added: 0017130
2017-01-09 12:19jorisNote Edited: 0017130bug_revision_view_page.php?bugnote_id=17130#r2844
2017-01-09 12:35jorisFile Added: opt.s
2017-01-09 12:36jorisFile Added: unopt.s
2017-01-09 12:37jorisFile Added: opt.patch
2017-01-09 12:38enguerrandFile Added: major_gc_clang3.9_O1.s
2017-01-09 12:39enguerrandFile Added: major_gc_clang3.9_O2.s
2017-01-09 12:40jorisNote Added: 0017131
2017-01-09 12:42enguerrandNote Added: 0017132
2017-01-09 13:09shinwellNote Added: 0017133
2017-01-09 17:55cullmannNote Added: 0017135
2017-01-10 05:07jorisFile Added: test.diff
2017-01-10 05:09jorisNote Added: 0017138
2017-01-10 11:36jorisNote Added: 0017140
2017-01-10 11:36jorisNote Edited: 0017140bug_revision_view_page.php?bugnote_id=17140#r2850
2017-01-10 19:12xleroyNote Added: 0017143
2017-01-11 09:54shinwellNote Added: 0017146
2017-01-11 10:10jorisFile Added: major_gc_without_cast_o2.patch
2017-01-11 10:10frischNote Added: 0017147
2017-01-11 10:11jorisFile Added: major_gc_with_intnat_cast_o2.s
2017-01-11 10:15jorisNote Added: 0017148
2017-01-11 10:37jorisNote Added: 0017150
2017-01-11 12:40schommerNote Added: 0017151
2017-01-12 01:25jorisNote Added: 0017154
2017-01-12 07:58jorisNote Added: 0017155
2017-01-25 15:31shinwellNote Added: 0017191
2017-01-25 16:07xleroyNote Added: 0017192
2017-01-26 10:13xleroyNote Added: 0017196
2017-01-26 10:13xleroyNote Edited: 0017196bug_revision_view_page.php?bugnote_id=17196#r2856
2017-02-19 16:46xleroyTarget Version => later
2017-02-19 16:46xleroyAdditional Information Updatedbug_revision_view_page.php?rev_id=2885#r2885
2017-02-23 16:35doligezCategoryOCaml backend (code generation) => Back end (clambda to assembly)
2017-02-23 16:44doligezCategoryBack end (clambda to assembly) => back end (clambda to assembly)
2017-03-10 11:24shinwellNote Added: 0017615
2017-05-26 20:42ygrekNote Added: 0017815
2017-05-27 03:33jorisNote Added: 0017816
2017-05-29 03:28jorisNote Added: 0017824
2017-05-29 07:54shinwellNote Added: 0017825
2017-05-29 10:27shinwellNote Added: 0017826
2017-05-29 10:55shinwellNote Added: 0017827
2017-05-29 14:48xleroyNote Added: 0017828
2017-05-29 14:52xleroyNote Edited: 0017828bug_revision_view_page.php?bugnote_id=17828#r3018
2017-05-31 10:30shinwellNote Added: 0017834
2017-05-31 14:04xleroyNote Edited: 0017828bug_revision_view_page.php?bugnote_id=17828#r3021
2017-05-31 14:06xleroySummaryCompiler segfault on large generated source code => Compiler segfault on Intel Skylake / Kaby Lake processors with hyperthreading enabled [was: on large generated source code]
2017-06-02 03:10jorisNote Added: 0017837
2017-06-02 03:11jorisNote Edited: 0017837bug_revision_view_page.php?bugnote_id=17837#r3023
2017-06-02 11:27shinwellSummaryCompiler segfault on Intel Skylake / Kaby Lake processors with hyperthreading enabled [was: on large generated source code] => Segfaults or wrong code execution on Intel Skylake / Kaby Lake CPUs with hyperthreading enabled
2017-06-02 11:28shinwellNote Added: 0017838
2017-06-09 19:02shinwellNote Added: 0017863
2017-06-09 19:02shinwellStatusassigned => closed
2017-06-09 19:02shinwellResolutionopen => no change required

Notes
(0017080)
joris   
2017-01-06 04:38   
Traces are almost useless, but the memory corruption happens as frequently in byte code, with similar traces, making the runtime crash in GC or in runtime:

corrupted stack:

 Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055555557d336 in caml_interprete (prog=<optimized out>,
    prog_size=<optimized out>) at interp.c:298
298 accu = sp[0]; Next;
(gdb) bt
#0 0x000055555557d336 in caml_interprete (prog=<optimized out>,
    prog_size=<optimized out>) at interp.c:298
Backtrace stopped: Cannot access memory at address 0x7fffffffda68

Corrupted heap:

#0 0x000055687aa72534 in mark_slice_darken (slice_pointers=<synthetic pointer>,
    in_ephemeron=0, i=11803660, v=140605172023200, gray_vals_ptr=0x55687c7fead8)
    at major_gc.c:237
#1 mark_slice (work=853950) at major_gc.c:407
#2 0x000055687aa73384 in caml_major_collection_slice (howmuch=howmuch@entry=-1)
    at major_gc.c:767
0000003 0x000055687aa743ff in caml_gc_dispatch () at minor_gc.c:458
0000004 0x000055687aa8b807 in caml_interprete (prog=<optimized out>,
    prog_size=<optimized out>) at interp.c:673
0000005 0x000055687aa8cdf9 in caml_main (argv=0x7ffe02927368) at startup.c:374
0000006 0x000055687aa71b1c in main (argc=<optimized out>, argv=<optimized out>)
    at main.c:35

this looks like a buffer overrun while walking a large block, but the block is strange as far as i can tell:

(gdb) p (unsigned char)*(((uint64_t *)v) - 1)
$40 = 200 '\310'
(gdb) p *(((uint64_t *)v) - 1) >> 10
$45 = 137309738338
(0017083)
shinwell   
2017-01-06 10:46   
I will see if I can reproduce this.

In the meantime, have you reproduced this failure on more than one machine?
(0017084)
enguerrand   
2017-01-06 11:00   
Reproduced it on a few differents machines (all running Debian 64bits), either VM or physical.
Still haven't tried under something else than Linux, I will check if I can reproduce under OSX
(0017085)
shinwell   
2017-01-06 11:13   
Some subset (although not all) of the symptoms exhibited in this report kind of look like stack overflow. To rule out such problems, can you try to reproduce this having run "ulimit -s unlimited" first? (I'm also trying to reproduce it but don't know yet whether I will be successful in doing so.)
(0017086)
enguerrand   
2017-01-06 11:40   
We suspected stack overflow too and we tried to reproduce with a very large stack size and unlimited, and crashes still happened. I tried again just now just to be sure and the result is still the same
(0017087)
shinwell   
2017-01-06 12:03   
Right, that's what I was expecting.
(0017088)
shinwell   
2017-01-06 12:16   
I wonder if it's because the parameter called "i" of the function mark_slice_darken is of type "int". I think it should be of type "mlsize_t" since it's a field index. I wouldn't be surprised if this gargantuan source file produces a block that has sufficiently many fields for "i" to overflow at the moment.

Can you try changing that in your compiler tree (in 4.04 it's in byterun/major.gc line 232) and seeing if the problem goes away? I haven't managed to reproduce it yet.
(0017089)
shinwell   
2017-01-06 12:22   
(I will produce a GPR once you confirm)
(0017090)
shinwell   
2017-01-06 12:28   
Although actually, if that were to be the bug, I think ocamlopt.opt would use more memory than it does when compiling the file. So that might not be it---but I think it's wrong in any case.
(0017091)
joris   
2017-01-06 12:30   
Indeed. I've launch some test to try but since this is scanning a in memory block and since ocamlopt never use more than 1.5G RES in this case i doubt it can overflow i
(0017092)
joris   
2017-01-06 12:56   
Reproduced with i changed to mlsize_t, crash in ml_mark_slice, but i don't have many more info i forgot to build the runtime with -g this time
(0017093)
shinwell   
2017-01-06 13:01   
Given you seem to be able to reproduce this easily, can you try to get a valgrind trace?
(0017094)
shinwell   
2017-01-06 13:02   
(Please build the runtime with -g before doing that, in case it gives any more info)
(0017095)
shinwell   
2017-01-06 13:10   
In fact, another thing to try: please adjust the compiler Makefile so that it uses the debug runtime (you need a configure flag, and then I think it's "-runtime-variant d" for compile/link). This might pick something up as it enables a lot of assertions.
(0017096)
enguerrand   
2017-01-06 13:17   
I will give Valgrind a bit later.
It's not that easy to reproduce even on our end, we sometimes wait 30 minutes with 5 concurrent loops to see it happen, its kind of erratic
(0017097)
shinwell   
2017-01-06 13:33   
OK, well maybe try the debug runtime first then, since that's probably going to be a bit easier to set up.
(0017098)
shinwell   
2017-01-06 14:23   
It's possible this is something related to the change between 4.02 and 4.03 that allowed the major GC to stop scanning in the middle of blocks and defer the remaining fields until later.

I think it would be worth trying to disable this, since you do seem to have some large blocks (one of the backtraces shows a block with > 5 million fields). This should be achievable by changing byterun/major_gc.c, line 403 (in 4.04): this currently reads "end = size < end ? size : end;" and I think you should change it to "end = size".
(0017099)
frisch   
2017-01-06 14:29   
Could this be related to 0007228 / GPR#546 ?
(0017100)
enguerrand   
2017-01-06 16:15   
We tried to reproduce the issue on some server to ease the compilation time and we noticed that we couldn't reproduce the issue (as of now after running multiple loops for one hour or so).
We also noticed that every machine on which we were able to reproduce the issue was running a CPU of the Intel Skylake family.
Does someone have such a machine available to test this hypothesis ?
(0017101)
Armael   
2017-01-06 16:55   
(edited on: 2017-01-06 16:56)
I can reproduce the bug on my laptop, which has a Skylake CPU (i7-6600U).

A single instance of the loop ran without crashing for around 20 minutes. However, shortly after adding 3 more instances in parallel, two of them crashed with "ocamlopt.opt got signal and exited". The one which was running from the start also crashed quickly after that, with "Fatal error: exception File "utils/timings.ml", line 86, characters 27-33: Assertion failed".

Last relevant lines in dmesg are:

[81744.710293] ocamlopt.opt[348]: segfault at 7fe040f1b000 ip 00000000006b1b04 sp 00007fff9f242bb0 error 4 in ocamlopt.opt[400000+36b000]
[81762.064912] ocamlopt.opt[359]: segfault at fffffffffffffff9 ip 00000000004ba582 sp 00007ffd32705ea0 error 5 in ocamlopt.opt[400000+36b000]

(0017102)
Armael   
2017-01-06 17:02   
I just realized I was running all the parallel instances of the compiler in the same directory, on the same source file. So that's probably the cause of the assertion failure.
(0017103)
Armael   
2017-01-06 17:31   
(edited on: 2017-01-06 17:33)
Just to be sure, I tried with 4 instances of the loop in 4 separate directories. 3 segfaulted almost instantly (after around 15 seconds):

[84373.677406] ocamlopt.opt[1293]: segfault at ffffffffffde530b ip 00000000004b2b34 sp 00007ffd998b1280 error 7 in ocamlopt.opt[400000+36b000]
[84404.445402] ocamlopt.opt[1315]: segfault at b18 ip 00000000004c45a7 sp 00007ffcf29d0fc0 error 4 in ocamlopt.opt[400000+36b000]
[84412.083689] ocamlopt.opt[1325]: segfault at 7f6cb22c3000 ip 00000000006b1b04 sp 00007ffed67919e0 error 4 in ocamlopt.opt[400000+36b000]

The last one finally crashed 2 minutes after:
[84471.527980] ocamlopt.opt[1298]: segfault at 469b61 ip 00007f3485db380b sp 00007ffce2ad6018 error 7 in libc-2.24.so[7f3485c8b000+195000]

(0017104)
enguerrand   
2017-01-06 17:54   
joris mentioned the possibility that it might be related to the runtime being compiled with -O2 instead of -O1 since 4.03, testing 4.03 with -O1 might be interesting too
(0017107)
joris   
2017-01-07 11:45   
(edited on: 2017-01-07 11:46)
I indeed cannot reproduce after 1h with a runtime built with -O1, while i can reproduce in less than 10 minutes with an -O2 runtime (built with gcc 6.2)
This smells like an UB which is wrongly optimized by gcc and causing issues on newer intel CPUs.
I also tried to build the runtime with -fsanitize=address and -fsanitize=undefined. Interestingly i fail to reproduce with this binary, but it is also not reporting interesting things besides the caml_stat_alloc "leak" and some unaligned reads at startup in roots.c (which should be harmless on x86). This is testing on a i7 6700 (no K) Skylake

(0017108)
enguerrand   
2017-01-07 14:36   
I confirm that after compiling the runtime with -O1 I cannot reproduce after a few hours of retries. (gcc 6.2 too)
(0017109)
xleroy   
2017-01-07 15:36   
> We also noticed that every machine on which we were able to reproduce the issue was running a CPU of the Intel Skylake family.

Last Spring, another OCaml (industrial) user reported mysterious semireproducible crashes of a big ocamlopt-compiled program. The crashes would occur only on Skylake processors, and only in the presence of hyperthreading.

If it is possible for you, it would be interesting to turn hyperthreading off (in the BIOS) and try to reproduce again.
(0017110)
joris   
2017-01-08 04:52   
Meh. I cannot reproduce with HT disable indeed after one hour of 4 loops running concurrently...
(0017111)
joris   
2017-01-08 07:25   
For the record I upgraded the uefi firmware and the Intel microcode to latest version it makes no difference.
(0017112)
xleroy   
2017-01-08 10:04   
Thanks for the quick re-test without hyperthreading. This story is consistent with what was observed at the other industrial user in May. Based on their observations and those in this PR, the problem lies in the combination of:
- Skylake
- Hyperthreading
- OCaml 4.03
- compiled with gcc -O2

Is it crazy to imagine that gcc -O2 on the OCaml 4.03 runtime produces a specific instruction sequence that causes hardware issues in (some steppings of) Skylake processors with hyperthreading? Perhaps it is crazy. On the other hand, there was already one documented hardware issue with hyperthreading and Skylake: http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-causes-pcs-to-freeze-during-complex-workloads/ [^]
(0017114)
cullmann   
2017-01-08 10:56   
To solve our issue in May, we went over from using clang instead of GCC as the base compiler for OCaml and the other parts of our toolchain.

Since that switch, no such random crash came up (and we have one Skylake machine that runs longer regression tests in an endless loop the whole year, nothing to be seen after clang usage, daily crashs before)

The question is if it is feasible for you to:

a) try clang, too
b) if that works, try to search for the difference in e.g. the produced assembly
(0017115)
joris   
2017-01-08 12:44   
(edited on: 2017-01-08 12:54)
Whether an hardware bug is a crazy assumption, i don't know though indeed the HT condition is suspicious (Although it could have some other side effects like reducing cache pressure etc).

Cullmann thank you for your feedback, I will try clang later.

I have another piece of information to add : the problem seems to lie in one of the compiled symbols from the major gc. I tried to "bisect" the problematic files by building all runtime files in O2 one by one. Since the crashes seemed to happen during a major marking most of the time i started with major_gc.c and :

 - I can reproduce with runtime built with -O1 and only major_gc.c built with -O2
 - I cannot reproduce with runtime built with -O2 and and only major_gc.c built with -O1
 - I used minor_gc.c and memory.c as test samples and replicated the same test conditions. They behave "as expected", that is do not modify the result (crash when the other files are in O2, cannot reproduce when the other files are in O1)

(0017116)
cullmann   
2017-01-08 13:59   
Btw., I just tried your report.tar.xz, on my home machine, a Skylake Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz, HT on.

4 parallel runs (in different directories) =>

/tmp/camlasm83d2dc.s: Assembler messages:
/tmp/camlasm83d2dc.s:315401: Error: operand type mismatch for `add'
File "test.ml", line 1:
Error: Assembler error, input left in file /tmp/camlasm83d2dc.s

in one of the endless loops after some hours.

(I used opam to get ocaml and the needed ext... lib, 4.03.0)
(0017117)
joris   
2017-01-08 15:14   
Thanks for testing. Hours is a long time, for my last test i don't wait for more than one hour before deciding the bug does not reproduce. I hope it does not invalidate my results. On the other hand, for cases where the bug is triggers it happens almost immediately for me, less than 2 minutes, and after that it happens often.

Anyway, following advises on IRC i refined the bisection logic at function level. It looks like the bug only happens when both sweep_slice and caml_major_collection_slice are optimized with O2 using __attribute__ ((optimize(2, "no-strict-aliasing")))

Diffing the assembly is starting to be far beyond my skill set, but the obvious difference is that it uses SIMD instructions and xmm registers at O2
(0017118)
cullmann   
2017-01-08 15:21   
Interesting, could you attach the assemblies O1/O2 for the two functions?
Perhaps others can derive more from it.
(0017121)
shinwell   
2017-01-09 08:01   
I will produce and have a look at the diff of the assembly code for those functions this morning.
(0017122)
shinwell   
2017-01-09 09:40   
@jorris @enguerrand Can you confirm which version of the microcode you are running on systems which exhibit this bug? It should be listed in /proc/cpuinfo on Linux.

Could you also find out whether it would be possible to update the microcode on one of these systems to the very latest version? This might be doable by updating your BIOS (depending on whether your motherboard manufacturer has released a BIOS containing the updated microcode, which at least one apparently has) or else by using some distribution-specific mechanism.

The Intel errata says that the Skylake bug that causes freezes can also cause "undefined system behaviour". I haven't managed to find any reference on Google to an instance of this bug that did something other than freezing. However it looks like we might have such an instance ourselves here. It would be very interesting to determine whether the microcode update solves the problem. If it does not I think I will file a bug with Intel.
(0017123)
shinwell   
2017-01-09 09:42   
Sorry, I wasn't very explicit in that last post: the freezing erratum has apparently been fixed by the latest microcode (according to Intel) although there appears to be some suggestion in the last comment on <https://communities.intel.com/thread/96157?start=30&tstart=0> [^] that some problem may still persist.
(0017124)
shinwell   
2017-01-09 10:05   
(I missed @joris's post saying that he'd tried this microcode update on one box, but we still need to confirm what version number that was)
(0017125)
enguerrand   
2017-01-09 10:12   
Two macbooks encountering the issues seems to be using the latest microcode available, cpu infos:

> sysctl -a | grep machdep.cpu
machdep.cpu.tsc_ccc.denominator: 2
machdep.cpu.tsc_ccc.numerator: 226
machdep.cpu.thread_count: 8
machdep.cpu.core_count: 4
machdep.cpu.address_bits.virtual: 48
machdep.cpu.address_bits.physical: 39
machdep.cpu.tlb.data.small_level1: 64
machdep.cpu.tlb.data.small: 64
machdep.cpu.tlb.inst.large: 8
machdep.cpu.cache.size: 256
machdep.cpu.cache.L2_associativity: 4
machdep.cpu.cache.linesize: 64
machdep.cpu.arch_perf.fixed_width: 48
machdep.cpu.arch_perf.fixed_number: 3
machdep.cpu.arch_perf.events: 0
machdep.cpu.arch_perf.events_number: 7
machdep.cpu.arch_perf.width: 48
machdep.cpu.arch_perf.number: 4
machdep.cpu.arch_perf.version: 4
machdep.cpu.xsave.extended_state1: 15 832 256 0
machdep.cpu.xsave.extended_state: 31 832 1088 0
machdep.cpu.thermal.energy_policy: 1
machdep.cpu.thermal.hardware_feedback: 0
machdep.cpu.thermal.package_thermal_intr: 1
machdep.cpu.thermal.fine_grain_clock_mod: 1
machdep.cpu.thermal.core_power_limits: 1
machdep.cpu.thermal.ACNT_MCNT: 1
machdep.cpu.thermal.thresholds: 2
machdep.cpu.thermal.invariant_APIC_timer: 1
machdep.cpu.thermal.dynamic_acceleration: 1
machdep.cpu.thermal.sensor: 1
machdep.cpu.mwait.sub_Cstates: 286531872
machdep.cpu.mwait.extensions: 3
machdep.cpu.mwait.linesize_max: 64
machdep.cpu.mwait.linesize_min: 64
machdep.cpu.processor_flag: 5
machdep.cpu.microcode_version: 174
machdep.cpu.cores_per_package: 8
machdep.cpu.logical_per_package: 16
machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.brand: 0
machdep.cpu.signature: 329443
machdep.cpu.extfeature_bits: 1241984796928
machdep.cpu.leaf7_feature_bits: 43806655
machdep.cpu.feature_bits: 9221960262849657855
machdep.cpu.stepping: 3
machdep.cpu.extfamily: 0
machdep.cpu.extmodel: 5
machdep.cpu.model: 94
machdep.cpu.family: 6
machdep.cpu.brand_string: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
machdep.cpu.vendor: GenuineIntel
machdep.cpu.max_ext: 2147483656
machdep.cpu.max_basic: 22
(0017126)
joris   
2017-01-09 10:59   
and on my office machine microcode 0x76, which contains the fix. I upgraded to latest version at home which i cannot give right now. In any case i find it suspicious that it could be skylake erratum 121 because it involves AVX. The only thing which can add AVX to ocamlopt is glibc, i can try to build statically to confirm without avx if you want.
Also, i observed nearly one hundred crashes by now and none of them leaked out of the process address space, they didn't affect any other system component.
I found something interesting i think, might report this in an hour once i'm confident enough in the results
(0017127)
cullmann   
2017-01-09 11:02   
We did all microcode updates, too, and we still got the crash, and the same behavior, all other things work on the machines (and we have some) beside OCaml, which is strange, as other things use much more AVX, like the glibc things and we did extensive prime95 runs on some machines which lead to no issues.
(0017128)
shinwell   
2017-01-09 11:16   
I was suspecting it might be related to erratum SKL082, not SKL121. I haven't managed to find anything very definitive about what triggers 082, but it sounds like it's probably floating point code (not necessarily AVX). See page 47 of:

http://www.intel.co.uk/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf [^]
(0017129)
joris   
2017-01-09 12:02   
(edited on: 2017-01-09 12:03)
I spent some time looking at the set of enabled optimizations at -O2, and checking for interesting one, then trying to enable/disable looking at what impact it had on assembly. After refining a set of interesting optimization i performed an exhaustive check, and it appears Value Range Propagation is enough to trigger the issue.

It appears that :

a) when runtime is built with -O2 -fno-tree-vrp the crash does not manifest after one hour of testing
b) when runtime is built with -O2 and sweep_slice and major_collection_slice are built in O2 with no-tree-vrp the crash does not manifest
c) building the runtime in -O1 with sweep_slice and major_collection_slice built in O1 -ftree-vrp is enough to reproduce.

Could it be some range check bound assumptions being wrongly propagated down to the sweep phase ? i didn't spot anything obvious yet, i will upload assembly diff.

I only noticed that at major_gc.c:764 (4.03 branch)
    computed_work = (intnat) (p * caml_stat_heap_wsz * 5 / 3);

there is no cast to double and this branch is taken in sweep_slice. But it should be harmless as far as i know since p is a double and i fail to see what problem it might cause later.

(0017130)
joris   
2017-01-09 12:18   
(edited on: 2017-01-09 12:19)
I just realized that this is line is a small variation of the issue Alain Frisch posted two days ago...

(0017131)
joris   
2017-01-09 12:40   
I attached :
 - opt.patch, a diff on top of ocaml 4.03 (2449d2fc7) which is enough to trigger the bug on top of -O1
 - unopt.s the assembly generated by gcc for major_gc.c in -O1
 - opt.s the assembly generated by gcc for major_gc.c with opt.patch (bad binary)
(0017132)
enguerrand   
2017-01-09 12:42   
I also attached (even though I can't reproduce under Clang 3.9):
- major_gc_clang3.9_O1.s : major_gc.c compiled on top of 4.03.0 with Clang 3.9 with the -O1 optimisation flag
- major_gc_clang3.9_O2.s : major_gc.c compiled on top of 4.03.0 with Clang 3.9 with the -O2 optimisation flag
(0017133)
shinwell   
2017-01-09 13:09   
I am going to make some investigations and report back.
(0017135)
cullmann   
2017-01-09 17:55   
For the record, my machine which can reproduce the crash has that stepping + microcode:

cullmann@neko:~> cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
stepping : 3
microcode : 0x9e
(0017138)
joris   
2017-01-10 05:09   
test.diff attached, containing diff between a good and bad build for a dump of gcc "tree" internal representation (sort of ssa it looks like). Now this is would lead to rule out bug in ocaml because the change is really minimal. It appears it just shortcut testing the loop condition in sweep_slice when the else branch is taken and work has not been updated. That appears to be the only change
(0017140)
joris   
2017-01-10 11:36   
I believe i have found something interesting. At some point i did a careful review of sweep_slice function and i noticed this line:

      work -= Whsize_hd (hd); (major_gc.c:545)

This macro returns an unsized int because hd is header_t. If i understand C standard correctly it means that work -= size is similar to work = work - size, and the substraction operands will be promoted to unsigned long.

I checked gcc tree SSA dump and it indeed looks like this is what GCC is doing. I tried to replace this line with

      work = work - (intnat) (Whisze_hd (hd));

It does indeed make some difference in SSA tree (just add a cast and properly execute the arithmetic substraction with signed temporaries), but i must admit i fail to understand why it would cause a segfault in this case since generated assembly used signed instruction (notq then addq). It does make some difference in assembly though, the loop condition is reversed:
it uses testq + jle instead of testq + jg, and it uses an additional register. I do not really understand what changes it makes.

That being said i'm trying to reproduce this bug for 2hours and it has not crashed in O2 with this change yet

(0017143)
xleroy   
2017-01-10 19:12   
I admire your ingenuity in searching for GCC miscompilation issues or undefined behaviors in OCaml's sources. Yet, those issues would produce reproducible crashes, which is not the case here. Also, they would not account for the fact that crashes are observed only with Skylake and hyperthreading.
(0017146)
shinwell   
2017-01-11 09:54   
I am pursuing independently the possibility of this being a CPU bug.
I agree with what @xleroy said, although having a smaller diff of the assembly code for major_gc.c than you had before would be useful; it looks like you may now be able to obtain that.
(0017147)
frisch   
2017-01-11 10:10   
> reproducible crashes

Couldn't there be undefined behaviors at he CPU level, which would lead to non-reproducible situations depending e.g. on physical memory addresses?

Also, it is hard to get a fully reproducible behavior at the OCaml level itself. Simply getting the current pid and printing it to a string (e.g. for logging purpose) can lead to different allocation scheme of the program (depending on the length of the printed pid).
(0017148)
joris   
2017-01-11 10:15   
Honestly, i have found this and i spent some time looking at the assembly produced but it makes no sense to me why it would behave differently. Still, please find attached :

 -major_gc_with_intnat_cast_o2.s built with gcc -O2 but with the previously described (intnat) cast. It has not crashed in 17 hours.
 - major_gc_without_cast_o2.patch the diff between this and the assembly produced by O2 without the patch, which crash.

As far as i can tell it just aligned the .text of the hot loop and use an additional temp register and the operands of work -= Whsize_hd(hd) are reversed. Besides that...
(0017150)
joris   
2017-01-11 10:37   
> reproducible crashes

btw we tried to disable ASLR with setarch x86_64 -R ocamlfind opt... but it didn't help.
Also it is interesting to note that all the crashes happen during marking, apparently visiting corrupted memory as block header. It could be that the problem is always here but we don't see it every time, because sometimes it does not crash but gas complains about invalid input, sometime gas warns and compilation is successful.
It could be that the others time there is not visible effects. I can try diffing the emitted assembly file
(0017151)
schommer   
2017-01-11 12:40   
During our investigation of the crashes I ran ocaml a few times with the undefined behavior sanitizer of clang, as well as clang-check and most warnings/errors reported were because of unaligned memory access.
(0017154)
joris   
2017-01-12 01:25   
So you might have to disregard basically everything i said. I kept the patched binary running in a loop and after respectively 28h and 32h two processes crashed... So it might just be some side effect affecting how often the bug is triggered.
(0017155)
joris   
2017-01-12 07:58   
Just to clarify the cast patch has crashed, -fno-free-vrp has not
(0017191)
shinwell   
2017-01-25 15:31   
I'd like to find out whether this problem manifests itself if the execution of the OCaml compiler is pinned to a particular processor core.

You can probably do this on Linux by using "taskset" as a wrapper around ocamlopt.opt or else by altering the runtime (e.g. in asmrun/startup.c) to call sched_setaffinity. Could you try this on a Skylake system to see if it makes the problem go away?

(As far as I understand it the problem has not been reproduced unless hyperthreading is enabled. If that is correct, hyperthreading should also be enabled for this experiment.)
(0017192)
xleroy   
2017-01-25 16:07   
Lucky me, my new workstation at Inria is a Skylake Xeon, 4 cores, 8 threads, so I could play with the original repro case.

Without setting processor affinities: it is easy to reproduce the crash (in a few minutes at most) by running at least 5 copies of the compilation task in parallel. With 4 copies the crash happens but takes much longer. With 3 copies or less I didn't observe it in a couple of hours.

By setting processor affinities, I see the crash in a few minutes with only two compilations run in parallel, provided they are mapped on the same physical core (e.g. logical cores 1 and 5 on my machine). Two parallel runs mapped to different physical cores (e.g. logical cores 1 and 2 on my machine) have been running for 1 hour already without a glitch, but I'll let them run overnight. Finally, I'm also trying two parallel runs mapped to the same logical processor, for reference. More results tomorrow.
(0017196)
xleroy   
2017-01-26 10:13   
More results from my overnight runs:
- Two runs in parallel, mapped to different physical cores: this crashed eventually but only after 6 hours or so. Note that the machine was not completely idle, so the other logical processors of those physical cores were probably used (lightly).
- Two runs in parallel mapped to the same logical processor: no crash after 12 hours.

(0017615)
shinwell   
2017-03-10 11:24   
As an update, this is still being investigated at Intel.
(0017815)
ygrek   
2017-05-26 20:42   
Any chance it is what intel microcode update talks about?

http://metadata.ftp-master.debian.org/changelogs/non-free/i/intel-microcode/intel-microcode_3.20170511.1_changelog [^]

      Likely fix nightmare-level Skylake erratum SKL150. Fortunately,
      either this erratum is very-low-hitting, or gcc/clang/icc/msvc
      won't usually issue the affected opcode pattern and it ends up
      being rare.
      SKL150 - Short loops using both the AH/BH/CH/DH registers and
      the corresponding wide register *may* result in unpredictable
      system behavior. Requires both logical processors of the same
      core (i.e. sibling hyperthreads) to be active to trigger, as
      well as a "complex set of micro-architectural conditions"
(0017816)
joris   
2017-05-27 03:33   
Everything in this description matches this issue. I will have to wait monday to test this though.
Thank you everyone for the time spent on this problem
(0017824)
joris   
2017-05-29 03:28   
microcode update appears to fix the crash (microcode version 0xba). I believe this issue can be closed
(0017825)
shinwell   
2017-05-29 07:54   
Interesting. I will see if I can get Intel to confirm this. Let's leave this issue open for the moment.
(0017826)
shinwell   
2017-05-29 10:27   
I looked at the code of sweep_slice, which was conjectured to be one of the functions affected (see above, and the attachment opt.s); indeed it appears that perhaps it might trigger the problem. There is a loop with fewer than 64 instructions using both the %ah register and %rax. The use of %ah is probably quite unusual, but GCC is generating it to deal with the GC tag bits inside a header word.
(0017827)
shinwell   
2017-05-29 10:55   
By the way the original Intel description is here, on the page numbered 65:

http://www.intel.co.uk/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf [^]
(0017828)
xleroy   
2017-05-29 14:48   
(edited on: 2017-05-31 14:04)
After updating the microcode on my Xeon E3 Skylake, the test that used to crash in a few minutes has been running for 6 hours without a hiccup. I'll leave it running for a few days, but it looks like the problem is nailed down and fixed.

Update: the test ran for 50 hours and produced no failures.

(0017834)
shinwell   
2017-05-31 10:30   
This problem also affects Kaby Lake systems (erratum KBL095); however, I'm unsure if a microcode fix has been released publicly for such systems. The best solution is to disable hyperthreading for the moment.

Yesterday Fred and I experimented on a Kaby Lake machine by changing the generated assembly from GCC for major_gc.c so that it didn't reference registers such as %ah. The problem was not reproducible after that change, whereas it was almost immediately reproducible before.

If there are no further developments by the start of next week, I think we can close this issue.
(0017837)
joris   
2017-06-02 03:10   
(edited on: 2017-06-02 03:11)
I see you changed the description which should help people searching for this issue in the future.

It should be noted that since the issue is triggered by the major gc, it's not only compiler. Any long running ocaml program has a high chance of triggering this, and it will not always crash. You can get corrupted data in memory and never crashing.

As an example we tried to deploy some tool on a large xeon skylake cluster, several hundred processes. They didn't crash in hours, but very quickly we saw corrupted data being sent over the network/written into the database.

So anyone reading this in the future, don't assume this is only compiler, and don't run critical code on skylake/kaby lake without updating the firmware if you don't want to end up in a nightmarish situation.

(0017838)
shinwell   
2017-06-02 11:28   
Agreed, I've updated the title of this issue.
(0017863)
shinwell   
2017-06-09 19:02   
Closing this issue as per the above.