Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0007452OCamlback end (clambda to assembly)public2017-01-06 03:392017-06-09 19:02
Assigned Toshinwell 
StatusclosedResolutionno change required 
PlatformLinuxOSDebianOS Version
Product Version4.03.0 
Target VersionlaterFixed in Version 
Summary0007452: Segfaults or wrong code execution on Intel Skylake / Kaby Lake CPUs with hyperthreading enabled
DescriptionWhile switching a 4.02.3 codebase to 4.03 recently, we stumbled upon some random crashes from the compiler, and more rarely, occurrences of bad assembly code being generated (which as failed to compile), or instruction being trapped at runtime while the compiler is running.

Those problems occurs on an OCaml source file generated using the Extprot library.

The problem doesn't seems to happen all the time.
Most of the time, the file will compile successfully, and if enough retries are given, the compiler will then crash, example of returns from dmesg after a few crashes:

[22241.838551] ocamlopt.opt[48175]: segfault at ffffffffffde7768 ip 000055f75e412e3c sp 00007ffc3ee31de0 error 7 in ocamlopt.opt[55f75e0b6000+613000]
[22985.879907] ocamlopt.opt[48221]: segfault at af8 ip 00005564455169bd sp 00007ffc9f36b130 error 4 in ocamlopt.opt[556445006000+613000]
[23936.341126] ocamlopt.opt[48306]: segfault at 5837 ip 00005641554a16c8 sp 00007ffe1278f8e0 error 4 in ocamlopt.opt[56415514a000+613000]
[25395.780978] ocamlopt.opt[48445]: segfault at ffffffffffde7608 ip 0000557e25ea5cf4 sp 00007ffc2eac79d0 error 5 in ocamlopt.opt[557e25b49000+613000]

Backtraces obtained for those crashes give us informations which doesn't seems to show always the same thing. Example backtraces can be found in the attached archive.

The compiler will more rarely generated an assembly file that as won't be able to compile:

/tmp/camlasmc92578.s: Assembler messages:
/tmp/camlasmc92578.s:1005308: Error: operand type mismatch for `add'

Where the line 1005308 is: add $2300, $5199


/tmp/camlasm601e1c.s: Assembler messages:
/tmp/camlasm601e1c.s:820172: Error: operand type mismatch for `or'

Where the line 820172 is: orq $139950828249720, %rax

We haven't noticed as of now any misbehaviour in a successfully compiled and running instance of this file, but the issue is still very new for us so we will be watching it closely.
Steps To ReproduceThe problem doesn't seems to happen all the time, at least it doesn't crash at every build. We sometimes don't witness the crash before 30 minutes of retries.

Steps to reproduce:

OCaml 4.03 and 4.04 has been witnessed as triggering the problem.
Sample file is attached as the test case used to reproduce the problem: Extprot library must be installed in order to compile the file, since it was generated using Extprot. (we use the latest version from Opam)

Test case can be found in the attachment (

To reproduce:
Just compile this file, preferably in a loop, with this command:

while ocamlfind opt -c -g -bin-annot -ccopt -g -ccopt -O2 -ccopt -Wextra -ccopt '-Wstrict-overflow=5' -thread -w +a-4-40..42-44-45-48-58 -w -27-32 -package extprot -o test.cmx; do echo "ok"; done

Additional Information- If the crash doesn't occur for some time, after it occured again at least once, the probability of the compiler crashing seems to be increasing

- Crash was witnessed running ocamlopt and ocamlopt.opt
TagsNo tags attached.
Attached Files? file icon report.tar.xz [^] (86,172 bytes) 2017-01-06 03:39
? file icon opt.s [^] (97,082 bytes) 2017-01-09 12:35
? file icon unopt.s [^] (100,286 bytes) 2017-01-09 12:36
patch file icon opt.patch [^] (1,260 bytes) 2017-01-09 12:37 [Show Content]
? file icon major_gc_clang3.9_O1.s [^] (51,661 bytes) 2017-01-09 12:38
? file icon major_gc_clang3.9_O2.s [^] (58,487 bytes) 2017-01-09 12:39
diff file icon test.diff [^] (1,443 bytes) 2017-01-10 05:07 [Show Content]
patch file icon major_gc_without_cast_o2.patch [^] (31,524 bytes) 2017-01-11 10:10 [Show Content]
? file icon major_gc_with_intnat_cast_o2.s [^] (38,175 bytes) 2017-01-11 10:11

- Relationships

-  Notes
joris (reporter)
2017-01-06 04:38

Traces are almost useless, but the memory corruption happens as frequently in byte code, with similar traces, making the runtime crash in GC or in runtime:

corrupted stack:

 Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055555557d336 in caml_interprete (prog=<optimized out>,
    prog_size=<optimized out>) at interp.c:298
298 accu = sp[0]; Next;
(gdb) bt
#0 0x000055555557d336 in caml_interprete (prog=<optimized out>,
    prog_size=<optimized out>) at interp.c:298
Backtrace stopped: Cannot access memory at address 0x7fffffffda68

Corrupted heap:

#0 0x000055687aa72534 in mark_slice_darken (slice_pointers=<synthetic pointer>,
    in_ephemeron=0, i=11803660, v=140605172023200, gray_vals_ptr=0x55687c7fead8)
    at major_gc.c:237
#1 mark_slice (work=853950) at major_gc.c:407
#2 0x000055687aa73384 in caml_major_collection_slice (howmuch=howmuch@entry=-1)
    at major_gc.c:767
0000003 0x000055687aa743ff in caml_gc_dispatch () at minor_gc.c:458
0000004 0x000055687aa8b807 in caml_interprete (prog=<optimized out>,
    prog_size=<optimized out>) at interp.c:673
0000005 0x000055687aa8cdf9 in caml_main (argv=0x7ffe02927368) at startup.c:374
0000006 0x000055687aa71b1c in main (argc=<optimized out>, argv=<optimized out>)
    at main.c:35

this looks like a buffer overrun while walking a large block, but the block is strange as far as i can tell:

(gdb) p (unsigned char)*(((uint64_t *)v) - 1)
$40 = 200 '\310'
(gdb) p *(((uint64_t *)v) - 1) >> 10
$45 = 137309738338
shinwell (developer)
2017-01-06 10:46

I will see if I can reproduce this.

In the meantime, have you reproduced this failure on more than one machine?
enguerrand (reporter)
2017-01-06 11:00

Reproduced it on a few differents machines (all running Debian 64bits), either VM or physical.
Still haven't tried under something else than Linux, I will check if I can reproduce under OSX
shinwell (developer)
2017-01-06 11:13

Some subset (although not all) of the symptoms exhibited in this report kind of look like stack overflow. To rule out such problems, can you try to reproduce this having run "ulimit -s unlimited" first? (I'm also trying to reproduce it but don't know yet whether I will be successful in doing so.)
enguerrand (reporter)
2017-01-06 11:40

We suspected stack overflow too and we tried to reproduce with a very large stack size and unlimited, and crashes still happened. I tried again just now just to be sure and the result is still the same
shinwell (developer)
2017-01-06 12:03

Right, that's what I was expecting.
shinwell (developer)
2017-01-06 12:16

I wonder if it's because the parameter called "i" of the function mark_slice_darken is of type "int". I think it should be of type "mlsize_t" since it's a field index. I wouldn't be surprised if this gargantuan source file produces a block that has sufficiently many fields for "i" to overflow at the moment.

Can you try changing that in your compiler tree (in 4.04 it's in byterun/major.gc line 232) and seeing if the problem goes away? I haven't managed to reproduce it yet.
shinwell (developer)
2017-01-06 12:22

(I will produce a GPR once you confirm)
shinwell (developer)
2017-01-06 12:28

Although actually, if that were to be the bug, I think ocamlopt.opt would use more memory than it does when compiling the file. So that might not be it---but I think it's wrong in any case.
joris (reporter)
2017-01-06 12:30

Indeed. I've launch some test to try but since this is scanning a in memory block and since ocamlopt never use more than 1.5G RES in this case i doubt it can overflow i
joris (reporter)
2017-01-06 12:56

Reproduced with i changed to mlsize_t, crash in ml_mark_slice, but i don't have many more info i forgot to build the runtime with -g this time
shinwell (developer)
2017-01-06 13:01

Given you seem to be able to reproduce this easily, can you try to get a valgrind trace?
shinwell (developer)
2017-01-06 13:02

(Please build the runtime with -g before doing that, in case it gives any more info)
shinwell (developer)
2017-01-06 13:10

In fact, another thing to try: please adjust the compiler Makefile so that it uses the debug runtime (you need a configure flag, and then I think it's "-runtime-variant d" for compile/link). This might pick something up as it enables a lot of assertions.
enguerrand (reporter)
2017-01-06 13:17

I will give Valgrind a bit later.
It's not that easy to reproduce even on our end, we sometimes wait 30 minutes with 5 concurrent loops to see it happen, its kind of erratic
shinwell (developer)
2017-01-06 13:33

OK, well maybe try the debug runtime first then, since that's probably going to be a bit easier to set up.
shinwell (developer)
2017-01-06 14:23

It's possible this is something related to the change between 4.02 and 4.03 that allowed the major GC to stop scanning in the middle of blocks and defer the remaining fields until later.

I think it would be worth trying to disable this, since you do seem to have some large blocks (one of the backtraces shows a block with > 5 million fields). This should be achievable by changing byterun/major_gc.c, line 403 (in 4.04): this currently reads "end = size < end ? size : end;" and I think you should change it to "end = size".
frisch (developer)
2017-01-06 14:29

Could this be related to 0007228 / GPR#546 ?
enguerrand (reporter)
2017-01-06 16:15

We tried to reproduce the issue on some server to ease the compilation time and we noticed that we couldn't reproduce the issue (as of now after running multiple loops for one hour or so).
We also noticed that every machine on which we were able to reproduce the issue was running a CPU of the Intel Skylake family.
Does someone have such a machine available to test this hypothesis ?
Armael (manager)
2017-01-06 16:55
edited on: 2017-01-06 16:56

I can reproduce the bug on my laptop, which has a Skylake CPU (i7-6600U).

A single instance of the loop ran without crashing for around 20 minutes. However, shortly after adding 3 more instances in parallel, two of them crashed with "ocamlopt.opt got signal and exited". The one which was running from the start also crashed quickly after that, with "Fatal error: exception File "utils/", line 86, characters 27-33: Assertion failed".

Last relevant lines in dmesg are:

[81744.710293] ocamlopt.opt[348]: segfault at 7fe040f1b000 ip 00000000006b1b04 sp 00007fff9f242bb0 error 4 in ocamlopt.opt[400000+36b000]
[81762.064912] ocamlopt.opt[359]: segfault at fffffffffffffff9 ip 00000000004ba582 sp 00007ffd32705ea0 error 5 in ocamlopt.opt[400000+36b000]

Armael (manager)
2017-01-06 17:02

I just realized I was running all the parallel instances of the compiler in the same directory, on the same source file. So that's probably the cause of the assertion failure.
Armael (manager)
2017-01-06 17:31
edited on: 2017-01-06 17:33

Just to be sure, I tried with 4 instances of the loop in 4 separate directories. 3 segfaulted almost instantly (after around 15 seconds):

[84373.677406] ocamlopt.opt[1293]: segfault at ffffffffffde530b ip 00000000004b2b34 sp 00007ffd998b1280 error 7 in ocamlopt.opt[400000+36b000]
[84404.445402] ocamlopt.opt[1315]: segfault at b18 ip 00000000004c45a7 sp 00007ffcf29d0fc0 error 4 in ocamlopt.opt[400000+36b000]
[84412.083689] ocamlopt.opt[1325]: segfault at 7f6cb22c3000 ip 00000000006b1b04 sp 00007ffed67919e0 error 4 in ocamlopt.opt[400000+36b000]

The last one finally crashed 2 minutes after:
[84471.527980] ocamlopt.opt[1298]: segfault at 469b61 ip 00007f3485db380b sp 00007ffce2ad6018 error 7 in[7f3485c8b000+195000]

enguerrand (reporter)
2017-01-06 17:54

joris mentioned the possibility that it might be related to the runtime being compiled with -O2 instead of -O1 since 4.03, testing 4.03 with -O1 might be interesting too
joris (reporter)
2017-01-07 11:45
edited on: 2017-01-07 11:46

I indeed cannot reproduce after 1h with a runtime built with -O1, while i can reproduce in less than 10 minutes with an -O2 runtime (built with gcc 6.2)
This smells like an UB which is wrongly optimized by gcc and causing issues on newer intel CPUs.
I also tried to build the runtime with -fsanitize=address and -fsanitize=undefined. Interestingly i fail to reproduce with this binary, but it is also not reporting interesting things besides the caml_stat_alloc "leak" and some unaligned reads at startup in roots.c (which should be harmless on x86). This is testing on a i7 6700 (no K) Skylake

enguerrand (reporter)
2017-01-07 14:36

I confirm that after compiling the runtime with -O1 I cannot reproduce after a few hours of retries. (gcc 6.2 too)
xleroy (administrator)
2017-01-07 15:36

> We also noticed that every machine on which we were able to reproduce the issue was running a CPU of the Intel Skylake family.

Last Spring, another OCaml (industrial) user reported mysterious semireproducible crashes of a big ocamlopt-compiled program. The crashes would occur only on Skylake processors, and only in the presence of hyperthreading.

If it is possible for you, it would be interesting to turn hyperthreading off (in the BIOS) and try to reproduce again.
joris (reporter)
2017-01-08 04:52

Meh. I cannot reproduce with HT disable indeed after one hour of 4 loops running concurrently...
joris (reporter)
2017-01-08 07:25

For the record I upgraded the uefi firmware and the Intel microcode to latest version it makes no difference.
xleroy (administrator)
2017-01-08 10:04

Thanks for the quick re-test without hyperthreading. This story is consistent with what was observed at the other industrial user in May. Based on their observations and those in this PR, the problem lies in the combination of:
- Skylake
- Hyperthreading
- OCaml 4.03
- compiled with gcc -O2

Is it crazy to imagine that gcc -O2 on the OCaml 4.03 runtime produces a specific instruction sequence that causes hardware issues in (some steppings of) Skylake processors with hyperthreading? Perhaps it is crazy. On the other hand, there was already one documented hardware issue with hyperthreading and Skylake: [^]
cullmann (reporter)
2017-01-08 10:56

To solve our issue in May, we went over from using clang instead of GCC as the base compiler for OCaml and the other parts of our toolchain.

Since that switch, no such random crash came up (and we have one Skylake machine that runs longer regression tests in an endless loop the whole year, nothing to be seen after clang usage, daily crashs before)

The question is if it is feasible for you to:

a) try clang, too
b) if that works, try to search for the difference in e.g. the produced assembly
joris (reporter)
2017-01-08 12:44
edited on: 2017-01-08 12:54

Whether an hardware bug is a crazy assumption, i don't know though indeed the HT condition is suspicious (Although it could have some other side effects like reducing cache pressure etc).

Cullmann thank you for your feedback, I will try clang later.

I have another piece of information to add : the problem seems to lie in one of the compiled symbols from the major gc. I tried to "bisect" the problematic files by building all runtime files in O2 one by one. Since the crashes seemed to happen during a major marking most of the time i started with major_gc.c and :

 - I can reproduce with runtime built with -O1 and only major_gc.c built with -O2
 - I cannot reproduce with runtime built with -O2 and and only major_gc.c built with -O1
 - I used minor_gc.c and memory.c as test samples and replicated the same test conditions. They behave "as expected", that is do not modify the result (crash when the other files are in O2, cannot reproduce when the other files are in O1)

cullmann (reporter)
2017-01-08 13:59

Btw., I just tried your report.tar.xz, on my home machine, a Skylake Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz, HT on.

4 parallel runs (in different directories) =>

/tmp/camlasm83d2dc.s: Assembler messages:
/tmp/camlasm83d2dc.s:315401: Error: operand type mismatch for `add'
File "", line 1:
Error: Assembler error, input left in file /tmp/camlasm83d2dc.s

in one of the endless loops after some hours.

(I used opam to get ocaml and the needed ext... lib, 4.03.0)
joris (reporter)
2017-01-08 15:14

Thanks for testing. Hours is a long time, for my last test i don't wait for more than one hour before deciding the bug does not reproduce. I hope it does not invalidate my results. On the other hand, for cases where the bug is triggers it happens almost immediately for me, less than 2 minutes, and after that it happens often.

Anyway, following advises on IRC i refined the bisection logic at function level. It looks like the bug only happens when both sweep_slice and caml_major_collection_slice are optimized with O2 using __attribute__ ((optimize(2, "no-strict-aliasing")))

Diffing the assembly is starting to be far beyond my skill set, but the obvious difference is that it uses SIMD instructions and xmm registers at O2
cullmann (reporter)
2017-01-08 15:21

Interesting, could you attach the assemblies O1/O2 for the two functions?
Perhaps others can derive more from it.
shinwell (developer)
2017-01-09 08:01

I will produce and have a look at the diff of the assembly code for those functions this morning.
shinwell (developer)
2017-01-09 09:40

@jorris @enguerrand Can you confirm which version of the microcode you are running on systems which exhibit this bug? It should be listed in /proc/cpuinfo on Linux.

Could you also find out whether it would be possible to update the microcode on one of these systems to the very latest version? This might be doable by updating your BIOS (depending on whether your motherboard manufacturer has released a BIOS containing the updated microcode, which at least one apparently has) or else by using some distribution-specific mechanism.

The Intel errata says that the Skylake bug that causes freezes can also cause "undefined system behaviour". I haven't managed to find any reference on Google to an instance of this bug that did something other than freezing. However it looks like we might have such an instance ourselves here. It would be very interesting to determine whether the microcode update solves the problem. If it does not I think I will file a bug with Intel.
shinwell (developer)
2017-01-09 09:42

Sorry, I wasn't very explicit in that last post: the freezing erratum has apparently been fixed by the latest microcode (according to Intel) although there appears to be some suggestion in the last comment on <> [^] that some problem may still persist.
shinwell (developer)
2017-01-09 10:05

(I missed @joris's post saying that he'd tried this microcode update on one box, but we still need to confirm what version number that was)
enguerrand (reporter)
2017-01-09 10:12

Two macbooks encountering the issues seems to be using the latest microcode available, cpu infos:

> sysctl -a | grep machdep.cpu
machdep.cpu.tsc_ccc.denominator: 2
machdep.cpu.tsc_ccc.numerator: 226
machdep.cpu.thread_count: 8
machdep.cpu.core_count: 4
machdep.cpu.address_bits.virtual: 48
machdep.cpu.address_bits.physical: 39 64 64
machdep.cpu.tlb.inst.large: 8
machdep.cpu.cache.size: 256
machdep.cpu.cache.L2_associativity: 4
machdep.cpu.cache.linesize: 64
machdep.cpu.arch_perf.fixed_width: 48
machdep.cpu.arch_perf.fixed_number: 3 0
machdep.cpu.arch_perf.events_number: 7
machdep.cpu.arch_perf.width: 48
machdep.cpu.arch_perf.number: 4
machdep.cpu.arch_perf.version: 4
machdep.cpu.xsave.extended_state1: 15 832 256 0
machdep.cpu.xsave.extended_state: 31 832 1088 0
machdep.cpu.thermal.energy_policy: 1
machdep.cpu.thermal.hardware_feedback: 0
machdep.cpu.thermal.package_thermal_intr: 1
machdep.cpu.thermal.fine_grain_clock_mod: 1
machdep.cpu.thermal.core_power_limits: 1
machdep.cpu.thermal.ACNT_MCNT: 1
machdep.cpu.thermal.thresholds: 2
machdep.cpu.thermal.invariant_APIC_timer: 1
machdep.cpu.thermal.dynamic_acceleration: 1
machdep.cpu.thermal.sensor: 1
machdep.cpu.mwait.sub_Cstates: 286531872
machdep.cpu.mwait.extensions: 3
machdep.cpu.mwait.linesize_max: 64
machdep.cpu.mwait.linesize_min: 64
machdep.cpu.processor_flag: 5
machdep.cpu.microcode_version: 174
machdep.cpu.cores_per_package: 8
machdep.cpu.logical_per_package: 16
machdep.cpu.brand: 0
machdep.cpu.signature: 329443
machdep.cpu.extfeature_bits: 1241984796928
machdep.cpu.leaf7_feature_bits: 43806655
machdep.cpu.feature_bits: 9221960262849657855
machdep.cpu.stepping: 3
machdep.cpu.extfamily: 0
machdep.cpu.extmodel: 5
machdep.cpu.model: 94 6
machdep.cpu.brand_string: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
machdep.cpu.vendor: GenuineIntel
machdep.cpu.max_ext: 2147483656
machdep.cpu.max_basic: 22
joris (reporter)
2017-01-09 10:59

and on my office machine microcode 0x76, which contains the fix. I upgraded to latest version at home which i cannot give right now. In any case i find it suspicious that it could be skylake erratum 121 because it involves AVX. The only thing which can add AVX to ocamlopt is glibc, i can try to build statically to confirm without avx if you want.
Also, i observed nearly one hundred crashes by now and none of them leaked out of the process address space, they didn't affect any other system component.
I found something interesting i think, might report this in an hour once i'm confident enough in the results
cullmann (reporter)
2017-01-09 11:02

We did all microcode updates, too, and we still got the crash, and the same behavior, all other things work on the machines (and we have some) beside OCaml, which is strange, as other things use much more AVX, like the glibc things and we did extensive prime95 runs on some machines which lead to no issues.
shinwell (developer)
2017-01-09 11:16

I was suspecting it might be related to erratum SKL082, not SKL121. I haven't managed to find anything very definitive about what triggers 082, but it sounds like it's probably floating point code (not necessarily AVX). See page 47 of: [^]
joris (reporter)
2017-01-09 12:02
edited on: 2017-01-09 12:03

I spent some time looking at the set of enabled optimizations at -O2, and checking for interesting one, then trying to enable/disable looking at what impact it had on assembly. After refining a set of interesting optimization i performed an exhaustive check, and it appears Value Range Propagation is enough to trigger the issue.

It appears that :

a) when runtime is built with -O2 -fno-tree-vrp the crash does not manifest after one hour of testing
b) when runtime is built with -O2 and sweep_slice and major_collection_slice are built in O2 with no-tree-vrp the crash does not manifest
c) building the runtime in -O1 with sweep_slice and major_collection_slice built in O1 -ftree-vrp is enough to reproduce.

Could it be some range check bound assumptions being wrongly propagated down to the sweep phase ? i didn't spot anything obvious yet, i will upload assembly diff.

I only noticed that at major_gc.c:764 (4.03 branch)
    computed_work = (intnat) (p * caml_stat_heap_wsz * 5 / 3);

there is no cast to double and this branch is taken in sweep_slice. But it should be harmless as far as i know since p is a double and i fail to see what problem it might cause later.

joris (reporter)
2017-01-09 12:18
edited on: 2017-01-09 12:19

I just realized that this is line is a small variation of the issue Alain Frisch posted two days ago...

joris (reporter)
2017-01-09 12:40

I attached :
 - opt.patch, a diff on top of ocaml 4.03 (2449d2fc7) which is enough to trigger the bug on top of -O1
 - unopt.s the assembly generated by gcc for major_gc.c in -O1
 - opt.s the assembly generated by gcc for major_gc.c with opt.patch (bad binary)
enguerrand (reporter)
2017-01-09 12:42

I also attached (even though I can't reproduce under Clang 3.9):
- major_gc_clang3.9_O1.s : major_gc.c compiled on top of 4.03.0 with Clang 3.9 with the -O1 optimisation flag
- major_gc_clang3.9_O2.s : major_gc.c compiled on top of 4.03.0 with Clang 3.9 with the -O2 optimisation flag
shinwell (developer)
2017-01-09 13:09

I am going to make some investigations and report back.
cullmann (reporter)
2017-01-09 17:55

For the record, my machine which can reproduce the crash has that stepping + microcode:

cullmann@neko:~> cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
stepping : 3
microcode : 0x9e
joris (reporter)
2017-01-10 05:09

test.diff attached, containing diff between a good and bad build for a dump of gcc "tree" internal representation (sort of ssa it looks like). Now this is would lead to rule out bug in ocaml because the change is really minimal. It appears it just shortcut testing the loop condition in sweep_slice when the else branch is taken and work has not been updated. That appears to be the only change
joris (reporter)
2017-01-10 11:36
edited on: 2017-01-10 11:36

I believe i have found something interesting. At some point i did a careful review of sweep_slice function and i noticed this line:

      work -= Whsize_hd (hd); (major_gc.c:545)

This macro returns an unsized int because hd is header_t. If i understand C standard correctly it means that work -= size is similar to work = work - size, and the substraction operands will be promoted to unsigned long.

I checked gcc tree SSA dump and it indeed looks like this is what GCC is doing. I tried to replace this line with

      work = work - (intnat) (Whisze_hd (hd));

It does indeed make some difference in SSA tree (just add a cast and properly execute the arithmetic substraction with signed temporaries), but i must admit i fail to understand why it would cause a segfault in this case since generated assembly used signed instruction (notq then addq). It does make some difference in assembly though, the loop condition is reversed:
it uses testq + jle instead of testq + jg, and it uses an additional register. I do not really understand what changes it makes.

That being said i'm trying to reproduce this bug for 2hours and it has not crashed in O2 with this change yet

xleroy (administrator)
2017-01-10 19:12

I admire your ingenuity in searching for GCC miscompilation issues or undefined behaviors in OCaml's sources. Yet, those issues would produce reproducible crashes, which is not the case here. Also, they would not account for the fact that crashes are observed only with Skylake and hyperthreading.
shinwell (developer)
2017-01-11 09:54

I am pursuing independently the possibility of this being a CPU bug.
I agree with what @xleroy said, although having a smaller diff of the assembly code for major_gc.c than you had before would be useful; it looks like you may now be able to obtain that.
frisch (developer)
2017-01-11 10:10

> reproducible crashes

Couldn't there be undefined behaviors at he CPU level, which would lead to non-reproducible situations depending e.g. on physical memory addresses?

Also, it is hard to get a fully reproducible behavior at the OCaml level itself. Simply getting the current pid and printing it to a string (e.g. for logging purpose) can lead to different allocation scheme of the program (depending on the length of the printed pid).
joris (reporter)
2017-01-11 10:15

Honestly, i have found this and i spent some time looking at the assembly produced but it makes no sense to me why it would behave differently. Still, please find attached :

 -major_gc_with_intnat_cast_o2.s built with gcc -O2 but with the previously described (intnat) cast. It has not crashed in 17 hours.
 - major_gc_without_cast_o2.patch the diff between this and the assembly produced by O2 without the patch, which crash.

As far as i can tell it just aligned the .text of the hot loop and use an additional temp register and the operands of work -= Whsize_hd(hd) are reversed. Besides that...
joris (reporter)
2017-01-11 10:37

> reproducible crashes

btw we tried to disable ASLR with setarch x86_64 -R ocamlfind opt... but it didn't help.
Also it is interesting to note that all the crashes happen during marking, apparently visiting corrupted memory as block header. It could be that the problem is always here but we don't see it every time, because sometimes it does not crash but gas complains about invalid input, sometime gas warns and compilation is successful.
It could be that the others time there is not visible effects. I can try diffing the emitted assembly file
schommer (reporter)
2017-01-11 12:40

During our investigation of the crashes I ran ocaml a few times with the undefined behavior sanitizer of clang, as well as clang-check and most warnings/errors reported were because of unaligned memory access.
joris (reporter)
2017-01-12 01:25

So you might have to disregard basically everything i said. I kept the patched binary running in a loop and after respectively 28h and 32h two processes crashed... So it might just be some side effect affecting how often the bug is triggered.
joris (reporter)
2017-01-12 07:58

Just to clarify the cast patch has crashed, -fno-free-vrp has not
shinwell (developer)
2017-01-25 15:31

I'd like to find out whether this problem manifests itself if the execution of the OCaml compiler is pinned to a particular processor core.

You can probably do this on Linux by using "taskset" as a wrapper around ocamlopt.opt or else by altering the runtime (e.g. in asmrun/startup.c) to call sched_setaffinity. Could you try this on a Skylake system to see if it makes the problem go away?

(As far as I understand it the problem has not been reproduced unless hyperthreading is enabled. If that is correct, hyperthreading should also be enabled for this experiment.)
xleroy (administrator)
2017-01-25 16:07

Lucky me, my new workstation at Inria is a Skylake Xeon, 4 cores, 8 threads, so I could play with the original repro case.

Without setting processor affinities: it is easy to reproduce the crash (in a few minutes at most) by running at least 5 copies of the compilation task in parallel. With 4 copies the crash happens but takes much longer. With 3 copies or less I didn't observe it in a couple of hours.

By setting processor affinities, I see the crash in a few minutes with only two compilations run in parallel, provided they are mapped on the same physical core (e.g. logical cores 1 and 5 on my machine). Two parallel runs mapped to different physical cores (e.g. logical cores 1 and 2 on my machine) have been running for 1 hour already without a glitch, but I'll let them run overnight. Finally, I'm also trying two parallel runs mapped to the same logical processor, for reference. More results tomorrow.
xleroy (administrator)
2017-01-26 10:13
edited on: 2017-01-26 10:13

More results from my overnight runs:
- Two runs in parallel, mapped to different physical cores: this crashed eventually but only after 6 hours or so. Note that the machine was not completely idle, so the other logical processors of those physical cores were probably used (lightly).
- Two runs in parallel mapped to the same logical processor: no crash after 12 hours.

shinwell (developer)
2017-03-10 11:24

As an update, this is still being investigated at Intel.
ygrek (reporter)
2017-05-26 20:42

Any chance it is what intel microcode update talks about? [^]

      Likely fix nightmare-level Skylake erratum SKL150. Fortunately,
      either this erratum is very-low-hitting, or gcc/clang/icc/msvc
      won't usually issue the affected opcode pattern and it ends up
      being rare.
      SKL150 - Short loops using both the AH/BH/CH/DH registers and
      the corresponding wide register *may* result in unpredictable
      system behavior. Requires both logical processors of the same
      core (i.e. sibling hyperthreads) to be active to trigger, as
      well as a "complex set of micro-architectural conditions"
joris (reporter)
2017-05-27 03:33

Everything in this description matches this issue. I will have to wait monday to test this though.
Thank you everyone for the time spent on this problem
joris (reporter)
2017-05-29 03:28

microcode update appears to fix the crash (microcode version 0xba). I believe this issue can be closed
shinwell (developer)
2017-05-29 07:54

Interesting. I will see if I can get Intel to confirm this. Let's leave this issue open for the moment.
shinwell (developer)
2017-05-29 10:27

I looked at the code of sweep_slice, which was conjectured to be one of the functions affected (see above, and the attachment opt.s); indeed it appears that perhaps it might trigger the problem. There is a loop with fewer than 64 instructions using both the %ah register and %rax. The use of %ah is probably quite unusual, but GCC is generating it to deal with the GC tag bits inside a header word.
shinwell (developer)
2017-05-29 10:55

By the way the original Intel description is here, on the page numbered 65: [^]
xleroy (administrator)
2017-05-29 14:48
edited on: 2017-05-31 14:04

After updating the microcode on my Xeon E3 Skylake, the test that used to crash in a few minutes has been running for 6 hours without a hiccup. I'll leave it running for a few days, but it looks like the problem is nailed down and fixed.

Update: the test ran for 50 hours and produced no failures.

shinwell (developer)
2017-05-31 10:30

This problem also affects Kaby Lake systems (erratum KBL095); however, I'm unsure if a microcode fix has been released publicly for such systems. The best solution is to disable hyperthreading for the moment.

Yesterday Fred and I experimented on a Kaby Lake machine by changing the generated assembly from GCC for major_gc.c so that it didn't reference registers such as %ah. The problem was not reproducible after that change, whereas it was almost immediately reproducible before.

If there are no further developments by the start of next week, I think we can close this issue.
joris (reporter)
2017-06-02 03:10
edited on: 2017-06-02 03:11

I see you changed the description which should help people searching for this issue in the future.

It should be noted that since the issue is triggered by the major gc, it's not only compiler. Any long running ocaml program has a high chance of triggering this, and it will not always crash. You can get corrupted data in memory and never crashing.

As an example we tried to deploy some tool on a large xeon skylake cluster, several hundred processes. They didn't crash in hours, but very quickly we saw corrupted data being sent over the network/written into the database.

So anyone reading this in the future, don't assume this is only compiler, and don't run critical code on skylake/kaby lake without updating the firmware if you don't want to end up in a nightmarish situation.

shinwell (developer)
2017-06-02 11:28

Agreed, I've updated the title of this issue.
shinwell (developer)
2017-06-09 19:02

Closing this issue as per the above.

- Issue History
Date Modified Username Field Change
2017-01-06 03:39 enguerrand New Issue
2017-01-06 03:39 enguerrand File Added: report.tar.xz
2017-01-06 04:38 joris Note Added: 0017080
2017-01-06 10:46 shinwell Note Added: 0017083
2017-01-06 10:46 shinwell Assigned To => shinwell
2017-01-06 10:46 shinwell Status new => assigned
2017-01-06 11:00 enguerrand Note Added: 0017084
2017-01-06 11:13 shinwell Note Added: 0017085
2017-01-06 11:40 enguerrand Note Added: 0017086
2017-01-06 12:03 shinwell Note Added: 0017087
2017-01-06 12:16 shinwell Note Added: 0017088
2017-01-06 12:22 shinwell Note Added: 0017089
2017-01-06 12:28 shinwell Note Added: 0017090
2017-01-06 12:30 joris Note Added: 0017091
2017-01-06 12:56 joris Note Added: 0017092
2017-01-06 13:01 shinwell Note Added: 0017093
2017-01-06 13:02 shinwell Note Added: 0017094
2017-01-06 13:10 shinwell Note Added: 0017095
2017-01-06 13:17 enguerrand Note Added: 0017096
2017-01-06 13:33 shinwell Note Added: 0017097
2017-01-06 14:23 shinwell Note Added: 0017098
2017-01-06 14:29 frisch Note Added: 0017099
2017-01-06 16:15 enguerrand Note Added: 0017100
2017-01-06 16:55 Armael Note Added: 0017101
2017-01-06 16:56 Armael Note Edited: 0017101 View Revisions
2017-01-06 17:02 Armael Note Added: 0017102
2017-01-06 17:31 Armael Note Added: 0017103
2017-01-06 17:33 Armael Note Edited: 0017103 View Revisions
2017-01-06 17:54 enguerrand Note Added: 0017104
2017-01-07 11:45 joris Note Added: 0017107
2017-01-07 11:46 joris Note Edited: 0017107 View Revisions
2017-01-07 11:46 joris Note Edited: 0017107 View Revisions
2017-01-07 14:36 enguerrand Note Added: 0017108
2017-01-07 15:36 xleroy Note Added: 0017109
2017-01-08 04:52 joris Note Added: 0017110
2017-01-08 07:25 joris Note Added: 0017111
2017-01-08 10:04 xleroy Note Added: 0017112
2017-01-08 10:56 cullmann Note Added: 0017114
2017-01-08 12:44 joris Note Added: 0017115
2017-01-08 12:45 joris Note Edited: 0017115 View Revisions
2017-01-08 12:54 joris Note Edited: 0017115 View Revisions
2017-01-08 13:59 cullmann Note Added: 0017116
2017-01-08 15:14 joris Note Added: 0017117
2017-01-08 15:21 cullmann Note Added: 0017118
2017-01-09 08:01 shinwell Note Added: 0017121
2017-01-09 09:40 shinwell Note Added: 0017122
2017-01-09 09:42 shinwell Note Added: 0017123
2017-01-09 10:05 shinwell Note Added: 0017124
2017-01-09 10:12 enguerrand Note Added: 0017125
2017-01-09 10:59 joris Note Added: 0017126
2017-01-09 11:02 cullmann Note Added: 0017127
2017-01-09 11:16 shinwell Note Added: 0017128
2017-01-09 12:02 joris Note Added: 0017129
2017-01-09 12:03 joris Note Edited: 0017129 View Revisions
2017-01-09 12:18 joris Note Added: 0017130
2017-01-09 12:19 joris Note Edited: 0017130 View Revisions
2017-01-09 12:35 joris File Added: opt.s
2017-01-09 12:36 joris File Added: unopt.s
2017-01-09 12:37 joris File Added: opt.patch
2017-01-09 12:38 enguerrand File Added: major_gc_clang3.9_O1.s
2017-01-09 12:39 enguerrand File Added: major_gc_clang3.9_O2.s
2017-01-09 12:40 joris Note Added: 0017131
2017-01-09 12:42 enguerrand Note Added: 0017132
2017-01-09 13:09 shinwell Note Added: 0017133
2017-01-09 17:55 cullmann Note Added: 0017135
2017-01-10 05:07 joris File Added: test.diff
2017-01-10 05:09 joris Note Added: 0017138
2017-01-10 11:36 joris Note Added: 0017140
2017-01-10 11:36 joris Note Edited: 0017140 View Revisions
2017-01-10 19:12 xleroy Note Added: 0017143
2017-01-11 09:54 shinwell Note Added: 0017146
2017-01-11 10:10 joris File Added: major_gc_without_cast_o2.patch
2017-01-11 10:10 frisch Note Added: 0017147
2017-01-11 10:11 joris File Added: major_gc_with_intnat_cast_o2.s
2017-01-11 10:15 joris Note Added: 0017148
2017-01-11 10:37 joris Note Added: 0017150
2017-01-11 12:40 schommer Note Added: 0017151
2017-01-12 01:25 joris Note Added: 0017154
2017-01-12 07:58 joris Note Added: 0017155
2017-01-25 15:31 shinwell Note Added: 0017191
2017-01-25 16:07 xleroy Note Added: 0017192
2017-01-26 10:13 xleroy Note Added: 0017196
2017-01-26 10:13 xleroy Note Edited: 0017196 View Revisions
2017-02-19 16:46 xleroy Target Version => later
2017-02-19 16:46 xleroy Additional Information Updated View Revisions
2017-02-23 16:35 doligez Category OCaml backend (code generation) => Back end (clambda to assembly)
2017-02-23 16:44 doligez Category Back end (clambda to assembly) => back end (clambda to assembly)
2017-03-10 11:24 shinwell Note Added: 0017615
2017-05-26 20:42 ygrek Note Added: 0017815
2017-05-27 03:33 joris Note Added: 0017816
2017-05-29 03:28 joris Note Added: 0017824
2017-05-29 07:54 shinwell Note Added: 0017825
2017-05-29 10:27 shinwell Note Added: 0017826
2017-05-29 10:55 shinwell Note Added: 0017827
2017-05-29 14:48 xleroy Note Added: 0017828
2017-05-29 14:52 xleroy Note Edited: 0017828 View Revisions
2017-05-31 10:30 shinwell Note Added: 0017834
2017-05-31 14:04 xleroy Note Edited: 0017828 View Revisions
2017-05-31 14:06 xleroy Summary Compiler segfault on large generated source code => Compiler segfault on Intel Skylake / Kaby Lake processors with hyperthreading enabled [was: on large generated source code]
2017-06-02 03:10 joris Note Added: 0017837
2017-06-02 03:11 joris Note Edited: 0017837 View Revisions
2017-06-02 11:27 shinwell Summary Compiler segfault on Intel Skylake / Kaby Lake processors with hyperthreading enabled [was: on large generated source code] => Segfaults or wrong code execution on Intel Skylake / Kaby Lake CPUs with hyperthreading enabled
2017-06-02 11:28 shinwell Note Added: 0017838
2017-06-09 19:02 shinwell Note Added: 0017863
2017-06-09 19:02 shinwell Status assigned => closed
2017-06-09 19:02 shinwell Resolution open => no change required

Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker