Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0007452OCamlOCaml backend (code generation)public2017-01-06 03:392017-01-12 07:58
Reporterenguerrand 
Assigned Toshinwell 
PrioritynormalSeveritycrashReproducibilitysometimes
StatusassignedResolutionopen 
PlatformLinuxOSDebianOS Version
Product Version4.03.0 
Target VersionFixed in Version 
Summary0007452: Compiler segfault on large generated source code
DescriptionWhile switching a 4.02.3 codebase to 4.03 recently, we stumbled upon some random crashes from the compiler, and more rarely, occurrences of bad assembly code being generated (which as failed to compile), or instruction being trapped at runtime while the compiler is running.

Those problems occurs on an OCaml source file generated using the Extprot library.

The problem doesn't seems to happen all the time.
Most of the time, the file will compile successfully, and if enough retries are given, the compiler will then crash, example of returns from dmesg after a few crashes:

[22241.838551] ocamlopt.opt[48175]: segfault at ffffffffffde7768 ip 000055f75e412e3c sp 00007ffc3ee31de0 error 7 in ocamlopt.opt[55f75e0b6000+613000]
[22985.879907] ocamlopt.opt[48221]: segfault at af8 ip 00005564455169bd sp 00007ffc9f36b130 error 4 in ocamlopt.opt[556445006000+613000]
[23936.341126] ocamlopt.opt[48306]: segfault at 5837 ip 00005641554a16c8 sp 00007ffe1278f8e0 error 4 in ocamlopt.opt[56415514a000+613000]
[25395.780978] ocamlopt.opt[48445]: segfault at ffffffffffde7608 ip 0000557e25ea5cf4 sp 00007ffc2eac79d0 error 5 in ocamlopt.opt[557e25b49000+613000]

Backtraces obtained for those crashes give us informations which doesn't seems to show always the same thing. Example backtraces can be found in the attached archive.

The compiler will more rarely generated an assembly file that as won't be able to compile:

/tmp/camlasmc92578.s: Assembler messages:
/tmp/camlasmc92578.s:1005308: Error: operand type mismatch for `add'

Where the line 1005308 is: add $2300, $5199

Or:

/tmp/camlasm601e1c.s: Assembler messages:
/tmp/camlasm601e1c.s:820172: Error: operand type mismatch for `or'

Where the line 820172 is: orq $139950828249720, %rax

We haven't noticed as of now any misbehaviour in a successfully compiled and running instance of this file, but the issue is still very new for us so we will be watching it closely.
Steps To ReproduceThe problem doesn't seems to happen all the time, at least it doesn't crash at every build. We sometimes don't witness the crash before 30 minutes of retries.

Steps to reproduce:

OCaml 4.03 and 4.04 has been witnessed as triggering the problem.
Sample file is attached as the test case used to reproduce the problem: Extprot library must be installed in order to compile the file, since it was generated using Extprot. (we use the latest version from Opam)

Test case can be found in the attachment (test.ml)

To reproduce:
Just compile this file, preferably in a loop, with this command:

while ocamlfind opt -c -g -bin-annot -ccopt -g -ccopt -O2 -ccopt -Wextra -ccopt '-Wstrict-overflow=5' -thread -w +a-4-40..42-44-45-48-58 -w -27-32 -package extprot test.ml -o test.cmx; do echo "ok"; done

Additional Information
- If the crash doesn't occur for some time, after it occured again at least once, the probability of the compiler crashing seems to be increasing

- Crash was witnessed running ocamlopt and ocamlopt.opt
TagsNo tags attached.
Attached Files? file icon report.tar.xz [^] (86,172 bytes) 2017-01-06 03:39
? file icon opt.s [^] (97,082 bytes) 2017-01-09 12:35
? file icon unopt.s [^] (100,286 bytes) 2017-01-09 12:36
patch file icon opt.patch [^] (1,260 bytes) 2017-01-09 12:37 [Show Content]
? file icon major_gc_clang3.9_O1.s [^] (51,661 bytes) 2017-01-09 12:38
? file icon major_gc_clang3.9_O2.s [^] (58,487 bytes) 2017-01-09 12:39
diff file icon test.diff [^] (1,443 bytes) 2017-01-10 05:07 [Show Content]
patch file icon major_gc_without_cast_o2.patch [^] (31,524 bytes) 2017-01-11 10:10 [Show Content]
? file icon major_gc_with_intnat_cast_o2.s [^] (38,175 bytes) 2017-01-11 10:11

- Relationships

-  Notes
(0017080)
joris (reporter)
2017-01-06 04:38

Traces are almost useless, but the memory corruption happens as frequently in byte code, with similar traces, making the runtime crash in GC or in runtime:

corrupted stack:

 Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x000055555557d336 in caml_interprete (prog=<optimized out>,
    prog_size=<optimized out>) at interp.c:298
298 accu = sp[0]; Next;
(gdb) bt
#0 0x000055555557d336 in caml_interprete (prog=<optimized out>,
    prog_size=<optimized out>) at interp.c:298
Backtrace stopped: Cannot access memory at address 0x7fffffffda68

Corrupted heap:

#0 0x000055687aa72534 in mark_slice_darken (slice_pointers=<synthetic pointer>,
    in_ephemeron=0, i=11803660, v=140605172023200, gray_vals_ptr=0x55687c7fead8)
    at major_gc.c:237
#1 mark_slice (work=853950) at major_gc.c:407
#2 0x000055687aa73384 in caml_major_collection_slice (howmuch=howmuch@entry=-1)
    at major_gc.c:767
0000003 0x000055687aa743ff in caml_gc_dispatch () at minor_gc.c:458
0000004 0x000055687aa8b807 in caml_interprete (prog=<optimized out>,
    prog_size=<optimized out>) at interp.c:673
0000005 0x000055687aa8cdf9 in caml_main (argv=0x7ffe02927368) at startup.c:374
0000006 0x000055687aa71b1c in main (argc=<optimized out>, argv=<optimized out>)
    at main.c:35

this looks like a buffer overrun while walking a large block, but the block is strange as far as i can tell:

(gdb) p (unsigned char)*(((uint64_t *)v) - 1)
$40 = 200 '\310'
(gdb) p *(((uint64_t *)v) - 1) >> 10
$45 = 137309738338
(0017083)
shinwell (developer)
2017-01-06 10:46

I will see if I can reproduce this.

In the meantime, have you reproduced this failure on more than one machine?
(0017084)
enguerrand (reporter)
2017-01-06 11:00

Reproduced it on a few differents machines (all running Debian 64bits), either VM or physical.
Still haven't tried under something else than Linux, I will check if I can reproduce under OSX
(0017085)
shinwell (developer)
2017-01-06 11:13

Some subset (although not all) of the symptoms exhibited in this report kind of look like stack overflow. To rule out such problems, can you try to reproduce this having run "ulimit -s unlimited" first? (I'm also trying to reproduce it but don't know yet whether I will be successful in doing so.)
(0017086)
enguerrand (reporter)
2017-01-06 11:40

We suspected stack overflow too and we tried to reproduce with a very large stack size and unlimited, and crashes still happened. I tried again just now just to be sure and the result is still the same
(0017087)
shinwell (developer)
2017-01-06 12:03

Right, that's what I was expecting.
(0017088)
shinwell (developer)
2017-01-06 12:16

I wonder if it's because the parameter called "i" of the function mark_slice_darken is of type "int". I think it should be of type "mlsize_t" since it's a field index. I wouldn't be surprised if this gargantuan source file produces a block that has sufficiently many fields for "i" to overflow at the moment.

Can you try changing that in your compiler tree (in 4.04 it's in byterun/major.gc line 232) and seeing if the problem goes away? I haven't managed to reproduce it yet.
(0017089)
shinwell (developer)
2017-01-06 12:22

(I will produce a GPR once you confirm)
(0017090)
shinwell (developer)
2017-01-06 12:28

Although actually, if that were to be the bug, I think ocamlopt.opt would use more memory than it does when compiling the file. So that might not be it---but I think it's wrong in any case.
(0017091)
joris (reporter)
2017-01-06 12:30

Indeed. I've launch some test to try but since this is scanning a in memory block and since ocamlopt never use more than 1.5G RES in this case i doubt it can overflow i
(0017092)
joris (reporter)
2017-01-06 12:56

Reproduced with i changed to mlsize_t, crash in ml_mark_slice, but i don't have many more info i forgot to build the runtime with -g this time
(0017093)
shinwell (developer)
2017-01-06 13:01

Given you seem to be able to reproduce this easily, can you try to get a valgrind trace?
(0017094)
shinwell (developer)
2017-01-06 13:02

(Please build the runtime with -g before doing that, in case it gives any more info)
(0017095)
shinwell (developer)
2017-01-06 13:10

In fact, another thing to try: please adjust the compiler Makefile so that it uses the debug runtime (you need a configure flag, and then I think it's "-runtime-variant d" for compile/link). This might pick something up as it enables a lot of assertions.
(0017096)
enguerrand (reporter)
2017-01-06 13:17

I will give Valgrind a bit later.
It's not that easy to reproduce even on our end, we sometimes wait 30 minutes with 5 concurrent loops to see it happen, its kind of erratic
(0017097)
shinwell (developer)
2017-01-06 13:33

OK, well maybe try the debug runtime first then, since that's probably going to be a bit easier to set up.
(0017098)
shinwell (developer)
2017-01-06 14:23

It's possible this is something related to the change between 4.02 and 4.03 that allowed the major GC to stop scanning in the middle of blocks and defer the remaining fields until later.

I think it would be worth trying to disable this, since you do seem to have some large blocks (one of the backtraces shows a block with > 5 million fields). This should be achievable by changing byterun/major_gc.c, line 403 (in 4.04): this currently reads "end = size < end ? size : end;" and I think you should change it to "end = size".
(0017099)
frisch (developer)
2017-01-06 14:29

Could this be related to 0007228 / GPR#546 ?
(0017100)
enguerrand (reporter)
2017-01-06 16:15

We tried to reproduce the issue on some server to ease the compilation time and we noticed that we couldn't reproduce the issue (as of now after running multiple loops for one hour or so).
We also noticed that every machine on which we were able to reproduce the issue was running a CPU of the Intel Skylake family.
Does someone have such a machine available to test this hypothesis ?
(0017101)
Armael (reporter)
2017-01-06 16:55
edited on: 2017-01-06 16:56

I can reproduce the bug on my laptop, which has a Skylake CPU (i7-6600U).

A single instance of the loop ran without crashing for around 20 minutes. However, shortly after adding 3 more instances in parallel, two of them crashed with "ocamlopt.opt got signal and exited". The one which was running from the start also crashed quickly after that, with "Fatal error: exception File "utils/timings.ml", line 86, characters 27-33: Assertion failed".

Last relevant lines in dmesg are:

[81744.710293] ocamlopt.opt[348]: segfault at 7fe040f1b000 ip 00000000006b1b04 sp 00007fff9f242bb0 error 4 in ocamlopt.opt[400000+36b000]
[81762.064912] ocamlopt.opt[359]: segfault at fffffffffffffff9 ip 00000000004ba582 sp 00007ffd32705ea0 error 5 in ocamlopt.opt[400000+36b000]

(0017102)
Armael (reporter)
2017-01-06 17:02

I just realized I was running all the parallel instances of the compiler in the same directory, on the same source file. So that's probably the cause of the assertion failure.
(0017103)
Armael (reporter)
2017-01-06 17:31
edited on: 2017-01-06 17:33

Just to be sure, I tried with 4 instances of the loop in 4 separate directories. 3 segfaulted almost instantly (after around 15 seconds):

[84373.677406] ocamlopt.opt[1293]: segfault at ffffffffffde530b ip 00000000004b2b34 sp 00007ffd998b1280 error 7 in ocamlopt.opt[400000+36b000]
[84404.445402] ocamlopt.opt[1315]: segfault at b18 ip 00000000004c45a7 sp 00007ffcf29d0fc0 error 4 in ocamlopt.opt[400000+36b000]
[84412.083689] ocamlopt.opt[1325]: segfault at 7f6cb22c3000 ip 00000000006b1b04 sp 00007ffed67919e0 error 4 in ocamlopt.opt[400000+36b000]

The last one finally crashed 2 minutes after:
[84471.527980] ocamlopt.opt[1298]: segfault at 469b61 ip 00007f3485db380b sp 00007ffce2ad6018 error 7 in libc-2.24.so[7f3485c8b000+195000]

(0017104)
enguerrand (reporter)
2017-01-06 17:54

joris mentioned the possibility that it might be related to the runtime being compiled with -O2 instead of -O1 since 4.03, testing 4.03 with -O1 might be interesting too
(0017107)
joris (reporter)
2017-01-07 11:45
edited on: 2017-01-07 11:46

I indeed cannot reproduce after 1h with a runtime built with -O1, while i can reproduce in less than 10 minutes with an -O2 runtime (built with gcc 6.2)
This smells like an UB which is wrongly optimized by gcc and causing issues on newer intel CPUs.
I also tried to build the runtime with -fsanitize=address and -fsanitize=undefined. Interestingly i fail to reproduce with this binary, but it is also not reporting interesting things besides the caml_stat_alloc "leak" and some unaligned reads at startup in roots.c (which should be harmless on x86). This is testing on a i7 6700 (no K) Skylake

(0017108)
enguerrand (reporter)
2017-01-07 14:36

I confirm that after compiling the runtime with -O1 I cannot reproduce after a few hours of retries. (gcc 6.2 too)
(0017109)
xleroy (administrator)
2017-01-07 15:36

> We also noticed that every machine on which we were able to reproduce the issue was running a CPU of the Intel Skylake family.

Last Spring, another OCaml (industrial) user reported mysterious semireproducible crashes of a big ocamlopt-compiled program. The crashes would occur only on Skylake processors, and only in the presence of hyperthreading.

If it is possible for you, it would be interesting to turn hyperthreading off (in the BIOS) and try to reproduce again.
(0017110)
joris (reporter)
2017-01-08 04:52

Meh. I cannot reproduce with HT disable indeed after one hour of 4 loops running concurrently...
(0017111)
joris (reporter)
2017-01-08 07:25

For the record I upgraded the uefi firmware and the Intel microcode to latest version it makes no difference.
(0017112)
xleroy (administrator)
2017-01-08 10:04

Thanks for the quick re-test without hyperthreading. This story is consistent with what was observed at the other industrial user in May. Based on their observations and those in this PR, the problem lies in the combination of:
- Skylake
- Hyperthreading
- OCaml 4.03
- compiled with gcc -O2

Is it crazy to imagine that gcc -O2 on the OCaml 4.03 runtime produces a specific instruction sequence that causes hardware issues in (some steppings of) Skylake processors with hyperthreading? Perhaps it is crazy. On the other hand, there was already one documented hardware issue with hyperthreading and Skylake: http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-causes-pcs-to-freeze-during-complex-workloads/ [^]
(0017114)
cullmann (reporter)
2017-01-08 10:56

To solve our issue in May, we went over from using clang instead of GCC as the base compiler for OCaml and the other parts of our toolchain.

Since that switch, no such random crash came up (and we have one Skylake machine that runs longer regression tests in an endless loop the whole year, nothing to be seen after clang usage, daily crashs before)

The question is if it is feasible for you to:

a) try clang, too
b) if that works, try to search for the difference in e.g. the produced assembly
(0017115)
joris (reporter)
2017-01-08 12:44
edited on: 2017-01-08 12:54

Whether an hardware bug is a crazy assumption, i don't know though indeed the HT condition is suspicious (Although it could have some other side effects like reducing cache pressure etc).

Cullmann thank you for your feedback, I will try clang later.

I have another piece of information to add : the problem seems to lie in one of the compiled symbols from the major gc. I tried to "bisect" the problematic files by building all runtime files in O2 one by one. Since the crashes seemed to happen during a major marking most of the time i started with major_gc.c and :

 - I can reproduce with runtime built with -O1 and only major_gc.c built with -O2
 - I cannot reproduce with runtime built with -O2 and and only major_gc.c built with -O1
 - I used minor_gc.c and memory.c as test samples and replicated the same test conditions. They behave "as expected", that is do not modify the result (crash when the other files are in O2, cannot reproduce when the other files are in O1)

(0017116)
cullmann (reporter)
2017-01-08 13:59

Btw., I just tried your report.tar.xz, on my home machine, a Skylake Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz, HT on.

4 parallel runs (in different directories) =>

/tmp/camlasm83d2dc.s: Assembler messages:
/tmp/camlasm83d2dc.s:315401: Error: operand type mismatch for `add'
File "test.ml", line 1:
Error: Assembler error, input left in file /tmp/camlasm83d2dc.s

in one of the endless loops after some hours.

(I used opam to get ocaml and the needed ext... lib, 4.03.0)
(0017117)
joris (reporter)
2017-01-08 15:14

Thanks for testing. Hours is a long time, for my last test i don't wait for more than one hour before deciding the bug does not reproduce. I hope it does not invalidate my results. On the other hand, for cases where the bug is triggers it happens almost immediately for me, less than 2 minutes, and after that it happens often.

Anyway, following advises on IRC i refined the bisection logic at function level. It looks like the bug only happens when both sweep_slice and caml_major_collection_slice are optimized with O2 using __attribute__ ((optimize(2, "no-strict-aliasing")))

Diffing the assembly is starting to be far beyond my skill set, but the obvious difference is that it uses SIMD instructions and xmm registers at O2
(0017118)
cullmann (reporter)
2017-01-08 15:21

Interesting, could you attach the assemblies O1/O2 for the two functions?
Perhaps others can derive more from it.
(0017121)
shinwell (developer)
2017-01-09 08:01

I will produce and have a look at the diff of the assembly code for those functions this morning.
(0017122)
shinwell (developer)
2017-01-09 09:40

@jorris @enguerrand Can you confirm which version of the microcode you are running on systems which exhibit this bug? It should be listed in /proc/cpuinfo on Linux.

Could you also find out whether it would be possible to update the microcode on one of these systems to the very latest version? This might be doable by updating your BIOS (depending on whether your motherboard manufacturer has released a BIOS containing the updated microcode, which at least one apparently has) or else by using some distribution-specific mechanism.

The Intel errata says that the Skylake bug that causes freezes can also cause "undefined system behaviour". I haven't managed to find any reference on Google to an instance of this bug that did something other than freezing. However it looks like we might have such an instance ourselves here. It would be very interesting to determine whether the microcode update solves the problem. If it does not I think I will file a bug with Intel.
(0017123)
shinwell (developer)
2017-01-09 09:42

Sorry, I wasn't very explicit in that last post: the freezing erratum has apparently been fixed by the latest microcode (according to Intel) although there appears to be some suggestion in the last comment on <https://communities.intel.com/thread/96157?start=30&tstart=0> [^] that some problem may still persist.
(0017124)
shinwell (developer)
2017-01-09 10:05

(I missed @joris's post saying that he'd tried this microcode update on one box, but we still need to confirm what version number that was)
(0017125)
enguerrand (reporter)
2017-01-09 10:12

Two macbooks encountering the issues seems to be using the latest microcode available, cpu infos:

> sysctl -a | grep machdep.cpu
machdep.cpu.tsc_ccc.denominator: 2
machdep.cpu.tsc_ccc.numerator: 226
machdep.cpu.thread_count: 8
machdep.cpu.core_count: 4
machdep.cpu.address_bits.virtual: 48
machdep.cpu.address_bits.physical: 39
machdep.cpu.tlb.data.small_level1: 64
machdep.cpu.tlb.data.small: 64
machdep.cpu.tlb.inst.large: 8
machdep.cpu.cache.size: 256
machdep.cpu.cache.L2_associativity: 4
machdep.cpu.cache.linesize: 64
machdep.cpu.arch_perf.fixed_width: 48
machdep.cpu.arch_perf.fixed_number: 3
machdep.cpu.arch_perf.events: 0
machdep.cpu.arch_perf.events_number: 7
machdep.cpu.arch_perf.width: 48
machdep.cpu.arch_perf.number: 4
machdep.cpu.arch_perf.version: 4
machdep.cpu.xsave.extended_state1: 15 832 256 0
machdep.cpu.xsave.extended_state: 31 832 1088 0
machdep.cpu.thermal.energy_policy: 1
machdep.cpu.thermal.hardware_feedback: 0
machdep.cpu.thermal.package_thermal_intr: 1
machdep.cpu.thermal.fine_grain_clock_mod: 1
machdep.cpu.thermal.core_power_limits: 1
machdep.cpu.thermal.ACNT_MCNT: 1
machdep.cpu.thermal.thresholds: 2
machdep.cpu.thermal.invariant_APIC_timer: 1
machdep.cpu.thermal.dynamic_acceleration: 1
machdep.cpu.thermal.sensor: 1
machdep.cpu.mwait.sub_Cstates: 286531872
machdep.cpu.mwait.extensions: 3
machdep.cpu.mwait.linesize_max: 64
machdep.cpu.mwait.linesize_min: 64
machdep.cpu.processor_flag: 5
machdep.cpu.microcode_version: 174
machdep.cpu.cores_per_package: 8
machdep.cpu.logical_per_package: 16
machdep.cpu.extfeatures: SYSCALL XD 1GBPAGE EM64T LAHF LZCNT PREFETCHW RDTSCP TSCI
machdep.cpu.leaf7_features: SMEP ERMS RDWRFSGS TSC_THREAD_OFFSET BMI1 HLE AVX2 BMI2 INVPCID RTM SMAP RDSEED ADX IPT SGX FPU_CSDS MPX CLFSOPT
machdep.cpu.features: FPU VME DE PSE TSC MSR PAE MCE CX8 APIC SEP MTRR PGE MCA CMOV PAT PSE36 CLFSH DS ACPI MMX FXSR SSE SSE2 SS HTT TM PBE SSE3 PCLMULQDQ DTES64 MON DSCPL VMX SMX EST TM2 SSSE3 FMA CX16 TPR PDCM SSE4.1 SSE4.2 x2APIC MOVBE POPCNT AES PCID XSAVE OSXSAVE SEGLIM64 TSCTMR AVX1.0 RDRAND F16C
machdep.cpu.brand: 0
machdep.cpu.signature: 329443
machdep.cpu.extfeature_bits: 1241984796928
machdep.cpu.leaf7_feature_bits: 43806655
machdep.cpu.feature_bits: 9221960262849657855
machdep.cpu.stepping: 3
machdep.cpu.extfamily: 0
machdep.cpu.extmodel: 5
machdep.cpu.model: 94
machdep.cpu.family: 6
machdep.cpu.brand_string: Intel(R) Core(TM) i7-6820HQ CPU @ 2.70GHz
machdep.cpu.vendor: GenuineIntel
machdep.cpu.max_ext: 2147483656
machdep.cpu.max_basic: 22
(0017126)
joris (reporter)
2017-01-09 10:59

and on my office machine microcode 0x76, which contains the fix. I upgraded to latest version at home which i cannot give right now. In any case i find it suspicious that it could be skylake erratum 121 because it involves AVX. The only thing which can add AVX to ocamlopt is glibc, i can try to build statically to confirm without avx if you want.
Also, i observed nearly one hundred crashes by now and none of them leaked out of the process address space, they didn't affect any other system component.
I found something interesting i think, might report this in an hour once i'm confident enough in the results
(0017127)
cullmann (reporter)
2017-01-09 11:02

We did all microcode updates, too, and we still got the crash, and the same behavior, all other things work on the machines (and we have some) beside OCaml, which is strange, as other things use much more AVX, like the glibc things and we did extensive prime95 runs on some machines which lead to no issues.
(0017128)
shinwell (developer)
2017-01-09 11:16

I was suspecting it might be related to erratum SKL082, not SKL121. I haven't managed to find anything very definitive about what triggers 082, but it sounds like it's probably floating point code (not necessarily AVX). See page 47 of:

http://www.intel.co.uk/content/dam/www/public/us/en/documents/specification-updates/desktop-6th-gen-core-family-spec-update.pdf [^]
(0017129)
joris (reporter)
2017-01-09 12:02
edited on: 2017-01-09 12:03

I spent some time looking at the set of enabled optimizations at -O2, and checking for interesting one, then trying to enable/disable looking at what impact it had on assembly. After refining a set of interesting optimization i performed an exhaustive check, and it appears Value Range Propagation is enough to trigger the issue.

It appears that :

a) when runtime is built with -O2 -fno-tree-vrp the crash does not manifest after one hour of testing
b) when runtime is built with -O2 and sweep_slice and major_collection_slice are built in O2 with no-tree-vrp the crash does not manifest
c) building the runtime in -O1 with sweep_slice and major_collection_slice built in O1 -ftree-vrp is enough to reproduce.

Could it be some range check bound assumptions being wrongly propagated down to the sweep phase ? i didn't spot anything obvious yet, i will upload assembly diff.

I only noticed that at major_gc.c:764 (4.03 branch)
    computed_work = (intnat) (p * caml_stat_heap_wsz * 5 / 3);

there is no cast to double and this branch is taken in sweep_slice. But it should be harmless as far as i know since p is a double and i fail to see what problem it might cause later.

(0017130)
joris (reporter)
2017-01-09 12:18
edited on: 2017-01-09 12:19

I just realized that this is line is a small variation of the issue Alain Frisch posted two days ago...

(0017131)
joris (reporter)
2017-01-09 12:40

I attached :
 - opt.patch, a diff on top of ocaml 4.03 (2449d2fc7) which is enough to trigger the bug on top of -O1
 - unopt.s the assembly generated by gcc for major_gc.c in -O1
 - opt.s the assembly generated by gcc for major_gc.c with opt.patch (bad binary)
(0017132)
enguerrand (reporter)
2017-01-09 12:42

I also attached (even though I can't reproduce under Clang 3.9):
- major_gc_clang3.9_O1.s : major_gc.c compiled on top of 4.03.0 with Clang 3.9 with the -O1 optimisation flag
- major_gc_clang3.9_O2.s : major_gc.c compiled on top of 4.03.0 with Clang 3.9 with the -O2 optimisation flag
(0017133)
shinwell (developer)
2017-01-09 13:09

I am going to make some investigations and report back.
(0017135)
cullmann (reporter)
2017-01-09 17:55

For the record, my machine which can reproduce the crash has that stepping + microcode:

cullmann@neko:~> cat /proc/cpuinfo
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 94
model name : Intel(R) Core(TM) i7-6700K CPU @ 4.00GHz
stepping : 3
microcode : 0x9e
(0017138)
joris (reporter)
2017-01-10 05:09

test.diff attached, containing diff between a good and bad build for a dump of gcc "tree" internal representation (sort of ssa it looks like). Now this is would lead to rule out bug in ocaml because the change is really minimal. It appears it just shortcut testing the loop condition in sweep_slice when the else branch is taken and work has not been updated. That appears to be the only change
(0017140)
joris (reporter)
2017-01-10 11:36
edited on: 2017-01-10 11:36

I believe i have found something interesting. At some point i did a careful review of sweep_slice function and i noticed this line:

      work -= Whsize_hd (hd); (major_gc.c:545)

This macro returns an unsized int because hd is header_t. If i understand C standard correctly it means that work -= size is similar to work = work - size, and the substraction operands will be promoted to unsigned long.

I checked gcc tree SSA dump and it indeed looks like this is what GCC is doing. I tried to replace this line with

      work = work - (intnat) (Whisze_hd (hd));

It does indeed make some difference in SSA tree (just add a cast and properly execute the arithmetic substraction with signed temporaries), but i must admit i fail to understand why it would cause a segfault in this case since generated assembly used signed instruction (notq then addq). It does make some difference in assembly though, the loop condition is reversed:
it uses testq + jle instead of testq + jg, and it uses an additional register. I do not really understand what changes it makes.

That being said i'm trying to reproduce this bug for 2hours and it has not crashed in O2 with this change yet

(0017143)
xleroy (administrator)
2017-01-10 19:12

I admire your ingenuity in searching for GCC miscompilation issues or undefined behaviors in OCaml's sources. Yet, those issues would produce reproducible crashes, which is not the case here. Also, they would not account for the fact that crashes are observed only with Skylake and hyperthreading.
(0017146)
shinwell (developer)
2017-01-11 09:54

I am pursuing independently the possibility of this being a CPU bug.
I agree with what @xleroy said, although having a smaller diff of the assembly code for major_gc.c than you had before would be useful; it looks like you may now be able to obtain that.
(0017147)
frisch (developer)
2017-01-11 10:10

> reproducible crashes

Couldn't there be undefined behaviors at he CPU level, which would lead to non-reproducible situations depending e.g. on physical memory addresses?

Also, it is hard to get a fully reproducible behavior at the OCaml level itself. Simply getting the current pid and printing it to a string (e.g. for logging purpose) can lead to different allocation scheme of the program (depending on the length of the printed pid).
(0017148)
joris (reporter)
2017-01-11 10:15

Honestly, i have found this and i spent some time looking at the assembly produced but it makes no sense to me why it would behave differently. Still, please find attached :

 -major_gc_with_intnat_cast_o2.s built with gcc -O2 but with the previously described (intnat) cast. It has not crashed in 17 hours.
 - major_gc_without_cast_o2.patch the diff between this and the assembly produced by O2 without the patch, which crash.

As far as i can tell it just aligned the .text of the hot loop and use an additional temp register and the operands of work -= Whsize_hd(hd) are reversed. Besides that...
(0017150)
joris (reporter)
2017-01-11 10:37

> reproducible crashes

btw we tried to disable ASLR with setarch x86_64 -R ocamlfind opt... but it didn't help.
Also it is interesting to note that all the crashes happen during marking, apparently visiting corrupted memory as block header. It could be that the problem is always here but we don't see it every time, because sometimes it does not crash but gas complains about invalid input, sometime gas warns and compilation is successful.
It could be that the others time there is not visible effects. I can try diffing the emitted assembly file
(0017151)
schommer (reporter)
2017-01-11 12:40

During our investigation of the crashes I ran ocaml a few times with the undefined behavior sanitizer of clang, as well as clang-check and most warnings/errors reported were because of unaligned memory access.
(0017154)
joris (reporter)
2017-01-12 01:25

So you might have to disregard basically everything i said. I kept the patched binary running in a loop and after respectively 28h and 32h two processes crashed... So it might just be some side effect affecting how often the bug is triggered.
(0017155)
joris (reporter)
2017-01-12 07:58

Just to clarify the cast patch has crashed, -fno-free-vrp has not

- Issue History
Date Modified Username Field Change
2017-01-06 03:39 enguerrand New Issue
2017-01-06 03:39 enguerrand File Added: report.tar.xz
2017-01-06 04:38 joris Note Added: 0017080
2017-01-06 10:46 shinwell Note Added: 0017083
2017-01-06 10:46 shinwell Assigned To => shinwell
2017-01-06 10:46 shinwell Status new => assigned
2017-01-06 11:00 enguerrand Note Added: 0017084
2017-01-06 11:13 shinwell Note Added: 0017085
2017-01-06 11:40 enguerrand Note Added: 0017086
2017-01-06 12:03 shinwell Note Added: 0017087
2017-01-06 12:16 shinwell Note Added: 0017088
2017-01-06 12:22 shinwell Note Added: 0017089
2017-01-06 12:28 shinwell Note Added: 0017090
2017-01-06 12:30 joris Note Added: 0017091
2017-01-06 12:56 joris Note Added: 0017092
2017-01-06 13:01 shinwell Note Added: 0017093
2017-01-06 13:02 shinwell Note Added: 0017094
2017-01-06 13:10 shinwell Note Added: 0017095
2017-01-06 13:17 enguerrand Note Added: 0017096
2017-01-06 13:33 shinwell Note Added: 0017097
2017-01-06 14:23 shinwell Note Added: 0017098
2017-01-06 14:29 frisch Note Added: 0017099
2017-01-06 16:15 enguerrand Note Added: 0017100
2017-01-06 16:55 Armael Note Added: 0017101
2017-01-06 16:56 Armael Note Edited: 0017101 View Revisions
2017-01-06 17:02 Armael Note Added: 0017102
2017-01-06 17:31 Armael Note Added: 0017103
2017-01-06 17:33 Armael Note Edited: 0017103 View Revisions
2017-01-06 17:54 enguerrand Note Added: 0017104
2017-01-07 11:45 joris Note Added: 0017107
2017-01-07 11:46 joris Note Edited: 0017107 View Revisions
2017-01-07 11:46 joris Note Edited: 0017107 View Revisions
2017-01-07 14:36 enguerrand Note Added: 0017108
2017-01-07 15:36 xleroy Note Added: 0017109
2017-01-08 04:52 joris Note Added: 0017110
2017-01-08 07:25 joris Note Added: 0017111
2017-01-08 10:04 xleroy Note Added: 0017112
2017-01-08 10:56 cullmann Note Added: 0017114
2017-01-08 12:44 joris Note Added: 0017115
2017-01-08 12:45 joris Note Edited: 0017115 View Revisions
2017-01-08 12:54 joris Note Edited: 0017115 View Revisions
2017-01-08 13:59 cullmann Note Added: 0017116
2017-01-08 15:14 joris Note Added: 0017117
2017-01-08 15:21 cullmann Note Added: 0017118
2017-01-09 08:01 shinwell Note Added: 0017121
2017-01-09 09:40 shinwell Note Added: 0017122
2017-01-09 09:42 shinwell Note Added: 0017123
2017-01-09 10:05 shinwell Note Added: 0017124
2017-01-09 10:12 enguerrand Note Added: 0017125
2017-01-09 10:59 joris Note Added: 0017126
2017-01-09 11:02 cullmann Note Added: 0017127
2017-01-09 11:16 shinwell Note Added: 0017128
2017-01-09 12:02 joris Note Added: 0017129
2017-01-09 12:03 joris Note Edited: 0017129 View Revisions
2017-01-09 12:18 joris Note Added: 0017130
2017-01-09 12:19 joris Note Edited: 0017130 View Revisions
2017-01-09 12:35 joris File Added: opt.s
2017-01-09 12:36 joris File Added: unopt.s
2017-01-09 12:37 joris File Added: opt.patch
2017-01-09 12:38 enguerrand File Added: major_gc_clang3.9_O1.s
2017-01-09 12:39 enguerrand File Added: major_gc_clang3.9_O2.s
2017-01-09 12:40 joris Note Added: 0017131
2017-01-09 12:42 enguerrand Note Added: 0017132
2017-01-09 13:09 shinwell Note Added: 0017133
2017-01-09 17:55 cullmann Note Added: 0017135
2017-01-10 05:07 joris File Added: test.diff
2017-01-10 05:09 joris Note Added: 0017138
2017-01-10 11:36 joris Note Added: 0017140
2017-01-10 11:36 joris Note Edited: 0017140 View Revisions
2017-01-10 19:12 xleroy Note Added: 0017143
2017-01-11 09:54 shinwell Note Added: 0017146
2017-01-11 10:10 joris File Added: major_gc_without_cast_o2.patch
2017-01-11 10:10 frisch Note Added: 0017147
2017-01-11 10:11 joris File Added: major_gc_with_intnat_cast_o2.s
2017-01-11 10:15 joris Note Added: 0017148
2017-01-11 10:37 joris Note Added: 0017150
2017-01-11 12:40 schommer Note Added: 0017151
2017-01-12 01:25 joris Note Added: 0017154
2017-01-12 07:58 joris Note Added: 0017155


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker