Segfaults or wrong code execution on Intel Skylake / Kaby Lake CPUs with hyperthreading enabled #7452
Comments
Comment author: joris Traces are almost useless, but the memory corruption happens as frequently in byte code, with similar traces, making the runtime crash in GC or in runtime: corrupted stack: Program terminated with signal SIGSEGV, Segmentation fault. Corrupted heap: #0 0x000055687aa72534 in mark_slice_darken (slice_pointers=, this looks like a buffer overrun while walking a large block, but the block is strange as far as i can tell: (gdb) p (unsigned char)*(((uint64_t *)v) - 1) |
Comment author: @mshinwell I will see if I can reproduce this. In the meantime, have you reproduced this failure on more than one machine? |
Comment author: enguerrand Reproduced it on a few different machines (all running 64-bit Debian), both VMs and physical hosts. |
Comment author: @mshinwell Some subset (although not all) of the symptoms exhibited in this report kind of look like stack overflow. To rule out such problems, can you try to reproduce this having run "ulimit -s unlimited" first? (I'm also trying to reproduce it but don't know yet whether I will be successful in doing so.) |
Comment author: enguerrand We suspected stack overflow too and we tried to reproduce with a very large stack size and unlimited, and crashes still happened. I tried again just now just to be sure and the result is still the same |
Comment author: @mshinwell Right, that's what I was expecting. |
Comment author: @mshinwell I wonder if it's because the parameter called "i" of the function mark_slice_darken is of type "int". I think it should be of type "mlsize_t" since it's a field index. I wouldn't be surprised if this gargantuan source file produces a block that has sufficiently many fields for "i" to overflow at the moment. Can you try changing that in your compiler tree (in 4.04 it's in byterun/major_gc.c, line 232) and seeing if the problem goes away? I haven't managed to reproduce it yet. |
Comment author: @mshinwell (I will produce a GPR once you confirm) |
Comment author: @mshinwell Although actually, if that were to be the bug, I think ocamlopt.opt would use more memory than it does when compiling the file. So that might not be it---but I think it's wrong in any case. |
Comment author: joris Indeed. I've launched some tests to try it, but since this is scanning an in-memory block and since ocamlopt never uses more than 1.5G RES in this case, I doubt it can overflow i |
Comment author: joris Reproduced with i changed to mlsize_t, crash in ml_mark_slice, but I don't have much more info: I forgot to build the runtime with -g this time |
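The memory argument in the last few comments can be checked with a little arithmetic. The sketch below assumes a 64-bit Linux target where `int` is 32 bits and each block field is one 8-byte word; the function names are invented for illustration and are not part of the OCaml runtime.

```c
#include <limits.h>
#include <stdint.h>

/* Smallest field index that a signed `int` counter cannot hold
   (assumes the usual 32-bit int of x86-64 Linux). */
static uint64_t first_overflowing_index(void)
{
    return (uint64_t)INT_MAX + 1;
}

/* Bytes occupied by the fields of a block large enough to reach that
   index, assuming one 8-byte word per field. */
static uint64_t min_block_bytes_for_overflow(void)
{
    return first_overflowing_index() * sizeof(uint64_t);
}
```

Under these assumptions a single block would need 16 GiB of fields before `i` could overflow, roughly ten times the 1.5G resident size reported above, which supports the view that the `int` index is not the cause here.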
Comment author: @mshinwell Given you seem to be able to reproduce this easily, can you try to get a valgrind trace? |
Comment author: @mshinwell (Please build the runtime with -g before doing that, in case it gives any more info) |
Comment author: @mshinwell In fact, another thing to try: please adjust the compiler Makefile so that it uses the debug runtime (you need a configure flag, and then I think it's "-runtime-variant d" for compile/link). This might pick something up as it enables a lot of assertions. |
Comment author: enguerrand I will give Valgrind a bit later. |
Comment author: @mshinwell OK, well maybe try the debug runtime first then, since that's probably going to be a bit easier to set up. |
Comment author: @mshinwell It's possible this is something related to the change between 4.02 and 4.03 that allowed the major GC to stop scanning in the middle of blocks and defer the remaining fields until later. I think it would be worth trying to disable this, since you do seem to have some large blocks (one of the backtraces shows a block with > 5 million fields). This should be achievable by changing byterun/major_gc.c, line 403 (in 4.04): this currently reads "end = size < end ? size : end;" and I think you should change it to "end = size". |
Comment author: @alainfrisch |
Comment author: enguerrand We tried to reproduce the issue on some servers to speed up compilation, and we noticed that we couldn't reproduce it there (so far, after running multiple loops for an hour or so). |
Comment author: @Armael I can reproduce the bug on my laptop, which has a Skylake CPU (i7-6600U). A single instance of the loop ran without crashing for around 20 minutes. However, shortly after adding 3 more instances in parallel, two of them crashed with "ocamlopt.opt got signal and exited". The one which was running from the start also crashed quickly after that, with "Fatal error: exception File "utils/timings.ml", line 86, characters 27-33: Assertion failed". Last relevant lines in dmesg are: [81744.710293] ocamlopt.opt[348]: segfault at 7fe040f1b000 ip 00000000006b1b04 sp 00007fff9f242bb0 error 4 in ocamlopt.opt[400000+36b000] |
Comment author: @Armael I just realized I was running all the parallel instances of the compiler in the same directory, on the same source file. So that's probably the cause of the assertion failure. |
Comment author: @Armael Just to be sure, I tried with 4 instances of the loop in 4 separate directories. 3 segfaulted almost instantly (after around 15 seconds): [84373.677406] ocamlopt.opt[1293]: segfault at ffffffffffde530b ip 00000000004b2b34 sp 00007ffd998b1280 error 7 in ocamlopt.opt[400000+36b000] The last one finally crashed 2 minutes after: |
Comment author: enguerrand joris mentioned the possibility that it might be related to the runtime being compiled with -O2 instead of -O1 since 4.03, testing 4.03 with -O1 might be interesting too |
Comment author: joris Indeed, I cannot reproduce after 1h with a runtime built with -O1, while I can reproduce in less than 10 minutes with an -O2 runtime (built with gcc 6.2) |
Comment author: enguerrand I confirm that after compiling the runtime with -O1 I cannot reproduce after a few hours of retries. (gcc 6.2 too) |
Comment author: @xavierleroy
Last Spring, another OCaml (industrial) user reported mysterious semireproducible crashes of a big ocamlopt-compiled program. The crashes would occur only on Skylake processors, and only in the presence of hyperthreading. If it is possible for you, it would be interesting to turn hyperthreading off (in the BIOS) and try to reproduce again. |
Comment author: joris Meh. Indeed, I cannot reproduce with HT disabled after one hour of 4 loops running concurrently... |
Comment author: joris For the record, I upgraded the UEFI firmware and the Intel microcode to the latest version; it makes no difference. |
Comment author: @xavierleroy Thanks for the quick re-test without hyperthreading. This story is consistent with what was observed at the other industrial user in May. Based on their observations and those in this PR, the problem lies in the combination of:
Is it crazy to imagine that gcc -O2 on the OCaml 4.03 runtime produces a specific instruction sequence that causes hardware issues in (some steppings of) Skylake processors with hyperthreading? Perhaps it is crazy. On the other hand, there was already one documented hardware issue with hyperthreading and Skylake: http://arstechnica.com/gadgets/2016/01/intel-skylake-bug-causes-pcs-to-freeze-during-complex-workloads/ |
Comment author: cullmann To solve our issue in May, we switched from GCC to clang as the base compiler for OCaml and the other parts of our toolchain. Since that switch, no such random crash has come up (and we have one Skylake machine that runs longer regression tests in an endless loop all year round: nothing seen since the move to clang, daily crashes before). The question is if it is feasible for you to: a) try clang, too |
Comment author: joris I believe i have found something interesting. At some point i did a careful review of sweep_slice function and i noticed this line:
This macro returns an unsigned integer because hd is of type header_t. If I understand the C standard correctly, it means that work -= size is equivalent to work = work - size, and the subtraction operands will be promoted to unsigned long. I checked the GCC tree SSA dump and this does indeed look like what GCC is doing. I tried to replace this line with
It does indeed make some difference in the SSA tree (it just adds a cast and performs the subtraction with signed temporaries), but I must admit I fail to understand why it would cause a segfault in this case, since the generated assembly uses signed instructions (notq then addq). It does make some difference in the assembly though, the loop condition is reversed: That being said, I have been trying to reproduce this bug for two hours and it has not crashed at -O2 with this change yet |
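The conversion described in this comment can be illustrated in isolation. The sketch below is not the runtime's code: `intnat` and `uintnat` are stand-in typedefs for the runtime's word-sized integer types, and the two helper functions are hypothetical. They model the statement quoted later in the thread as `work -= Whsize_hd(hd)`, with and without the `(intnat)` cast.

```c
#include <stdint.h>

/* Stand-ins for the runtime's word-sized integer types (assumption). */
typedef intptr_t  intnat;
typedef uintptr_t uintnat;

/* Signed budget minus unsigned size: the usual arithmetic conversions turn
   both operands into the unsigned type, and the result is converted back
   to intnat on assignment (implementation-defined; GCC wraps modulo 2^N). */
static intnat sub_unsigned(intnat work, uintnat size)
{
    work -= size;
    return work;
}

/* Same subtraction with the size cast first, so it is done on signed values. */
static intnat sub_with_cast(intnat work, uintnat size)
{
    work -= (intnat)size;
    return work;
}
```

With GCC on a two's-complement target both variants return the same value for any size that fits in `intnat`, for example -15 for a budget of 10 and a size of 25. This is consistent with the observation above that the cast changes the shape of the generated code rather than its result.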
Comment author: @xavierleroy I admire your ingenuity in searching for GCC miscompilation issues or undefined behaviors in OCaml's sources. Yet, those issues would produce reproducible crashes, which is not the case here. Also, they would not account for the fact that crashes are observed only with Skylake and hyperthreading. |
Comment author: @mshinwell I am pursuing independently the possibility of this being a CPU bug. |
Comment author: @alainfrisch
Couldn't there be undefined behaviors at the CPU level, which would lead to non-reproducible situations depending e.g. on physical memory addresses? Also, it is hard to get fully reproducible behavior at the OCaml level itself. Simply getting the current pid and printing it to a string (e.g. for logging purposes) can lead to a different allocation pattern in the program (depending on the length of the printed pid). |
Comment author: joris Honestly, i have found this and i spent some time looking at the assembly produced but it makes no sense to me why it would behave differently. Still, please find attached : -major_gc_with_intnat_cast_o2.s built with gcc -O2 but with the previously described (intnat) cast. It has not crashed in 17 hours.
As far as I can tell, it just aligned the .text of the hot loop, uses an additional temp register, and reverses the operands of work -= Whsize_hd(hd). Besides that... |
Comment author: joris
btw we tried to disable ASLR with setarch x86_64 -R ocamlfind opt... but it didn't help. |
Comment author: schommer During our investigation of the crashes I ran ocaml a few times with the undefined behavior sanitizer of clang, as well as clang-check and most warnings/errors reported were because of unaligned memory access. |
Comment author: joris So you might have to disregard basically everything i said. I kept the patched binary running in a loop and after respectively 28h and 32h two processes crashed... So it might just be some side effect affecting how often the bug is triggered. |
Comment author: joris Just to clarify: the cast patch has crashed, -fno-tree-vrp has not |
Comment author: @mshinwell I'd like to find out whether this problem manifests itself if the execution of the OCaml compiler is pinned to a particular processor core. You can probably do this on Linux by using "taskset" as a wrapper around ocamlopt.opt or else by altering the runtime (e.g. in asmrun/startup.c) to call sched_setaffinity. Could you try this on a Skylake system to see if it makes the problem go away? (As far as I understand it the problem has not been reproduced unless hyperthreading is enabled. If that is correct, hyperthreading should also be enabled for this experiment.) |
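For the second option mentioned here, a minimal sketch of pinning a process with `sched_setaffinity` on Linux follows. The helper names are hypothetical, and where such a call would go in asmrun/startup.c is not shown; this is only the affinity logic on its own.

```c
#define _GNU_SOURCE
#include <sched.h>

/* Lowest-numbered logical CPU the calling process may run on, or -1. */
static int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Pin the calling process to one logical CPU; 0 on success, -1 on error. */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Number of logical CPUs in the current affinity mask, or -1 on error. */
static int allowed_cpu_count(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    return CPU_COUNT(&set);
}
```

The wrapper approach needs no code change: something like `taskset -c 1 ocamlopt.opt ...` restricts the compiler to logical CPU 1 for that one invocation.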
Comment author: @xavierleroy Lucky me, my new workstation at Inria is a Skylake Xeon, 4 cores, 8 threads, so I could play with the original repro case. Without setting processor affinities: it is easy to reproduce the crash (in a few minutes at most) by running at least 5 copies of the compilation task in parallel. With 4 copies the crash happens but takes much longer. With 3 copies or less I didn't observe it in a couple of hours. By setting processor affinities, I see the crash in a few minutes with only two compilations run in parallel, provided they are mapped on the same physical core (e.g. logical cores 1 and 5 on my machine). Two parallel runs mapped to different physical cores (e.g. logical cores 1 and 2 on my machine) have been running for 1 hour already without a glitch, but I'll let them run overnight. Finally, I'm also trying two parallel runs mapped to the same logical processor, for reference. More results tomorrow. |
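Which logical cores share a physical core varies between machines, so the "1 and 5" pairing above is specific to that workstation. On Linux the pairing can be read from sysfs; the sketch below parses the CPU list format used by `thread_siblings_list` (for example `1,5` or `0-3,8`). The function names are illustrative only.

```c
#include <stdio.h>
#include <stdlib.h>

/* Return 1 if `cpu` appears in a sysfs CPU list such as "1,5" or "0-3,8". */
static int cpu_in_list(const char *list, int cpu)
{
    const char *p = list;
    while (*p != '\0' && *p != '\n') {
        char *end;
        long lo = strtol(p, &end, 10);
        if (end == p)
            return 0;                 /* malformed entry */
        long hi = lo;
        if (*end == '-') {            /* range "lo-hi" */
            p = end + 1;
            hi = strtol(p, &end, 10);
            if (end == p)
                return 0;
        }
        if (cpu >= lo && cpu <= hi)
            return 1;
        p = (*end == ',') ? end + 1 : end;
    }
    return 0;
}

/* 1 if logical CPUs a and b are hyperthread siblings, 0 if not,
   -1 if the sysfs file cannot be read. */
static int share_physical_core(int a, int b)
{
    char path[128], buf[256];
    snprintf(path, sizeof path,
             "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list", a);
    FILE *f = fopen(path, "r");
    if (f == NULL)
        return -1;
    char *line = fgets(buf, sizeof buf, f);
    fclose(f);
    return line == NULL ? -1 : cpu_in_list(buf, b);
}
```

On a machine laid out like the one described, `share_physical_core(1, 5)` would be expected to return 1 and `share_physical_core(1, 2)` to return 0.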
Comment author: @xavierleroy More results from my overnight runs:
|
Comment author: @mshinwell As an update, this is still being investigated at Intel. |
Comment author: @ygrek Any chance this is what the Intel microcode update talks about?
|
Comment author: joris Everything in this description matches this issue. I will have to wait until Monday to test this though. |
Comment author: joris The microcode update appears to fix the crash (microcode version 0xba). I believe this issue can be closed |
Comment author: @mshinwell Interesting. I will see if I can get Intel to confirm this. Let's leave this issue open for the moment. |
Comment author: @mshinwell I looked at the code of sweep_slice, which was conjectured to be one of the functions affected (see above, and the attachment opt.s); indeed it appears that perhaps it might trigger the problem. There is a loop with fewer than 64 instructions using both the %ah register and %rax. The use of %ah is probably quite unusual, but GCC is generating it to deal with the GC tag bits inside a header word. |
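Why the compiler would reach for `%ah` at all can be seen from the header layout. The sketch below assumes the OCaml 4.x layout of a 64-bit header word (tag in bits 0-7, GC colour in bits 8-9, size in words from bit 10 upwards); the helper names are invented here and are not the runtime's macros.

```c
#include <stdint.h>

typedef uint64_t header_t;   /* stand-in for the runtime's header type */

/* Build a header under the assumed layout: size | colour | tag. */
static header_t make_header(uint64_t wosize, unsigned color, unsigned tag)
{
    return (wosize << 10) | ((header_t)(color & 3u) << 8) | (tag & 0xffu);
}

/* The two GC colour bits, bits 8 and 9 of the header. */
static unsigned color_of(header_t hd)
{
    return (unsigned)((hd >> 8) & 3u);
}

/* Bits 8..15 of the header. When the header sits in %rax, this is exactly
   the byte that x86-64 exposes as %ah. */
static unsigned second_byte(header_t hd)
{
    return (unsigned)((hd >> 8) & 0xffu);
}
```

Because the colour bits live in the second-lowest byte, testing or rewriting them can be done with byte operations on `%ah` while the full header stays in `%rax`, which is the register mix described in this comment.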
Comment author: @mshinwell By the way the original Intel description is here, on the page numbered 65: |
Comment author: @xavierleroy After updating the microcode on my Xeon E3 Skylake, the test that used to crash in a few minutes has been running for 6 hours without a hiccup. I'll leave it running for a few days, but it looks like the problem is nailed down and fixed. Update: the test ran for 50 hours and produced no failures. |
Comment author: @mshinwell This problem also affects Kaby Lake systems (erratum KBL095); however, I'm unsure if a microcode fix has been released publicly for such systems. The best solution is to disable hyperthreading for the moment. Yesterday Fred and I experimented on a Kaby Lake machine by changing the generated assembly from GCC for major_gc.c so that it didn't reference registers such as %ah. The problem was not reproducible after that change, whereas it was almost immediately reproducible before. If there are no further developments by the start of next week, I think we can close this issue. |
Comment author: joris I see you changed the description, which should help people searching for this issue in the future. It should be noted that since the issue is triggered by the major GC, it does not only affect the compiler. Any long-running OCaml program has a high chance of triggering this, and it will not always crash. You can get corrupted data in memory and never crash. As an example, we tried to deploy some tool on a large Xeon Skylake cluster, several hundred processes. They didn't crash in hours, but very quickly we saw corrupted data being sent over the network or written into the database. So anyone reading this in the future: don't assume this only affects the compiler, and don't run critical code on Skylake/Kaby Lake without updating the firmware if you don't want to end up in a nightmarish situation. |
Comment author: @mshinwell Agreed, I've updated the title of this issue. |
Comment author: @mshinwell Closing this issue as per the above. |
Original bug ID: 7452
Reporter: enguerrand
Assigned to: @mshinwell
Status: closed (set by @mshinwell on 2017-06-09T17:02:32Z)
Resolution: not a bug
Priority: normal
Severity: crash
Platform: Linux
OS: Debian
Version: 4.03.0
Target version: later
Category: back end (clambda to assembly)
Monitored by: @gasche @ygrek @yallop @alainfrisch
Bug description
While switching a 4.02.3 codebase to 4.03 recently, we stumbled upon some random crashes from the compiler and, more rarely, occurrences of bad assembly code being generated (which then failed to assemble), or instructions being trapped at runtime while the compiler was running.
These problems occur on an OCaml source file generated using the Extprot library.
The problem doesn't seem to happen all the time.
Most of the time the file compiles successfully, but given enough retries the compiler eventually crashes. Example dmesg output after a few crashes:
[22241.838551] ocamlopt.opt[48175]: segfault at ffffffffffde7768 ip 000055f75e412e3c sp 00007ffc3ee31de0 error 7 in ocamlopt.opt[55f75e0b6000+613000]
[22985.879907] ocamlopt.opt[48221]: segfault at af8 ip 00005564455169bd sp 00007ffc9f36b130 error 4 in ocamlopt.opt[556445006000+613000]
[23936.341126] ocamlopt.opt[48306]: segfault at 5837 ip 00005641554a16c8 sp 00007ffe1278f8e0 error 4 in ocamlopt.opt[56415514a000+613000]
[25395.780978] ocamlopt.opt[48445]: segfault at ffffffffffde7608 ip 0000557e25ea5cf4 sp 00007ffc2eac79d0 error 5 in ocamlopt.opt[557e25b49000+613000]
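The kernel lines above carry more information than is obvious at first glance. The sketch below parses one such line and derives two things: the offset of the faulting instruction inside the binary (ip minus the mapping base in brackets) and the kind of access, using the x86 page-fault error code (bit 0: protection violation rather than a missing page, bit 1: write, bit 2: user mode). The struct and function names are made up for this example.

```c
#include <stdio.h>
#include <string.h>

struct segv_record {
    unsigned long long addr, ip, sp, base, size;
    unsigned error;
};

/* Parse a kernel "segfault at ..." line; 0 on success, -1 otherwise. */
static int parse_segv(const char *line, struct segv_record *r)
{
    const char *p = strstr(line, "segfault at ");
    if (p == NULL)
        return -1;
    int n = sscanf(p,
                   "segfault at %llx ip %llx sp %llx error %x in %*[^[][%llx+%llx]",
                   &r->addr, &r->ip, &r->sp, &r->error, &r->base, &r->size);
    return n == 6 ? 0 : -1;
}

/* Offset of the faulting instruction from the start of the mapping. */
static unsigned long long ip_offset(const struct segv_record *r)
{
    return r->ip - r->base;
}

/* Decode individual bits of the x86 page-fault error code. */
static int is_protection_fault(const struct segv_record *r) { return (r->error & 1u) != 0; }
static int is_write(const struct segv_record *r)            { return (r->error & 2u) != 0; }
static int is_user_mode(const struct segv_record *r)        { return (r->error & 4u) != 0; }
```

For the first line of the log this yields an instruction offset of 0x35ce3c and a user-mode write to a mapped but protected page (error 7), whereas the "error 4" lines are user-mode reads of unmapped addresses. The offsets differ from crash to crash, which matches the remark that the backtraces do not always show the same thing.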
The backtraces obtained for these crashes do not always show the same thing. Example backtraces can be found in the attached archive.
More rarely, the compiler generates an assembly file that fails to assemble:
/tmp/camlasmc92578.s: Assembler messages:
/tmp/camlasmc92578.s:1005308: Error: operand type mismatch for `add'
Where the line 1005308 is: add $2300, $5199
Or:
/tmp/camlasm601e1c.s: Assembler messages:
/tmp/camlasm601e1c.s:820172: Error: operand type mismatch for `or'
Where the line 820172 is: orq $139950828249720, %rax
We haven't noticed as of now any misbehaviour in a successfully compiled and running instance of this file, but the issue is still very new for us so we will be watching it closely.
Steps to reproduce
The problem doesn't seem to happen all the time; at least, it doesn't crash at every build. We sometimes don't witness a crash before 30 minutes of retries.
Steps to reproduce:
Both OCaml 4.03 and 4.04 have been observed to trigger the problem.
Sample file is attached as the test case used to reproduce the problem: Extprot library must be installed in order to compile the file, since it was generated using Extprot. (we use the latest version from Opam)
Test case can be found in the attachment (test.ml)
To reproduce:
Just compile this file, preferably in a loop, with this command:
while ocamlfind opt -c -g -bin-annot -ccopt -g -ccopt -O2 -ccopt -Wextra -ccopt '-Wstrict-overflow=5' -thread -w +a-4-40..42-44-45-48-58 -w -27-32 -package extprot test.ml -o test.cmx; do echo "ok"; done
Additional information
If the crash hasn't occurred for some time, then once it has occurred again at least once, the probability of the compiler crashing seems to increase
Crash was witnessed running ocamlopt and ocamlopt.opt
File attachments