Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

File "camlinternalMod.ml", line 63, characters 6-12: Assertion failed. #6329

Closed
vicuna opened this issue Feb 18, 2014 · 23 comments
Closed

File "camlinternalMod.ml", line 63, characters 6-12: Assertion failed. #6329

vicuna opened this issue Feb 18, 2014 · 23 comments

Comments

@vicuna
Copy link

vicuna commented Feb 18, 2014

Original bug ID: 6329
Reporter: gmelquiond
Assigned to: @mshinwell
Status: resolved (set by @mshinwell on 2017-10-20T09:00:38Z)
Resolution: duplicate
Priority: low
Severity: major
Version: 4.01.0
Target version: 4.06.0 +dev/beta1/beta2/rc1
Category: runtime system and C interface
Tags: recmod

Bug description

There is an assertion failed in the OCaml runtime when dynamically loading a recursive module. Unfortunately, the bug is highly sensitive to conditions: changing the size of the code segment, changing the size of the data segment, changing the size of the read-only data segment, changing the number of relocations, etc, all cause the bug to vanish. As a consequence, we were unable to produce a small testcase, sorry. This bug report is mostly for the record. That said, it is perfectly reproducible and we were able to trigger it on three systems with OCaml 4.01.0 and Coq 8.4pl3.

Steps to reproduce

git clone git://scm.gforge.inria.fr/why3/why3.git
cd why3
git checkout a8c13660788c726a134fab9109e384592881e8d7
autoconf
./configure
cat > src/util/config.ml << EOF
let version = "1234567812345678123456781234567812345678123456781234567812345678123456781234567"
let builddate = ""
let libdir = ""
let datadir = ""
let localdir = Some ""
let compile_time_support = [ "", ""; "", "" ]
EOF
make src/coq-tactic/.why3-vo-opt

@vicuna
Copy link
Author

vicuna commented Feb 20, 2014

Comment author: @alainfrisch

Could it be the case that two units of the same name where dynlinked in the same process? Do you use -pack?

@vicuna
Copy link
Author

vicuna commented Feb 20, 2014

Comment author: @alainfrisch

$ make src/coq-tactic/.why3-vo-opt
...
...
Ocamldep src/util/extmap.ml
Ocamldep src/util/strings.ml
Ocamldep src/util/lists.ml
Ocamldep src/util/opt.ml
Ocamldep src/util/util.ml
Ocamldep src/util/config.ml
make: *** No rule to make target `src/coq-tactic/.why3-vo-opt'. Stop.

Do we need to install anything else in addition to OCaml itself?

@vicuna
Copy link
Author

vicuna commented Feb 20, 2014

Comment author: gmelquiond

I don't think there are two units with the same name. (Any way to make sure?) We are using -pack, so that all our modules are submodules of Why3.

Also, you need Coq, since the dynamically-loaded module is actually a Coq plugin and it depends on various Coq modules. Due to the high sensitivity on the size of the segments and the number of relocations, I was unfortunately not able to remove Coq from the equation.

@vicuna
Copy link
Author

vicuna commented May 30, 2014

Comment author: @mshinwell

I tried to reproduce this, building a clean 4.01 OCaml installation, and then compiling coq-8.4pl3 from the distribution tarball. With both of the installation directories on my PATH, the configure script for why3 said:

             Summary

Verbose make : no
OCaml compiler : yes
Version : 4.01.1+dev2-2013-12-18+CLOSED
Library path : /home/mark/dev/pr/6329/ocaml/lib/ocaml
Native compilation : yes
Profiling : no
IDE : no (lablgtk2 not found)
Bench tool : no (sqlite3 not found)
Documentation : no (rubber not found)
Coq support : yes
Version : 8.4pl3
Library path : /home/mark/dev/pr/6329/coq-install/lib/coq
"why3" tactic : no (camlp5o not found)
Realization support : yes
FP arithmetic : no (Flocq >= 2.2 not found)
PVS support : no (pvs not found)
Isabelle support : no (isabelle not found)
Frama-C support : no
Hypothesis selection : no (ocamlgraph not found)
Installable : yes
Binary path : ${exec_prefix}/bin
Lib path : ${exec_prefix}/lib/why3
Data path : ${prefix}/share/why3
Ocaml Library : /home/mark/dev/pr/6329/ocaml/lib/ocaml/why3
Relocatable : no

However, I still get the same error "make: *** No rule to make target `src/coq-tactic/.why3-vo-opt'. Stop.". I think it might be because some dependency of lib/coq-tactic/why3tac.cmxs can't or won't build.

Could you please adjust the repro instructions, and then we can try again? Thanks.

@vicuna
Copy link
Author

vicuna commented May 30, 2014

Comment author: @mshinwell

Ah, the error was caused by missing camlp5...

@vicuna
Copy link
Author

vicuna commented May 30, 2014

Comment author: @mshinwell

OK, so I managed to get that make command to complete successfully.

Maybe you could prepare a tarball that includes all of the sources and objects from a tree where this bug has been observed.

Also, it would be useful if you could get this to fail having altered your OCaml source tree such that instead of just bombing out with an assertion failure, it prints the values of:

  • n
  • Obj.tag n
  • Obj.size n
  • Array.length comps

@vicuna
Copy link
Author

vicuna commented Mar 10, 2015

Comment author: gmelquiond

This bug started popping up again, this time on our regression server. So I took the chance to create a minimal environment for reliably reproducing it. Since the chroot is 1GB-large, I will not post its tarball here. ("Minimal" meant the smallest Debian system able to compile the Why3 plugin nowadays, so it is quite large.)

Due to the high sensitivity to the compilation conditions (see original report: even adding a string causes the bug to vanish), I have not tried modifying the standard library of Ocaml to add some debug printfs. However, I am willing to keep the chroot around and try any non-intrusive method you can think of. In particular, if you can think of a way of recovering n, Obj.tag n, etc from e.g. gdb, please do to tell.

@vicuna
Copy link
Author

vicuna commented May 7, 2015

Comment author: @mshinwell

Can you please put the tarball somewhere on the web such that I can download it, and try to reproduce this problem?

@vicuna
Copy link
Author

vicuna commented Nov 15, 2015

Comment author: @xavierleroy

Any news about this mysterious failure?

@vicuna
Copy link
Author

vicuna commented Feb 11, 2016

Comment author: @damiendoligez

I'm in contact with gmelquiond who will give me the repro file.

@vicuna
Copy link
Author

vicuna commented Apr 4, 2016

Comment author: @damiendoligez

I got the repro file (364 MB).

@vicuna
Copy link
Author

vicuna commented Dec 8, 2016

Comment author: @mshinwell

@doligez Are you able to reproduce this bug using the repro file (which only you have)?

@vicuna
Copy link
Author

vicuna commented Feb 16, 2017

Comment author: @xavierleroy

So, is this issue reproducible or not?

@vicuna
Copy link
Author

vicuna commented Feb 21, 2017

Comment author: @damiendoligez

I'm sorry I dropped the ball on this one. I still have the repro file and I'll try it shortly.

@vicuna
Copy link
Author

vicuna commented Mar 8, 2017

Comment author: @mshinwell

I have successfully reproduced this, and will investigate.

@vicuna
Copy link
Author

vicuna commented Mar 9, 2017

Comment author: @mshinwell

This problem arises because an empty module is passed to [CamlinternalMod.update_mod] as the [n] argument. The empty module lives in the data section of a shared library being loaded via natdynlink. It appears to be defined within another module called Why3.Mlw_ty. By chance, it is the last thing in the "data_begin .. data_end" range for Why3.Mlw_ty. On my system the address of camlWhy3__Mlw_ty__data_end is 0x7fffecda6000; this is also the address of the empty module, and the preceding word is indeed zero (= header for zero-sized array).

When CamlinternalMod.update_mod is in the "Module" case it examines the empty module's tag, using [caml_obj_tag], to ensure it is indeed zero. This should work, because the area should be registered in the page table as static data. Unfortunately it is not; the function returns 1001 meaning "out of heap", and the assertion fails.

This stems from fragility in the page table infrastructure. What should happen is that the entire data_begin .. data_end range for each of the modules in the .cmxs should be registered as static data. Unfortunately the calculation for the above end address goes like this:

(gdb) p/x (0x7fffecda6000 - 1) & (0xffffffffffffffff << 12)
$4 = 0x7fffecda5000

which means that various things, including the aforementioned empty module, won't get recognised as being in static data.

To make this robust I think you'd have to do two things:

  1. Ensure data_begin .. data_end sections are a multiple of Page_size in size;
  2. Ensure that shared libraries get linked such that every data_begin therein lands at an address which is divisible by the page size.

I'm not even sure 2 is possible in general. I would like to propose a perhaps more satisfactory solution which is that I stop procrastinating about #203 with a view to completely removing the page table for 4.06 and imposing the no-naked-pointers conditions by default. We can try to write some kind of helper tool for migration based on 4.05 (maybe a different OPAM switch with more checks). The forthcoming multicore work also has no page table, so this would be consistent with that.

We could change camlinternalMod.ml in 4.05 so that it uses a different function for testing the tag, or we could be brave and temporarily disable the assertion.

@vicuna
Copy link
Author

vicuna commented Mar 9, 2017

Comment author: @mshinwell

Thinking about this a bit more, maybe it isn't as bad as all that (although I still favour the proposed solution above). The situation probably only arises because there's an empty block immediately before "_data_end". So adding a padding word at that point would likely fix the problem, because the subsequent page would get registered in the table.

I suppose in terms of the more general issue: we can assume that shared libraries won't be loaded at an address that isn't zero modulo the OS page size, as memory protection needs to be applied (and that presumably works in page units). So we might overshoot when registering one data_begin .. data_end pair in two ways: firstly, into another such pair---which is fine---or secondly, off the end of the data section of the shared library itself. However we shouldn't go further than the next OS page boundary if that condition holds. So maybe it's ok, just.

@vicuna
Copy link
Author

vicuna commented Jul 19, 2017

Comment author: @alainfrisch

Mark: do you think you can propose a fix in time for 4.06's freeze date?

@vicuna
Copy link
Author

vicuna commented Sep 20, 2017

Comment author: @xavierleroy

An inglorious quick fix would be to remove the assertion "Obj.tag n = 0" in the "Module" case of "update_mod".

In the longer term @shinwell is right that an extra world of padding at the end of the data segment can make sense.

@vicuna
Copy link
Author

vicuna commented Sep 20, 2017

Comment author: @mshinwell

Sorry I completely missed @Frisch's ping on this one. I will try to do something for 4.06.

@vicuna
Copy link
Author

vicuna commented Oct 19, 2017

Comment author: @gasche

Ping again :-)

@vicuna
Copy link
Author

vicuna commented Oct 19, 2017

Comment author: @mshinwell

I will try to look at this tomorrow (Friday)

@vicuna
Copy link
Author

vicuna commented Oct 20, 2017

Comment author: @mshinwell

Superceded by #1437

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants