Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0006501OCamlOCaml backend (code generation)public2014-07-29 00:112014-08-11 09:50
ReporterAntoine Mine 
Assigned To 
PrioritynormalSeverityfeatureReproducibilityalways
StatusresolvedResolutionfixed 
Platformx86_64OSLinuxOS Version
Product Version4.02.0+beta1 / +rc1 
Target Version4.02.1+devFixed in Version4.03.0+dev 
Summary0006501: Implementing Z.of_int as %identity in Zarith (was: zarith sample code segfaults)
DescriptionThis is an upstream from Zarith's bug 0001409
https://forge.ocamlcore.org/tracker/index.php?func=detail&aid=1409&group_id=243&atid=1095 [^]

The client code bug.ml (in Zarith's bug report) uses the Z module compiled by Zarith into zarith.cmxa. When compiling bug.ml with ocamlopt and the z.cmx file from the compilation of Zarith is present, the code segfaults in the GC:

#0 0x000000000044fc28 in invert_pointer_at ()
#1 0x000000000044fcb3 in invert_root ()
#2 0x0000000000441ef0 in caml_do_roots ()
0000003 0x000000000044fd63 in do_compaction ()
0000004 0x00000000004501e8 in caml_compact_heap ()
0000005 0x000000000045056d in caml_compact_heap_maybe ()
0000006 0x0000000000443507 in caml_major_collection_slice ()
0000007 0x0000000000443c71 in caml_minor_collection ()
0000008 0x00000000004429f5 in caml_garbage_collection ()
0000009 0x000000000045179c in caml_call_gc ()
0000010 0x0000000000432147 in camlHashtbl__mem_1128 () at hashtbl.ml:181
0000011 0x000000000041054d in camlBug__fun_1093 ()
0000012 0x000000000041068b in camlBug__count_1066 ()
0000013 0x000000000041057f in camlBug__fun_1093 ()
0000014 0x000000000041057f in camlBug__fun_1093 ()
0000015 0x00000000004105c6 in camlBug__count_1057 ()
0000016 0x000000000041057f in camlBug__fun_1093 ()
0000017 0x0000000000410ae5 in camlBug__entry ()
0000018 0x000000000040e689 in caml_program ()
0000019 0x000000000045197e in caml_start_program ()
0000020 0x0000000000000000 in ?? ()

Well, it might be a problem in Zarith, which does magic mixing of OCaml, C and asm.
But now for the weird part. If the z.cmx file is not present when bug.ml is compiled, the program does not segfault. This is always reproducible on my machine. I've checked that the code generated by ocamlopt in both cases is actually quite different.
Steps To Reproduce1) Download zarith and unpack it
https://forge.ocamlcore.org/frs/download.php/1199/zarith-1.2.1.tgz [^]

2) ./configure and make it

3) download bug.ml and copy it in zarith source
https://forge.ocamlcore.org/tracker/download.php/243/1095/1409/261/bug.ml [^]

4) check that z.cmx is in the working directory as well as bug.ml

5) compile with "ocamlopt zarith.cmxa bug.ml -cclib -L."

6) ./a.out segfaults

7) remove z.cmx

8) recompile with "ocamlopt zarith.cmxa bug.ml -cclib -L."

9) ./a.out now works
TagsNo tags attached.
Attached Files

- Relationships

-  Notes
(0011934)
dim (developer)
2014-07-29 10:48

When z.cmx is present functions from the Z module might be inlined.

I diffed the two clambda and there are actually very few differences. Only [Z.zero] and [Z.of_int] are inlined.
(0011935)
Boris Yakobowski (reporter)
2014-07-29 19:16
edited on: 2014-07-29 19:21

The bug disappears (and the code becomes faster :-)) when the following definition is used for memoize

let memoize f =
  let cache = Hashtbl.create 100 in
  fun n ->
    try Hashtbl.find cache n
    with Not_found ->
      let x = f n in
      Hashtbl.add cache n x;
      x

However, the real problem probably lies in the loop

      let total = ref Z.zero in
      for fst_size = 0 to size do
        let snd_size = size - fst_size in
        let fst_count = fst_count fst_size in
        let snd_count = snd_count snd_size in
        let count = Z.( * ) fst_count snd_count in
        total := Z.(+) !total count
      done;
      !total

If total is instead initialized to Z.of_int 1, and the code returns 'Z.sub !total (Z.of_int 1)' instead, the segfault disappears. It is likely that 'total' changes from OCaml integers to Zarith internal representation at some point, which confuses either the code generator or the GC.

EDIT: printing the linear code with initial value 0 and 1 shows that very little changes between the two versions. Either the problem is in the GC, or this is the wrong track.

(0011936)
Antoine Mine (reporter)
2014-07-29 19:54

Thanks for the bug analysis. Indeed, Z.of_int (and co.) seems to be the culprit. It is defined as %identity, which was probably a bad idea. Redefining it as a C external "ml_z_of_int" solves the issue. I think I will go with this solution and patch Zarith.

Nevertheless, I wonder if there is some relationship with Lazy.t as: 1) bug.ml is the first program I've seen that mixes Lazy and Zarith, 2) I've never had problems with Zarith for a long long time, 3) both do colorful magic mixing values and blocks.

A wild guess is that OCaml performs type-based optimizations (or, more precise, using the abstraction "is a scalar", "is a block", "is polymorphic") and uses the knowledge that a mutable object's initial value is the result of calling %identity on an integer. However, the type of Z.of_int is "int -> t" where t is abstract, which may (should?) prevent type propagation.
(0011938)
Boris Yakobowski (reporter)
2014-07-30 09:59

My remark about replacing 'ref Z.zero' by 'ref (Z.of_int 1)' may actually be a red herring. When doing so, the arguments to some GC functions (prior to the crash) do change in a significant way. It is theoretically possible that the switch from OCaml short integers to Zarith long integers occur earlier because of this. This would change the allocation profile of the code.
(0011939)
dim (developer)
2014-07-30 11:09

The typer doesn't really cares about the fact that "Z.of_int" is declared as an external. This is used by later phases in the compiler, so I don't think that's the problem. The assembly code has very few differences between the inlined and non-inlined versions anyway.

Lazy shouldn't be a problem here. I tried replacing the lazy by a ref and the bug is still there.

I played a bit with the configuration in caml_z.c, setting Z_PERFORM_CHECK to 1 produces a different error:

  ml_z_check: wrong custom block for arg2 at ml_z_mul:1339

also reducing the size of the minor heap make the bug happen much faster, for instance with OCAMLRUNPARAM=s=256. The bug always happen some time after the first minor collection, so something might get corrupted during the promotion.
(0011940)
dim (developer)
2014-07-30 12:50

I spoke a bit too fasst. Looking at -dlive total is indeed considered as always an integer and so ignored by the gc. ocamlobjinfo shows that [Z.zero] is an integer constant and not a pointer constant which is probably the reason.
(0011943)
xleroy (administrator)
2014-07-30 17:48

I confirm that the problem is a missing GC root. A bignum gets allocated, then freed by a minor collection, then used again as argument to Z.mul. The typing issue mentioned by "dim" is very likely the culprit. This could be a misuse of "%identity" as mentioned by Antoine, or something else. I'll investigate more.
(0011944)
xleroy (administrator)
2014-07-30 19:01

Indeed, the problem lies with "%identity" when it is used to inject integers into a type that can also contain heap pointers. A similar issue can be observed with Obj.magic, which is also "%identity":

let f n =
  let r = ref (Obj.magic 0) in (* instead of "ref []" *)
  for i = 1 to n do
    r := String.create 200 :: !r;
    ignore (List.length !r) (* scan the list to see if it is consistent *)
  done;
  ()

let _ = f 10000

This segfaults shortly after the first GC, because r is not a root, Why is it not a root? Because it's a local mutable variable (after let-ref optimization) that is initialized with an integer (a C-- Cconst_int, not a C-- Cconst_pointer). So, it has machine type "int" and not "address", and is excluded from the roots given to the GC.

The safe, short-term fix is to stop implementing "Z.of_int" as "%identity" and use a C function instead, as Antoine suggested.

Additionally, Zarith could define

type t' = Zero | One | Other of Obj.t

let zero : t = Obj.magic Zero
let one : t = Obj.magic One

so that zero, one would be constants of machine type address. But it doesn't work for minus_one nor for of_int in general.

In the longer term, there might be better solutions with some changes to the ocamlopt compiler. One would be a primitive like "%identity" that forces its result to have machine type "address". Unlike "%identity", it would have to be kept by the front-end all the way to C--. Another solution would be to treat mutable let-bound variables differently during C-- to Mach compilation, so that their machine type is inferred not just from their initial value but from all their assignments as well.
(0011945)
xleroy (administrator)
2014-07-30 19:09

I reclassify this PR as a feature wish so that it doesn't block the release of OCaml 4.02.
(0011946)
Antoine Mine (reporter)
2014-07-30 20:41

Thank you very much for your help and the clear explanation. I've patched Zarith to use a C external instead of %identity for of_int. It works well.
(0011999)
xleroy (administrator)
2014-08-11 09:50

Tentative fix in SVN trunk, commit r15078. The machine type of a C-- mutable variable is now the lub of the types of its initial value and all the values assigned to it, with type Int being a subtype of Addr. In the problematic examples from this PR, it causes the mutable variables to have type Addr and be treated as GC roots (correctly).

- Issue History
Date Modified Username Field Change
2014-07-29 00:11 Antoine Mine New Issue
2014-07-29 10:48 dim Note Added: 0011934
2014-07-29 19:16 Boris Yakobowski Note Added: 0011935
2014-07-29 19:21 Boris Yakobowski Note Edited: 0011935 View Revisions
2014-07-29 19:21 Boris Yakobowski Note Edited: 0011935 View Revisions
2014-07-29 19:54 Antoine Mine Note Added: 0011936
2014-07-30 09:59 Boris Yakobowski Note Added: 0011938
2014-07-30 11:09 dim Note Added: 0011939
2014-07-30 12:50 dim Note Added: 0011940
2014-07-30 17:48 xleroy Note Added: 0011943
2014-07-30 17:48 xleroy Status new => confirmed
2014-07-30 19:01 xleroy Note Added: 0011944
2014-07-30 19:09 xleroy Note Added: 0011945
2014-07-30 19:09 xleroy Severity minor => feature
2014-07-30 19:09 xleroy Target Version => 4.02.1+dev
2014-07-30 19:09 xleroy Summary zarith sample code segfaults, but only when compiled with z.cmx present => Implementing Z.of_int as %identity in Zarith (was: zarith sample code segfaults)
2014-07-30 20:41 Antoine Mine Note Added: 0011946
2014-08-11 09:50 xleroy Note Added: 0011999
2014-08-11 09:50 xleroy Status confirmed => resolved
2014-08-11 09:50 xleroy Resolution open => fixed
2014-08-11 09:50 xleroy Fixed in Version => 4.03.0+dev


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker