Method calls are 2 times slower with 4.0 native-compiled code than with 3.12 native-compiled code #5674

Closed
vicuna opened this issue Jul 7, 2012 · 9 comments
vicuna commented Jul 7, 2012

Original bug ID: 5674
Reporter: giavitto
Assigned to: @garrigue
Status: closed (set by @garrigue on 2012-07-10T08:35:03Z)
Resolution: fixed
Priority: normal
Severity: minor
Platform: MAC
OS: Mac OS X
OS Version: 10.7.4
Version: 4.00.0+beta2/+rc1
Target version: 4.00.0+dev
Fixed in version: 4.00.0+dev
Category: back end (clambda to assembly)
Monitored by: @gasche @hcarty

Bug description

I noticed a slowdown in several of my programs that use objects. The attached file shows an example that exhibits a slowdown by a factor of 2 between native compilation with 3.12 and native compilation with 4.0.

This slowdown also exists for bytecode, but with a factor of 1.3.
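
Since the attachment is not reproduced here, the following is only a minimal sketch of the kind of test described (an object method called in a tight loop, timed with the Unix library); the class, method, and iteration count are illustrative, not the original file:

(* Illustrative benchmark sketch, not the original attachment.
   Link with the Unix library, e.g. ocamlopt unix.cmxa test.ml -o test. *)
class counter = object
  val mutable n = 0
  method bump x = n <- n + x; n
end

let () =
  let o = new counter in
  let t0 = Unix.gettimeofday () in
  for _i = 1 to 50_000_000 do
    ignore (o#bump 1)
  done;
  Printf.printf "result %d in %f\n" (o#bump 0) (Unix.gettimeofday () -. t0)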

Steps to reproduce

Compile and run the attached file with the various versions of the compiler. The program must be linked with the Unix library to print the time usage.

On my mac it produces:

The OCaml compiler, version 4.00.0+beta2

resultat 98765432 in 37.940392

The OCaml native-code compiler, version 4.00.0+beta2

resultat 98765432 in 1.009058

The Objective Caml compiler, version 3.12.1+rc1

resultat 98765432 in 28.797747

The Objective Caml native-code compiler, version 3.12.1+rc1

resultat 98765432 in 0.535362

File attachments

vicuna commented Jul 7, 2012

Comment author: @gasche

I could partially reproduce the bug, with strange results:

  • with 3.12.1, the test takes roughly 0.5s
  • on SVN trunk, the test takes 1.0s as reported
  • I backtracked to just before the merge of the binannot branch, and the test seems to take 0.9s rather reliably

So we might be looking at several performance-affecting changes, some more severe than others.

vicuna commented Jul 7, 2012

Comment author: @gasche

I could reproduce the bug with SVN revision 11121:

The OCaml native-code compiler, version 3.13.0+dev4 (2011-06-20)

vicuna commented Jul 7, 2012

Comment author: @gasche

With the bytecode compiler, the performance story is different. The test case above terminates in 5s on my machine just before the merge of bin-annot (as with 3.12), and in 10s just after. There is clearly a performance regression introduced here; it is simply very noticeable in bytecode and much less so in native code.

Manual inspection of the -dlambda output reveals that the method definitions are no longer emitted as a single multi-argument (curried) function; this looks like an arity-raising problem. The following kind of code was emitted before the bin-annot merge:

       (apply (field 10 (global CamlinternalOO!)) class/1066
              (makeblock 0 m1/1035
                (function self-1/1041 x/1042
                  (seq (+:=1 (field 0 (global Test2!))) x/1042))
                 ....))

While the following is emitted just after the merge:

            (apply (field 10 (global CamlinternalOO!)) class/1044
              (makeblock 0 m1/1013
                (function self-1/1019
                  (function x/1020
                    (seq (+:=1 (field 0 (global Test2!))) x/1020)))
                 ...))

My guess is that the real performance regression is introduced here (in the giant bin-annot patch), while other changes also influence the native measurement.
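
To connect the two dumps to source-level code, here is a standalone sketch (not the original test; f1 and f2 are illustrative names). f2 corresponds to the first dump, a single function taking self and x; f1 corresponds to the second dump, a one-argument function whose result is another closure. Whether the two abstractions of f1 end up merged into one two-parameter function is precisely the optimization at stake here; when they are not merged, every full application has to go through an intermediate closure.

(* Illustrative only: source-level counterparts of the two -dlambda shapes. *)
let f2 self x = self + x          (* one two-parameter function *)
let f1 self = fun x -> self + x   (* a one-parameter function returning a closure *)

let () = Printf.printf "%d %d\n" (f2 1 2) (f1 1 2)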

vicuna commented Jul 8, 2012

Comment author: @garrigue

I've tried fixing this without trying to understand what binannot did,
just by merging curried lambdas in Simplif.simplify_lets.
The patch is attached to this report.

This seems to completely fix the problem for native code: I get exactly the same speed as with 3.12.
For bytecode, I still get a 10% slowdown, but this seems small enough.

Does anybody see a problem with merging this patch?
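
A toy sketch of the idea, on a simplified lambda-like AST (this is neither the attached patch nor the actual Simplif.simplify_lets code): directly nested one-parameter Curried functions are collapsed into a single multi-parameter function.

(* Toy AST and rewrite, for illustration only. *)
type kind = Curried | Tupled

type lam =
  | Lvar of string
  | Lfunction of kind * string list * lam
  | Lapply of lam * lam list

let rec merge_curried lam =
  match lam with
  | Lfunction (Curried, params, body) ->
      (* Simplify the body first; if it is itself a Curried function,
         absorb its parameters into the outer one. *)
      (match merge_curried body with
       | Lfunction (Curried, params', body') ->
           Lfunction (Curried, params @ params', body')
       | body' -> Lfunction (Curried, params, body'))
  | Lfunction (Tupled, params, body) ->
      Lfunction (Tupled, params, merge_curried body)
  | Lapply (f, args) ->
      Lapply (merge_curried f, List.map merge_curried args)
  | Lvar _ -> lam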

vicuna commented Jul 8, 2012

Comment author: @garrigue

I've looked at the -dlambda code produced with the patch, and the only difference seems to be whether an array of strings is shared or not. As this cannot make such a big difference (it is only used for class initialization), the cause should rather be in the bytecode interpreter. However, I see no change there either...
After more tests, it might just be the C compiler used for compiling ocamlrun.
I.e., the ocamlrun 3.12.1 compiled by hand is slower than the one from MacPorts.

vicuna commented Jul 8, 2012

Comment author: @gasche

I think I now understand where the bug comes from.

The "currification" of consecutive abstractions is performed in
bytecomp/translcore.ml:transl_function:

and transl_function loc untuplify_fn repr partial pat_expr_list =
  match pat_expr_list with
    [pat, ({exp_desc = Texp_function(_, pl, partial')} as exp)]
    when Parmatch.fluid pat ->
      let param = name_pattern "param" pat_expr_list in
      let ((_, params), body) =
        transl_function exp.exp_loc false repr partial' pl in
      ((Curried, param :: params),
       Matching.for_function loc None (Lvar param) [pat, body] partial)
  | ({pat_desc = Tpat_tuple pl}, _) :: _ when untuplify_fn ->
      [...]
      let params = List.map (fun p -> Ident.create "param") pl in
      ((Tupled, params),
       Matching.for_tupled_function loc params
         (transl_tupled_cases pats_expr_list) partial)
      [...]
  | _ ->
      let param = name_pattern "param" pat_expr_list in
      ((Curried, [param]),
       Matching.for_function loc repr (Lvar param)
         (transl_cases pat_expr_list) partial)

This pattern match requires the exact Texp_function constructor in order to
produce a Curried function. The functions of interest in our case were
produced by typing/typeclass.ml:make_method:

let make_method self_loc cl_num expr =
  let mkpat d = { ppat_desc = d; ppat_loc = self_loc } in
  let mkid s = mkloc s self_loc in
  { pexp_desc =
      Pexp_function ("", None,
        [mkpat (Ppat_alias (mkpat (Ppat_var (mkid "self-*")),
                            mkid ("self-" ^ cl_num))),
         expr]);
    pexp_loc = expr.pexp_loc }

This code is produced during method typing (as a desugaring of "method
foo = ..."), then typed, which produces a Texp_function, whose
translation to Lambda code in turn calls the transl_function
function above. For currification to be performed, we need to make
sure that the method body (the expr argument of make_method) has
Texp_function as its head constructor.
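
As a concrete, purely illustrative example of such a method (the -dlambda dumps above show a body that increments a global counter and returns its argument):

let count = ref 0

class c = object
  (* The body of m1 desugars to a function, so after make_method one would
     expect fun self-* -> fun x -> ..., which transl_function could merge
     into a single two-parameter function. *)
  method m1 (x : int) = incr count; x
end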

But the parser does not produce a function directly:

concrete_method :
    METHOD override_flag private_flag label strict_binding
      { mkloc $4 (rhs_loc 4), $3, $2, ghexp(Pexp_poly ($5, None)) }

There is this silent Pexp_poly constructor that, in this case, carries
no type information (it is used for polymorphic methods to carry the
explicit polymorphic annotation). The content of Pexp_poly,
$5, is indeed a Pexp_function in our case.

Those Pexp_poly constructors used to be erased during the typing phase.
The bin-annot merge has added a new Texp_poly constructor to preserve
that information explicitly in the typedtree. In the bin-annot patch:

| Pexp_poly(sbody, sty) ->
    [...]
    begin
      let exp =
        match (expand_head env ty).desc with
          Tpoly (ty', []) ->
            let exp = type_expect env sbody ty' in
            re { exp with exp_type = instance env ty }
        | Tpoly (ty', tl) ->
            [...]
            re { exp with exp_type = instance env ty }
        | Tvar _ ->
            let exp = type_exp env sbody in
            let exp = {exp with exp_type = newty (Tpoly (exp.exp_type, []))} in
            unify_exp env exp ty;
            re exp
        | _ -> assert false
      in
      re { exp with exp_desc = Texp_poly(exp, cty) }
    end


So here is the problem: with the extraneous Texp_poly constructor
inserted in the AST, a simple optimization at the typedtree level
that proceeds by naive matching on the AST no longer fires.
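
A toy illustration of the effect (not compiler code): an optimization that recognizes directly nested functions by their shape stops as soon as a wrapper node sits in between.

(* Toy AST, for illustration only. *)
type expr =
  | Fun of string * expr   (* fun x -> e *)
  | Poly of expr           (* wrapper node, in the spirit of Texp_poly *)
  | Var of string

(* Collect the parameters of directly nested functions, the way
   transl_function recognizes consecutive abstractions. *)
let rec collect_params = function
  | Fun (x, body) ->
      let xs, body' = collect_params body in
      (x :: xs, body')
  | e -> ([], e)

let () =
  (* Without the wrapper, both parameters are merged... *)
  assert (collect_params (Fun ("self", Fun ("x", Var "x")))
          = (["self"; "x"], Var "x"));
  (* ...with the wrapper in between, merging stops after the first one. *)
  assert (collect_params (Fun ("self", Poly (Fun ("x", Var "x"))))
          = (["self"], Poly (Fun ("x", Var "x"))))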

I'm not sure what the best way to solve the issue is. Jacques's solution
of re-implementing the currifying optimization at the Lambda level
seems to be a good approach: Texp_poly nodes are erased during the
translation to Lambda code, so it will work and should be robust to
further changes of the typedtree format. At the same time, I have two
small concerns with this:

  • I am not sure Jacques's simple implementation is the correct one
    (it seems to neglect the Curried/Tupled information of the outer
    function; maybe this is not right)

  • I suspect the reason why this transformation was done at the
    typedtree level might be a synergy with pattern-matching
    compilation, which is also done at the typedtree -> lambda-code
    transition. If that was the case, relocating the optimization to
    a lower level (with no pattern information) could result in
    performance degradation in some cases.

vicuna commented Jul 8, 2012

Comment author: @garrigue

  • I am not sure Jacques's simple implementation is the correct one
    (it seems to neglect the Curried/Tupled information of the outer
    function; maybe this is not right)

The kind of the outer function is checked in the when clause of the inner match.

  • I suspect the reason why this transformation was done at the
    typedtree level might be a synergy with pattern-matching
    compilation, which is also done at the typedtree -> lambda-code
    transition. If that was the case, relocating the optimization to
    a lower level (with no pattern information) could result in
    performance degradation in some cases.

Right, for native code one may wrongly create a tuplified function.
So removing the Texp_poly node (moving it to exp_extra) seems a better solution.

vicuna commented Jul 8, 2012

Comment author: @garrigue

I added a patch, exp_extras.diffs, moving the Texp_poly and Texp_newtype nodes to exp_extra.
I also fixed untypeast.ml and typedtreeIter.ml.
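
For illustration only, a much-simplified sketch of the exp_extra idea (these are not the actual Typedtree definitions): the annotation is stored alongside the node instead of wrapping it, so code that matches on exp_desc still sees the function underneath.

(* Simplified sketch, not the real Typedtree types. *)
type exp_extra =
  | Texp_poly_extra                 (* "this expression had a poly annotation" *)
  | Texp_newtype_extra of string    (* "this expression introduced a newtype" *)

type expression = {
  exp_desc  : exp_desc;
  exp_extra : exp_extra list;       (* annotations that do not change the shape *)
}
and exp_desc =
  | Texp_ident of string
  | Texp_function of string * expression   (* parameter name, body *)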

vicuna commented Jul 10, 2012

Comment author: @garrigue

Fixed by applying the patch exp_extras.diffs, which moves Texp_poly and Texp_newtype to exp_extra.
This way, binannot no longer introduces new nodes.
Committed in trunk and 4.00, revisions 12680 and 12681.

@vicuna vicuna closed this as completed Jul 10, 2012
@vicuna vicuna added this to the 4.00.0 milestone Mar 14, 2019
@vicuna vicuna added the bug label Mar 20, 2019