Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0004800OCamlOCaml generalpublic2009-05-20 11:562013-09-04 18:11
Reporterjerhoud 
Assigned To 
PrioritynormalSeverityfeatureReproducibilityN/A
StatusacknowledgedResolutionopen 
PlatformOSOS Version
Product Version3.10.2 
Target VersionFixed in Version 
Summary0004800: native compilation optimization of tuple assignment
DescriptionVery often I need to define several variables according to a condition (or a pattern matching), so I write something like :

let (a, b) = if cond then (a1, b1) else (a2, b2) in ...

or

let (a, b) =
  match value with
    | case1 -> (a1, b1)
    | case2 -> (a2, b2)
    | ...
in ...

In these situation, the compiler allocates memory and creates a tuple. This is quite inefficient. I would like the compiler to optimize this allocation away.
Tagspatch
Attached Filespatch file icon 0001-optimization-unboxing-of-let-bound-tuples.patch [^] (6,445 bytes) 2012-01-19 18:46 [Show Content]
patch file icon 0001-avoid-float-unboxing-when-uneeded.patch [^] (2,557 bytes) 2012-01-20 19:50 [Show Content]

- Relationships
related to 0005879acknowledged Static exception handlers (i.e. well-disciplined gotos!) 

-  Notes
(0005533)
frisch (developer)
2010-05-28 17:18

The experimental branch inplace_let, which you can find in the OCaml Subversion repository, implements the suggested optimization. Feel free to try it and report any performance difference you could observe.
(0006332)
frisch (developer)
2011-12-16 18:15

I've done some micro benchmarks.

Consider the code below:

let () =
  let r = ref 0 in
  for k = 1 to 10000 do
    for i = 1 to 100000 do
      let (x, y) =
        if i mod 2 = 0 then (1, i * 2)
        else (2, i * 3)
      in
      r := !r * x + y
    done;
  done;
  Printf.printf "-> %i\n%!" !r

and the alternative version:

...
      let b = i mod 2 = 0 in
      let x = if b then 1 else 2 in
      let y = if b then i * 2 else i * 3 in
...

The timings are:
  3.80s without the patch
  3.25s for the alternative version (with and without the patch)
  2.60s with the patch



Another test:

let f x y z =
  let a, b =
    match z with
    | Some (u, v) -> u, v * 2
    | None -> 10, 20
  in
  a * x + b * y

let () =
  let r = ref 0 in
  for k = 1 to 2000 do
    for i = 1 to 100000 do
      r := !r + f k i (Some (k, i));
      r := !r + f k i None
    done;
  done;
  Printf.printf "-> %i\n%!" !r


The timings are:
  1.80s without the patch
  1.60s with the patch


I've also found a case where the patch degrades performances:

let () =
  let x = ref 0. in
  let y = ref 0. in
  for k = 1 to 2000 do
    for i = 1 to 100000 do
      let x, y =
        if i mod 2 = 0 then x, y else y, x
      in
      x := !x +. 1.;
      y := !y -. 1.;
    done;
  done;
  Printf.printf "-> %f, %f\n%!" !x !y


The timings are:
  4.15s without the patch
  4.60s with the patch

If we replace floats with ints, the patch improves performances:
  0.77s without the patch
  0.51s with the patch

The following variant:

let () =
  let x = ref 0. in
  let y = ref 0. in
  for k = 1 to 2000 do
    for i = 1 to 100000 do
      let x, y =
        if i mod 2 = 0 then x, y else y, x
      in
      x := !x +. 1.;
      y := !y -. 1.;
      x := !x +. 1.;
      y := !y -. 1.;
    done;
  done;
  Printf.printf "-> %f, %f\n%!" !x !y

is again faster with the patch:

  8.32s without the patch
  7.52s with the patch

and so is this one:

let () =
  let x = ref 1. in
  let y = ref 2. in
  for k = 1 to 2000 do
    for i = 1 to 100000 do
      let x, y, z =
        if i mod 2 = 0 then
          x, y, 1.
        else
          y, x, 2.
      in
      x := !x +. z;
      y := !y +. 1.;
    done;
  done;
  Printf.printf "-> %f, %f\n%!" !x !y


  4.89s without the patch
  3.81s with the patch


It seem that most of the time, we get significant speedups. Concerning the slowdown: I could not tell from the generated assembly code where it comes from; probably some cache or alignment issue.
(0006407)
gasche (developer)
2011-12-20 11:54
edited on: 2011-12-20 11:59

A did a quick review.

High-level points:
1. I'm not sure why you decided to support tuples only, and not records (float boxing issues? see below). Similarly, why doesn't transform_return support Lswitch? Ltrywith?
2. In the optimization case, you initialize your variables with `Const_pointer 0`. This relies on the non-local invariant that the pattern optimized only match one-word data. You should probably add a comment justifying the safety of this initialization.
3. You did not add tests to the testsuite. It's always better when patches come with test coverage. Adding tests is boring stuff, and doing it bit by bit when you write a patch on a specific aspect of the compiler is a good routine to have. Plus you sometimes find bugs.

Nitpicking:
- I'm not fond of the silent shadowing of for_let
- It's not clear why one `seq` call needs List.rev and not the other. If there is some evaluation order to respect, you should comment about it. Maybe this would be simpler if assign_pat just returned a list of side-effects to apply? Plus that would potentially result in slightly better results (less `seq []` cases).
- in `assign_pat` you handle Tpat_var specifically but not Tpat_any; I'm under the impression that both would "do the right thing" in the default case; why did you choose to single out one and not the other?
- The control flow of assign_pat is convoluted and non-optimal; in particular, the default case will compute `for_let loc lam pat' (copy_ids ids)` even when the optimization doesn't apply and this is therefore discarded. I would rather suggest the following flow: make the reference local to the function (this simplifies the interface) and, in the wildcard case, raise Exit if !opt if false.

PS: it's easier if you give an URL to the diff(s).
  http://caml.inria.fr/cgi-bin/viewvc.cgi/ocaml/branches/inplace_let/bytecomp/matching.ml?r1=10475&r2=10474&pathrev=10475 [^]

(0006409)
frisch (developer)
2011-12-20 12:50

Hi Gabriel,

Thanks for the detailed feedback.

1. Tuple only: the typical idiom is to write

     let (x, y) = if ... then (e1, e2) else (e3, e4)

   The intention is not really to create tuples, but to return several bindings.
   I don't think people explicitly create records to deconstruct them immediately.

   Lswitch, Ltrywith would be easy to handle, indeed.

2. You're right that the initialization is not very nice. I'm not 100% sure this is safe, actually. I propose to wait to see if there is some interest in the patch (i.e. if it is considered useful) before investigating the actual implementation.

I don't understand your last proposal about assign_pat; the wildcard case in assign_pat does not, by itself, disallow the optimization (opt can be set to true afterwards).
(0006414)
gasche (developer)
2011-12-20 14:20

Here is the sort of things I had in mind. I didn't even try to compile the code, it's just a draft. I may also have misunderstood your code (in which case I'd welcome clarifying in-code comments) in which case this wouldn't be equivalent.

let assign_pat loc pat lam =
  let rec assign opt pat lam =
    match pat.pat_desc, lam with
      | Tpat_tuple patl, Lprim(Pmakeblock _, lams) ->
        List.concat (List.map2 (assign true) patl lams)
      | Tpat_tuple patl, Lconst(Const_block (_, scl)) ->
        List.concat (List.map2 (fun p sc -> assign true p (Lconst sc)) patl scl)
      | _ ->
        if not opt then raise Exit;
        match pat with
          | Tpat_any -> []
          | Tpat_var id -> [Lassign(id,lam)]
          | _ ->
            let ids = pat_bound_idents pat in
            let ids = List.map (fun id -> id, Ident.rename id) ids in
            let pat' = alpha_pat ids pat in
            [for_let loc lam pat' (copy_ids ids)]
  in assign false pat lam
(0006416)
frisch (developer)
2011-12-20 14:43

The behavior is different.

The original code "pushes" the bindings down the bound expression, duplicating it on branching node. If one of the bindings "cancels" out a literal tuple, the optimization is triggered. This is achieved by sharing the reference across all the calls to assign_pat. For instance, consider:

  let x, y =
     if ... then z else (1, 2)

Here, transform_return will push the binding (x,y) to the two branches; on the second fails, the corresponding call to assign_pat detects that the optimization operates. But on the first call to assign_pat (for "z"), we need the fallback case.
(0006418)
gasche (developer)
2011-12-20 15:19

Indeed, that's... wicked.

I hope you'll publish a testsuite with all those interesting corner cases covered and explained. Maybe an in-code comment is also good.

I'm wondering however, could this example result in a performance degradation? Imagine that "z" is the usual case and (1,2) is an exceptional case (that's how I expect this pattern to be used). The result of your optimization, if I'm not mistaken, will allocate two dummy-initialized variables x and y, get the z couple, set x to z.fst and y to z.snd, and go on. The current compiler would just get z, bind x to z.fst and y to z.snd, and move on.

The difference is tiny (in particular, of course you don't allocate more) but it would be interesting to know if the difference is noticeable. Maybe it would be safer to optimize only when all branches are tuples or do not return (assert false, raise...).

A good thing with either choices is that the performance-maniac compiler-guru user can tune its code to get the semantics he wants: by turning z into `fst z, snd z` in your example he could ensure the restricted optimization fires, and by writing `let (x, y) as _c = ...` he can disable the optimistic one.


PS: I understand that you don't want to invest more time on this until the usefulness of the optimization is demonstrated and accepted. My technical comments on the code should not be taken as an opinion.
(0006735)
gasche (developer)
2012-01-19 18:52
edited on: 2012-01-19 18:57

While looking at Alain's implementation again, I had an idea of a different implementation of the optimization. I implemented it as the attached patch.

The idea is to use Lstaticcatch/Lstaticraise instead of Lassign.

Note: I have added comments liberally, but barring that the patches are of the exact same size.

Alain patch transforms

     let (x,y) =
        let foo = ... in
        if foo then (1, 2) else (3,4)
     in bar

into (approximatively; in this special case there is no name freshening)

     let x = dummy in let y = dummy in
     begin
      let foo = ... in
      if foo then
        (let x1 = 1 in let y1 = 2 in x <- x1; y <- y1)
      else
        (let x2 = 3 in let y2 = 4 in x <- x2; y <- y2)
     end;
     bar

While my implementation produces

     catch
       let foo = ... in
       if foo then
         (let x1 = 1 in let y1 = 2 in exit x1 y1)
       else
        (let x2 = 3 in let y2 = 4 in exit x2 y2)
     with x y ->
       bar

I liked the use of control-flow instead of mutation here -- and the removal of those unsatisfying "dummy initialization".

The performance profile is different from Alain's, but related. This patch tends to do slightly better in some cases, and slightly worse in others. See at the end of this message the compared timings on the micro-benchmarks posted by Alain above.

I believe the implementation is more efficient in general, but maybe does not benefit from float or reference unboxing optimization in Lassign, and is therefore slower in this case -- I'm not familiar with unboxing optimizations, but that's the most plausible justification I have; I'd be glad to confirmed or contradicted.
If this is the case, then I'm wondering whether float unboxing could be performed on catch/exit as well.


Performance numbers on Alain's tests, in the order corresponding to his message.

test1.ml (integers)
  3.792s no-opt
  2.694s assign-opt
  2.422s exit-opt

test2.ml (integers)
  1.979s no-opt
  1.644s assign-opt
  1.551s exit-opt

test3.ml (float references)
  4.648s no-opt
  4.783s assign-opt
  4.691s exit-opt

test4.ml (float references)
  8.871s no-opt
  7.779s assign-opt
  8.843s exit-opt

test5.ml (float references)
  5.334s no-opt
  4.026s assign-opt
  5.272s exit-opt

(0006758)
frisch (developer)
2012-01-20 15:13

Gabriel, I really like your proposal. Indeed, it's much cleaner to use control-flow rather than mutation.
(0006763)
gasche (developer)
2012-01-20 20:01

I have explored the boxing issue a bit. I just uploaded an experimental patch (very lightly tested) that fixes the issue I could identify by reading the cmm code.

Unfortunately, in the process of testing this patch on my examples, I discovered that my measurements are much more variable that I thought : on the test machine, I get upto 1 second variation with the same optimizer, which makes 4-seconds measurements unusable...

I will try to find more robust measurements, but in the meantime, here are the relevant region of the cmm code on test5.ml.

No optimization at all ('match/1037' is the identifier of the tuple
(let (x,y) = ...); 'x' and 'y' are inlined):

(data global "camlTest5__3" int 1277 "camlTest5__3": double 1.)
(data global "camlTest5__4" int 1277 "camlTest5__4": double 2.)

  (let
    (match/1037
       (if (== (+ (<< (mod (>>s i/1033 1) 2) 1) 1) 1)
         (alloc 3072 x/1030 y/1031 "camlTest5__3")
         (alloc 3072 y/1031 x/1030 "camlTest5__4"))
     y/1035 (load (+a match/1037 8))
     x/1034 (load match/1037))
     (...))

With Alain's assign patch:

(data global "camlTest5__3" int 1277 "camlTest5__3": double 1.)
(data global "camlTest5__4" int 1277 "camlTest5__4": double 2.)

  (let (x/1034 1a y/1035 1a z/1036 1a)
    (if (== (+ (<< (mod (>>s i/1033 1) 2) 1) 1) 1)
      (seq (assign z/1036 "camlTest5__3")
        (assign y/1035 y/1031) (assign x/1034 x/1030))
      (seq (assign z/1036 "camlTest5__4")
        (assign y/1035 x/1030) (assign x/1034 y/1031)))
    (...))

With the catch/exit patch:

  (catch
    (if (== (+ (<< (mod (>>s i/1033 1) 2) 1) 1) 1)
      (let (z/1048 1. z/1042 (alloc 1277 z/1048))
        (exit 1 z/1042 y/1031 x/1030))
      (let (z/1047 2. z/1039 (alloc 1277 z/1047))
        (exit 1 z/1039 x/1030 y/1031)))
  with(1 z/1036 y/1035 x/1034)
    (...))

Notice the degradation with 'z' that is explicitely allocated instead
of being a global symbol. That's the defect I tried to fix with the
new (independent) patch.

With my 'avoid undeeded unboxing' patch:

(data global "camlTest5__3" int 1277 "camlTest5__3": double 1.)
(data global "camlTest5__4" int 1277 "camlTest5__4": double 2.)

  (catch
    (if (== (+ (<< (mod (>>s i/1033 1) 2) 1) 1) 1)
      (let z/1042 "camlTest5__3"
        (exit 1 z/1042 y/1031 x/1030))
      (let z/1039 "camlTest5__4"
        (exit 1 z/1039 x/1030 y/1031)))
  with(1 z/1036 y/1035 x/1034)
    (...))
(0008562)
frisch (developer)
2012-12-04 17:42
edited on: 2012-12-04 17:44

I believe the second patch (0001-avoid-float-unboxing-when-uneeded.patch) is wrong. If a mutable variable is needed in both boxed and unboxed form, one should not use the unboxed compilation scheme (otherwise assignments to the variable are only visible to context which use the unboxed form).

(0008563)
frisch (developer)
2012-12-04 18:01

FWIW, I could not observe any significant impact on performance of applying Gabriel's first patch (but not the second one) for our entire test suite (maybe because we avoid the optimized construction in performance critical code). There is a small reduction of the total number of allocated bytes (e.g. for some tests, around 3% fewer allocated bytes).
(0008751)
frisch (developer)
2013-01-15 13:38
edited on: 2013-01-15 13:44

With static exceptions (0005879), it is possible to code by hand Gabriel's solution. For instance:

let () =
  let r = ref 0 in
  for k = 1 to 10000 do
    for i = 1 to 100000 do
      try
        if i mod 2 = 0 then raise (`Cont (1, i * 2))
        else raise (`Cont (2, i * 3))
      with `Cont (x, y) ->
        r := !r * x + y
    done;
  done;
  Printf.printf "-> %i\n%!" !r


and this achieves identical performances. Of course, it is not very nice to write code like that, but it's good to know that even if no optimization for tuple assignment is ever included, static exceptions give a way to improve speed of performance critical code by hand.

Another observation is that the changes required to support escaping try...with blocks with a static exception would benefit to Gabriel's approach as well, making it possible to optimize the case where the right-hand side build tuples in terminal position of a try...with body.


- Issue History
Date Modified Username Field Change
2009-05-20 11:56 jerhoud New Issue
2009-05-20 14:11 doligez Status new => acknowledged
2010-05-28 17:18 frisch Note Added: 0005533
2010-05-28 17:20 frisch Status acknowledged => assigned
2010-05-28 17:20 frisch Assigned To => frisch
2011-12-16 18:15 frisch Note Added: 0006332
2011-12-20 11:54 gasche Note Added: 0006407
2011-12-20 11:57 gasche Note Edited: 0006407 View Revisions
2011-12-20 11:59 gasche Note Edited: 0006407 View Revisions
2011-12-20 12:50 frisch Note Added: 0006409
2011-12-20 14:20 gasche Note Added: 0006414
2011-12-20 14:43 frisch Note Added: 0006416
2011-12-20 15:19 gasche Note Added: 0006418
2012-01-18 10:37 frisch Assigned To frisch =>
2012-01-18 10:37 frisch Assigned To => frisch
2012-01-18 10:37 frisch Status assigned => acknowledged
2012-01-18 10:37 frisch Assigned To frisch =>
2012-01-19 18:46 gasche File Added: 0001-optimization-unboxing-of-let-bound-tuples.patch
2012-01-19 18:52 gasche Note Added: 0006735
2012-01-19 18:57 gasche Note Edited: 0006735 View Revisions
2012-01-20 15:13 frisch Note Added: 0006758
2012-01-20 19:50 gasche File Added: 0001-avoid-float-unboxing-when-uneeded.patch
2012-01-20 20:01 gasche Note Added: 0006763
2012-12-04 17:42 frisch Note Added: 0008562
2012-12-04 17:44 frisch Note Edited: 0008562 View Revisions
2012-12-04 18:01 frisch Note Added: 0008563
2013-01-15 13:38 frisch Note Added: 0008751
2013-01-15 13:40 frisch Note Edited: 0008751 View Revisions
2013-01-15 13:40 frisch Relationship added related to 0005879
2013-01-15 13:44 frisch Note Edited: 0008751 View Revisions
2013-09-04 18:11 doligez Tag Attached: patch


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker