Mantis Bug Tracker

View Issue Details

ID: 0005204
Project: OCaml
Category: OCaml backend (code generation)
View Status: public
Date Submitted: 2011-01-14 15:23
Last Update: 2012-12-04 18:50
Reporter: frisch
Assigned To:
Priority: normal
Severity: feature
Reproducibility: have not tried
Status: acknowledged
Resolution: open
Platform:
OS:
OS Version:
Product Version:
Target Version:
Fixed in Version:
Summary: 0005204: Unboxing parameters and more kinds of local bindings

Description:
Unboxing for floats and boxed integers is currently attempted when an identifier is explicitly bound by a let/match construction, but not when it is bound by a lambda (i.e. as a function parameter).

For instance, x and y are unboxed in the loop below:

let unbox x y =
  let x = x +. 0. in
  let y = y +. 0. in
  let r = ref 0. in
  for _i = 0 to 1000 do
    r := !r +. x *. y
  done;
  !r

but not if the "+. 0." are removed (because the lets are then simplified away).


A related problem is that the decision to unbox a local binding "let x = E1 in E2" is made only by looking at E1. For instance, if E1 is an access to a float field of a record, unboxing is tried, but not if the record field is not statically a float.

So x and y are unboxed in:

type r = {x: float; y:float}

let unbox r =
  let x = r.x in
  let y = r.y in
  ...

but not if we define the type as:

type 'a r = {x: 'a; y:'a}
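As with the first example, one can presumably force the unboxing attempt for the polymorphic record by re-binding the fields through a float operation. This is only a sketch of the workaround; whether the compiler actually unboxes here depends on the heuristics described above:

```ocaml
type 'a r = {x: 'a; y: 'a}

let unbox (r : float r) =
  (* re-binding through a float operation makes the bound expression
     statically a float, so the unboxing attempt is triggered *)
  let x = r.x +. 0. in
  let y = r.y +. 0. in
  x *. y

let () = assert (unbox {x = 2.; y = 3.} = 6.)
```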


It would be better to try unboxing whenever the bound expression is a float (or a boxed integer). Type information is not available at that point, but one could find out quite easily whether unboxing should be tried by looking at how the identifier is used in the body of the local binding / function.

Or do people see cases where allowing unboxing in more cases is problematic?
Tags: No tags attached.
Attached Files:
  bench1.ml (313 bytes) 2012-03-28 18:20
  bench2.ml (366 bytes) 2012-03-28 18:20
  bench3.ml (343 bytes) 2012-03-28 18:20
  bench4.ml (382 bytes) 2012-03-28 18:21
  bench5.ml (452 bytes) 2012-03-28 18:21
  bench6.ml (325 bytes) 2012-03-28 18:21
  noiz.ml (3,619 bytes) 2012-03-29 00:18
  ptst.ml (3,683 bytes) 2012-03-29 00:18
  nbody.ml (3,891 bytes) 2012-03-29 17:52
  square.ml (2,211 bytes) 2012-03-29 21:58
  square_mesh.ml (1,043 bytes) 2012-04-12 22:43
  square_mesh.bin (86,419 bytes) 2012-04-12 22:43
  square2.ml (2,024 bytes) 2012-04-12 22:45
  square3.ml (133,848 bytes) 2012-04-12 23:20
  square4.ml (64,995 bytes) 2012-04-17 19:49
  square5.ml (65,075 bytes) 2012-04-17 19:53

- Relationships
parent of 0005133 (acknowledged): Improve float allocation
Not all the children of this issue are yet resolved or closed.

-  Notes
(0005775)
frisch (developer)
2011-01-17 13:44

I've created a branch more_unboxing in the OCaml SVN. It changes the unboxing strategy. Instead of deciding whether an identifier can be unboxed by looking at the right-hand side of its definition, it considers any identifier that is actually used in an unboxed float context as a candidate for unboxing. Function parameters can thus be unboxed as well.

This strategy is a little bit fragile, and it might break in the presence of GADTs. It would be better to propagate type information further down the compiler to allow a cleaner strategy.
(0005776)
frisch (developer)
2011-01-17 13:50

The code below runs in 1.75 seconds on the branch vs. 2 seconds on the trunk:

let f x y =
  let r = ref 0. in
  for j = 1 to 100 do
    for i = 1 to 10000000 do
      r := !r +. x *. y
    done
  done

let () =
  f 10. 20.
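For readers who want to reproduce such measurements, here is a minimal self-contained sketch that times the benchmark with the standard library's Sys.time (the loop counts are reduced here so the example runs quickly; they are not the ones used for the reported timings):

```ocaml
(* Time a thunk and report the elapsed CPU time. *)
let time label f =
  let t0 = Sys.time () in
  let result = f () in
  Printf.printf "%s: %.3fs\n" label (Sys.time () -. t0);
  result

let f x y =
  let r = ref 0. in
  for _j = 1 to 100 do
    for _i = 1 to 100_000 do
      r := !r +. x *. y
    done
  done;
  !r

let () =
  let sum = time "f 10. 20." (fun () -> f 10. 20.) in
  (* 10^7 additions of 200.0, all exact in double precision *)
  assert (sum = 200. *. 10_000_000.)
```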
(0005780)
Christophe Troestler (reporter)
2011-01-18 15:43

On my machine, it runs in 2s (more_unboxing) vs. 2.32s (3.12 branch), so a similar kind of speedup.
(0005781)
Christophe Troestler (reporter)
2011-01-18 16:55

It is even more impressive on a program such as the one below, which shows that, with such a patch, one can factorize computations while losing very little performance. It indeed runs in 2.01s (more_unboxing) vs. 3.45s (3.12 branch). (The use of -inline does not really make any difference.)

let g x y =
  let r = ref 0. in
  for i = 1 to 10000000 do
    r := !r +. x *. y
  done;
  !r

let f x y =
  let r = ref 0. in
  for j = 1 to 100 do
    r := !r +. g x y
  done

let () =
  f 10. 20.
(0005782)
frisch (developer)
2011-01-18 18:04

Christophe, thanks for your feedback on the branch!

I'm not so surprised that -inline does not make any difference. It would probably make a bigger difference if you swap the number of iterations in g and f (so that there are many function calls and boxing of floats).

What I don't really understand is why this last example is so much slower than the previous one with 3.12.
(0005783)
Christophe Troestler (reporter)
2011-01-18 23:42

> It would probably make a bigger difference if you swap the number of
> iterations in g and f (so that there are many function calls and boxing of
> floats).

Strangely enough, that is not the case.
(0007218)
frisch (developer)
2012-03-28 18:57

The current branch implements the following strategy (always unbox,
and box lazily):

 * All float local variables used somewhere in an unboxed context are
   kept in unboxed form.

 * If the variable is used in a boxed context (e.g. stored somewhere,
   or passed to another function), one also keeps a local boxed
   variable.

 * The local boxed variable is not updated when the variable is
   modified.

 * When accessed, if the variable is mutated somewhere in the
   function, the local boxed variable is refreshed: if its content is
   different from the current unboxed variable, we box the unboxed
   variable and store the result in the boxed variable.


In general, this reduces allocation as much as possible: floats are
boxed only when needed, and at most once between two mutations of the
variable. And as long as the boxed version is not needed, there is no
boxing at all. The bad case is when the boxed version is needed quite
often, with different values. In that case, the current strategy
introduces a small overhead (comparing the content of the boxed and
unboxed versions). (We could remove this overhead in cases where we
know syntactically whether the variable has been modified or not since
it was last accessed in boxed form.) Moreover, delaying boxing until
it is needed might introduce some latency.
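The bullets above can be illustrated with a source-level analogy (this is not the compiler's actual code; it just mimics the "always unbox, box lazily" bookkeeping): the working value lives in unboxed form, and a cached boxed copy is refreshed only when a consumer needs the boxed form and the cache is stale.

```ocaml
(* Source-level analogy of the lazy-boxing strategy. *)
type lazy_boxed = {
  mutable unboxed : float;       (* the working, unboxed value *)
  mutable boxed : float option;  (* cached boxed copy, if any *)
}

let make x = { unboxed = x; boxed = None }

(* Mutation deliberately does NOT update the boxed cache. *)
let set lb x = lb.unboxed <- x

(* Accessing the boxed form refreshes the cache only when its content
   differs from the current unboxed value, so boxing ("allocation")
   happens at most once between two mutations. *)
let get_boxed lb =
  match lb.boxed with
  | Some b when b = lb.unboxed -> b  (* cache still valid: no boxing *)
  | _ ->
    lb.boxed <- Some lb.unboxed;     (* "box" once *)
    lb.unboxed

let () =
  let v = make 1.0 in
  set v 2.0;
  assert (get_boxed v = 2.0);
  set v 2.0;                         (* same value: cache stays valid *)
  assert (get_boxed v = 2.0)
```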

I've uploaded a series of micro-benchmarks to illustrate different cases:

bench1: 4.80 -> 2.60

  A typical case where a local variable should really be unboxed, but
  it is not, because it is used (once) in unboxed form. Note that we
  could obtain the same speedup by manually adding some "+. 0." or
  "*. 1." around the boxing occurrence (but we cannot tell that to real
  users).

bench2: 1.44 -> 1.20

  The boxed version is indeed used within the loop, but quite rarely.
  (Example of real-world case: displaying the current value every N
  iterations in a tight Monte-Carlo loop.)

bench3: 1.16 -> 1.20

  The boxed version is always needed, and always different from its
  previous value (worst case!).

bench4: 0.65 -> 0.64

  The boxed version is always needed, but most of the time, it hasn't
  been assigned since the previous time it was needed.

bench5: 1.21 -> 0.67

  The boxed version is always needed, but most of the time, its value
  is the same as the previous one because it has been assigned to the
  same value.

bench6: 1.05 -> 0.65

  Same semantics as bench4, but forcing the update of the variable
  (i.e. an allocation with the old strategy).


Of course, micro-benchmarks don't give much information, but it's good
to see that:

  1. The new strategy is more robust to small changes of code that
     could disable unboxing with the current strategy, and degrade
     performance significantly (bench1, bench2). It's hard to tell
     users to add a "*. 1." at the end of the function!

  2. Even for the worst-case scenario (boxed value is always used,
     with a different value each time), the degradation is rather small
     (bench3).

  3. When the boxed version is always needed, but its content is often
     the same as when previously accessed, results are very good and
     more predictable (bench4, bench5, bench6).

  4. Removing extra boxing can improve performance significantly
     even in loops which do "real" things (reading from memory,
     allocating data structures).


Moreover, the new strategy would allow us to benefit even more from
future optimizations that would eliminate more unboxing (but not
completely), e.g. by a more aggressive inlining, or specialized
calling conventions for float arguments. With the current strategy,
those optimizations would be more fragile.
(0007220)
atavener (reporter)
2012-03-29 00:17

I extracted some samples from my active codebase and turned them into self-contained files...

1. noiz.ml -- This is Perlin Noise, filling a 1024x1024 array
2. ptst.ml -- Particle simulation for 1000 iterations... birth, moving through fields, death

The code is in a very functional style (rather than imperative) and has not been optimized, so I haven't tried to cater to the compiler at all yet. I didn't include any visualization, but if desired I can add something simple using the Graphics module.

My time measurements between OCaml 3.12.0 and "more_unboxing" were:

noiz.ml: 1.5s -> 1.7s
ptst.ml: 2.7s -> 2.3s

The only real float component of noiz.ml is linear interpolation and a simple 5th-order polynomial.

The particle system is much more float heavy, mostly as 3-element vectors (in a record), and the core being the Runge-Kutta4 integrator, which is implemented in a polymorphic way, relying on a provided "multiply-accumulate" function for the type to be integrated.
(0007230)
frisch (developer)
2012-03-29 11:11

I've synchronized the branch with the trunk.

atavener: thanks for posting your code. On my machine (Intel Core i7, 2.80GHz, running in 64-bit mode), I get:

   noiz.ml (switched from 1024x1024 to 4096x4096): 2.57 -> 2.50
   ptst.ml (10000 iterations instead of 1000): 8.21 -> 8.16

As you mention, your code is written in a quite high-level functional style. Changing the final summation in noiz.ml from:

  let sum = Array.(fold_left (fun s arr -> fold_left (+.) s arr) 0. tex) in

to:

  let r = ref 0. in
  for i = 0 to Array.length tex - 1 do
    let a = tex.(i) in
    for j = 0 to Array.length a - 1 do
      r := !r +. a.(j)
    done
  done;
  let sum = !r in

we get a more interesting speedup with the new strategy: 2.55 -> 2.42
(if we completely remove the summation: 2.47 -> 2.40, so the effect of the patch on the imperative version of the summation is huge).
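The two summation styles compared above can be put side by side in a self-contained sketch (the array here is a small illustrative stand-in for the real texture; both styles compute the same total):

```ocaml
(* A small 4x4 "texture" standing in for the real one. *)
let tex = Array.init 4 (fun i -> Array.init 4 (fun j -> float_of_int (i + j)))

(* Functional style: nested folds over the rows. *)
let sum_fold =
  Array.(fold_left (fun s arr -> fold_left (+.) s arr) 0. tex)

(* Imperative style: nested loops accumulating in a float ref,
   which is the form the unboxing patch speeds up. *)
let sum_loop =
  let r = ref 0. in
  for i = 0 to Array.length tex - 1 do
    let a = tex.(i) in
    for j = 0 to Array.length a - 1 do
      r := !r +. a.(j)
    done
  done;
  !r

let () =
  assert (sum_fold = sum_loop);
  Printf.printf "sum = %f\n" sum_fold
```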
(0007245)
Christophe Troestler (reporter)
2012-03-29 17:42
edited on: 2012-03-29 18:19

On my machine (x86_64 GNU/Linux, Intel(R) Core(TM) i7 CPU Q 820 @ 1.73GHz):
bench1: 5.05 -> 2.67
bench2: 1.47 -> 1.1
bench3: 1.07 -> 1.12
bench4: 0.60 -> 0.61
bench5: 1.07 -> 0.60
bench6: 1.12 -> 0.67
noiz: 0.19 -> 0.17
ptst: 0.48 -> 0.44
ptst: 9.89 -> 9.41 (with 10000 iterations)

(0007247)
Christophe Troestler (reporter)
2012-03-29 18:02

nbody: 14.44 -> 13.99 (but variability is greater than the gain)

  Many floating-point computations but not so many references.
  Approximates a real-world case.

(For comparison; gcc -O3 with a similar code runs in 11.66.)
(0007249)
atavener (reporter)
2012-03-29 18:09

Haha, yeah. Might want to bump up those loop-counts, sorry.
For reference, my machine is a little ultraportable (Intel Core 2 Duo SU9300, 1.2 GHz, 32-bit mode).

It's encouraging to see that the results are generally positive! Even 10% on chunks of "normal" code that make use of floats.
(0007250)
frisch (developer)
2012-03-29 20:14

Christophe: I can observe a comparable and quite stable speedup (about -3%) for nbody. (This speedup is probably explained by the fact that the dt argument in "advance" becomes unboxed.)

A more noticeable gain is obtained for its "energy" function. Reducing the example to only calling this function 10000000 times, I get: 1.97 -> 1.83.

If one abstracts the dist2 function:

  let dist2 dx dy dz =
    dx *. dx +. dy *. dy +. dz *. dz

and calls it in the advance function, it doesn't change performance (2.55 -> 2.50) because of inlining. If we compile with "-inline 0", one gets a more noticeable speedup: 3.29 -> 3.07.
(0007251)
Christophe Troestler (reporter)
2012-03-29 22:16

square: 5.58 -> 5.67

  A real-world case consisting of integrating a functional over a triangular mesh.

Requires http://forge.ocamlcore.org/projects/mesh/

For some reason, the program compiled with this branch is consistently slower — but not by much, so this is no big deal.
(0007305)
frisch (developer)
2012-04-10 13:17

Christophe: do you think it would be easy to create a standalone version of square that can be compiled with a stock OCaml distribution? (As far as I can tell, you only use mesh to produce a static data structure. Maybe dumping the bigarray contents statically would be enough; or generating an OCaml source file that reproduces this value.)
(0007347)
Christophe Troestler (reporter)
2012-04-12 22:51

I have uploaded square2.ml. It reads the file square_mesh.bin (created by square_mesh.ml which is also added for completeness). There is a problem however: when compiled with 3.12, square2 runs fine but with this branch it raises Invalid_argument("index out of bounds"). It seems that the content of the bigarrays is not restored properly.
(0007348)
Christophe Troestler (reporter)
2012-04-12 23:23

square3.ml does not use marshalling.

The running times of the program compiled with 3.12 and this branch are about the same. When I see differences, the 3.12 version is slightly faster. Here is such a run:

$ time ./square-3.12 && time ./square-dev
#triangles: 3845, #points: 2007
5.72user 0.00system 0:05.72elapsed 99%CPU (0avgtext+0avgdata 13680maxresident)k
0inputs+0outputs (0major+909minor)pagefaults 0swaps
#triangles: 3845, #points: 2007
5.92user 0.00system 0:05.93elapsed 99%CPU (0avgtext+0avgdata 13536maxresident)k
0inputs+0outputs (0major+902minor)pagefaults 0swaps
(0007366)
frisch (developer)
2012-04-17 11:51

Thanks Christophe!

Here are some timings for square3 on my machine, comparing trunk (before) and more_unboxing (after). In square brackets, I also report the allocated memory (as returned by Gc.allocated_bytes):

 4.96 -> 5.03 [8 GB -> 7.4 GB]

Moving the definition of f inside nonlinearity:

 4.86 -> 4.95 [8 GB -> 7.4 GB]

Same, with -inline 10000:

 4.46 -> 4.40 [615 MB -> 0.457 MB] (better!)


After looking at the generated cmm and assembly code, my guess is that the slight performance degradation comes from the compilation of the function "f". With the new compilation scheme, the parameters are unboxed at the entry of the function, which forces them to be saved on the stack (because under Unix, calling cos/sin can destroy all the xmm registers). One could probably improve this by pushing unboxing down the ulambda-code to the innermost node where the parameter is indeed used (or, as an easier and less general fix: disable unboxing when the parameter is used only once, not under a loop). But I can imagine cases where unboxing early is actually better for the pipeline, so I'm not sure we'd always win.
(0007368)
frisch (developer)
2012-04-17 14:33

I've committed a patch to push unboxing down the tree (without duplicating it).
This reduces the difference between trunk and more_unboxing in the original version of square3. By adding parentheses around (u *. u) in the definition of f, the difference is again tiny, but in the other order (more_unboxing slightly faster than trunk). This is because this rewriting allows pushing the unboxing for "u" further down and avoids the problem with having this unboxing performed before the call to sin/cos.

The new unboxing strategy does not change the performance of square3 much. What this example really shows is that a more aggressive inliner (and/or a dedicated calling convention for functions on floats) could allow more modular numerical code without degrading performance.
(0007370)
Christophe Troestler (reporter)
2012-04-17 19:48

> What this example really shows is that a more aggressive inliner (and/or a dedicated calling convention for functions on floats) could allow more modular numerical code without degrading performance.

That would be really nice indeed!
(0007371)
Christophe Troestler (reporter)
2012-04-17 19:56

To tackle the problem with unmarshalling, I uploaded square5. With 3.12 everything is fine. With this branch, I get the message “tr incorrect when i = 4” which means that the unmarshalling is done right but that somehow [tr] gets modified along the computations... This is rather strange (maybe a separate bug report needs to be opened?).
(0007374)
frisch (developer)
2012-04-18 11:27
edited on: 2012-04-18 11:30

Christophe: I can reproduce the bug with square5. I'll look at it.

Smaller repro case:

==========
open Bigarray

let () =
  let tr0 = Array2.of_array int fortran_layout [| [| 10 |] |] in
  Printf.printf "#triangles: %i\n%!" (Array2.dim2 tr0);
  let open Marshal in
  let tr = from_string (to_string tr0 []) 0 in
  Gc.compact ();
  Printf.printf "#triangles: %i\n%!" (Array2.dim2 tr);
  assert(tr = tr0)
==========

(0007375)
frisch (developer)
2012-04-18 13:18
edited on: 2012-04-18 13:20

Mmmh, the problem disappeared after a full "make clean world opt opt.opt install". I assume we used a compilation command which refers to a different version of bigarray.a. Christophe: can you confirm?

(0007385)
Christophe Troestler (reporter)
2012-04-22 19:20

I confirm, but the bug disappeared only after I updated to today's version of the branch. So this was not a mere recompilation problem.
(0008564)
frisch (developer)
2012-12-04 18:50

I will need to confirm this with a more serious methodology, but it seems the patch produces an interesting speedup over our entire unit test suite (around 6%, with some tests allocating half as much as before).

- Issue History
Date Modified Username Field Change
2011-01-14 15:23 frisch New Issue
2011-01-17 13:44 frisch Note Added: 0005775
2011-01-17 13:50 frisch Note Added: 0005776
2011-01-18 15:43 Christophe Troestler Note Added: 0005780
2011-01-18 16:55 Christophe Troestler Note Added: 0005781
2011-01-18 18:04 frisch Note Added: 0005782
2011-01-18 23:42 Christophe Troestler Note Added: 0005783
2011-05-17 16:47 doligez Status new => acknowledged
2012-03-28 18:20 frisch File Added: bench1.ml
2012-03-28 18:20 frisch File Added: bench2.ml
2012-03-28 18:20 frisch File Added: bench3.ml
2012-03-28 18:21 frisch File Added: bench4.ml
2012-03-28 18:21 frisch File Added: bench5.ml
2012-03-28 18:21 frisch File Added: bench6.ml
2012-03-28 18:57 frisch Note Added: 0007218
2012-03-29 00:17 atavener Note Added: 0007220
2012-03-29 00:18 atavener File Added: noiz.ml
2012-03-29 00:18 atavener File Added: ptst.ml
2012-03-29 11:11 frisch Note Added: 0007230
2012-03-29 17:42 Christophe Troestler Note Added: 0007245
2012-03-29 17:43 Christophe Troestler Note Edited: 0007245
2012-03-29 17:52 Christophe Troestler File Added: nbody.ml
2012-03-29 18:02 Christophe Troestler Note Added: 0007247
2012-03-29 18:09 atavener Note Added: 0007249
2012-03-29 18:19 Christophe Troestler Note Edited: 0007245
2012-03-29 20:14 frisch Note Added: 0007250
2012-03-29 21:58 Christophe Troestler File Added: square.ml
2012-03-29 22:16 Christophe Troestler Note Added: 0007251
2012-04-10 13:17 frisch Note Added: 0007305
2012-04-12 19:47 frisch Relationship added parent of 0005133
2012-04-12 22:43 Christophe Troestler File Added: square_mesh.ml
2012-04-12 22:43 Christophe Troestler File Added: square_mesh.bin
2012-04-12 22:45 Christophe Troestler File Added: square2.ml
2012-04-12 22:51 Christophe Troestler Note Added: 0007347
2012-04-12 23:20 Christophe Troestler File Added: square3.ml
2012-04-12 23:23 Christophe Troestler Note Added: 0007348
2012-04-17 11:51 frisch Note Added: 0007366
2012-04-17 14:33 frisch Note Added: 0007368
2012-04-17 19:48 Christophe Troestler Note Added: 0007370
2012-04-17 19:49 Christophe Troestler File Added: square4.ml
2012-04-17 19:53 Christophe Troestler File Added: square5.ml
2012-04-17 19:56 Christophe Troestler Note Added: 0007371
2012-04-18 11:27 frisch Note Added: 0007374
2012-04-18 11:30 frisch Note Edited: 0007374
2012-04-18 13:18 frisch Note Added: 0007375
2012-04-18 13:20 frisch Note Edited: 0007375
2012-04-22 19:20 Christophe Troestler Note Added: 0007385
2012-06-20 11:19 frisch Category OCaml general => OCaml backend (code generation)
2012-07-04 16:57 doligez Severity minor => feature
2012-12-04 18:50 frisch Note Added: 0008564

