
Unboxing parameters and more kinds of local bindings #5204

Closed · vicuna opened this issue Jan 14, 2011 · 30 comments

Comments

@vicuna

vicuna commented Jan 14, 2011

Original bug ID: 5204
Reporter: @alainfrisch
Status: acknowledged (set by @damiendoligez on 2011-05-17T14:47:01Z)
Resolution: open
Priority: normal
Severity: feature
Category: back end (clambda to assembly)
Parent of: #5133
Monitored by: ronan.lehy@gmail.com mehdi dario @ygrek till @jmeber "Julien Signoles" @hcarty @Chris00

Bug description

Unboxing for floats and boxed integers is currently tried when an identifier is explicitly bound by a let/match construction but not by a lambda.

For instance, x and y are unboxed in the loop below:

let unbox x y =
  let x = x +. 0. in
  let y = y +. 0. in
  let r = ref 0. in
  for i = 0 to n do
    r := !r +. x *. y
  done

but not if the "+. 0." are removed (because the lets are then simplified away).

A related problem is that the decision to unbox a local binding "let x = E1 in E2" is made only by looking at E1. For instance, if E1 is an access to a float field of a record, unboxing is tried, but not if the record field is not statically a float.

So x and y are unboxed in:

type r = {x: float; y: float}

let unbox r =
  let x = r.x in
  let y = r.y in
  ...

but not if we define the type as:

type 'a r = {x: 'a; y:'a}

It would be better to try unboxing as soon as the bound expression is a float (or a boxed integer). Type information is not available, but one could find out quite easily if unboxing should be tried by looking at how the identifier is used in the body of the local binding / function.
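To make the proposal concrete, here is a sketch (a hypothetical function, not from the report) of code where a usage-based analysis would help: with the polymorphic record above, the field reads are not statically floats, yet the way x and y are consumed shows that unboxing is worth trying.

let norm2 (r : float r) =
  (* x and y appear only under ( *. ); a usage-based analysis could
     flag them as unboxing candidates even though the field type is 'a *)
  let x = r.x in
  let y = r.y in
  x *. x +. y *. y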

Or do people see cases where allowing unboxing in more cases is problematic?


@vicuna

vicuna commented Jan 17, 2011

Comment author: @alainfrisch

I've created a branch more_unboxing in the OCaml SVN. It changes the unboxing strategy: instead of deciding whether an identifier can be unboxed by looking at the right-hand side of its definition, it considers any identifier that is actually used in an unboxed float context as a candidate for unboxing. Function parameters can thus be unboxed as well.

This strategy is a little fragile, and it might break in the presence of GADTs. It would be better to propagate type information further down in the compiler to allow a cleaner strategy.
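As an aside, here is a sketch of the fragility alluded to (made-up types, written in the later GADT syntax, not from the branch): the same parameter can sit in a float context in one branch and a non-float context in another, so "used in an unboxed float context" is not by itself a sound criterion.

type _ witness = Float : float witness | Int : int witness

let bump : type a. a witness -> a -> a = fun w x ->
  match w with
  | Float -> x +. 1.0  (* x used in a float context here... *)
  | Int -> x + 1       (* ...but in an int context here *)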

@vicuna

vicuna commented Jan 17, 2011

Comment author: @alainfrisch

The code below runs in 1.75 seconds on the branch vs. 2 seconds on the trunk:

let f x y =
  let r = ref 0. in
  for j = 1 to 100 do
    for i = 1 to 10000000 do
      r := !r +. x *. y
    done
  done

let () =
  f 10. 20.

@vicuna

vicuna commented Jan 18, 2011

Comment author: @Chris00

On my machine, it runs in 2s (more_unboxing) vs. 2.32s (3.12 branch), so a similar kind of speedup.

@vicuna

vicuna commented Jan 18, 2011

Comment author: @Chris00

It is even more impressive on a program such as the one below, which shows that, with such a patch, one can factorize computations while losing very little performance. It runs in 2.01s (more_unboxing) vs. 3.45s (3.12 branch). (The use of -inline does not really make any difference.)

let g x y =
  let r = ref 0. in
  for i = 1 to 10000000 do
    r := !r +. x *. y
  done;
  !r

let f x y =
  let r = ref 0. in
  for j = 1 to 100 do
    r := !r +. g x y
  done

let () =
  f 10. 20.

@vicuna

vicuna commented Jan 18, 2011

Comment author: @alainfrisch

Christophe, thanks for your feedback on the branch!

I'm not so surprised that -inline does not make any difference. It would probably make a bigger difference if you swap the number of iterations in g and f (so that there are many function calls and boxing of floats).
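For reference, the swapped variant suggested here (derived directly from the program above) would be:

let g x y =
  let r = ref 0. in
  for i = 1 to 100 do
    r := !r +. x *. y
  done;
  !r

let f x y =
  let r = ref 0. in
  for j = 1 to 10000000 do
    r := !r +. g x y  (* ten million calls, each boxing g's result *)
  done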

What I don't really understand is why this last example is so much slower than the previous one with 3.12.

@vicuna

vicuna commented Jan 18, 2011

Comment author: @Chris00

It would probably make a bigger difference if you swap the number of iterations in g and f (so that there are many function calls and boxing of floats).

Strangely enough, that is not the case.

@vicuna

vicuna commented Mar 28, 2012

Comment author: @alainfrisch

The current branch implements the following strategy (always unbox, and box lazily):

  • All float local variables used somewhere in an unboxed context are
    kept in unboxed form.

  • If the variable is used in a boxed context (e.g. stored somewhere,
    or passed to another function), one also keeps a local boxed
    variable.

  • The local boxed variable is not updated when the variable is
    modified.

  • When accessed, if the variable is mutated somewhere in the
    function, the local boxed variable is refreshed: if its content
    differs from the current unboxed variable, we box the unboxed
    value and store the result in the boxed variable.

In general, this reduces allocation as much as possible: floats are
boxed only when needed, and at most once between two mutations of the
variable. And as long as the boxed version is not needed, there is no
boxing at all. The bad case is when the boxed version is needed quite
often, with different values. In that case, the current strategy
introduces a small overhead (comparing the content of the boxed and
unboxed versions). (We could remove this overhead in cases where we
know syntactically whether the variable has been modified since it
was last accessed in boxed form.) Moreover, delaying boxing until it
is needed might introduce some latency.
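A source-level model of this box-lazily strategy may help (a sketch for intuition only; the real transformation happens on the compiler's intermediate representation, and the names here are made up). A one-element float array stands in for a boxed float, since both are heap-allocated:

let lazy_box_model () =
  let x_unboxed = ref 0.0 in             (* the unboxed working copy *)
  let x_boxed = ref [| !x_unboxed |] in  (* the cached boxed copy *)
  let get_boxed () =
    (* refresh only if the cache is stale w.r.t. the unboxed value *)
    if (!x_boxed).(0) <> !x_unboxed then x_boxed := [| !x_unboxed |];
    !x_boxed
  in
  x_unboxed := 3.14;  (* a mutation does not touch the boxed copy *)
  (get_boxed ()).(0)  (* first boxed access re-boxes; later ones reuse it *)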

I've uploaded a series of micro-benchmarks to illustrate different cases:

bench1: 4.80 -> 2.60

A typical case where a local variable should really be unboxed, but
it is not, because it is used (once) in boxed form. Note that we
could obtain the same speedup by manually adding some "+. 0." or
"*. 1." around the boxing occurrence (but we cannot tell that to real
users).

bench2: 1.44 -> 1.20

The boxed version is indeed used within the loop, but quite rarely.
(Example of a real-world case: displaying the current value every N
iterations in a tight Monte-Carlo loop; see the sketch after this
list.)

bench3: 1.16 -> 1.20

The boxed version is always needed, and always different from its
previous value (worst case!).

bench4: 0.65 -> 0.64

The boxed version is always needed, but most of the time, it hasn't
been assigned since the previous time it was needed.

bench5: 1.21 -> 0.67

The boxed version is always needed, but most of the time, its value
is the same as the previous one because it has been assigned the
same value.

bench6: 1.05 -> 0.65

Same semantics as bench4, but forcing the update of the variable
(i.e. an allocation with the old strategy).
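Here is the promised sketch of the bench2 pattern (hypothetical code, not the actual benchmark): the accumulator lives unboxed in the hot loop, and a boxed copy is only needed on the rare printing iterations.

let monte_carlo n =
  let acc = ref 0.0 in
  for i = 1 to 10_000_000 do
    acc := !acc +. Random.float 1.0;   (* hot, unboxed arithmetic *)
    if i mod n = 0 then
      Printf.printf "%d: %f\n" i !acc  (* rare use needing the boxed value *)
  done;
  !acc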

Of course, micro-benchmarks don't give much information, but it's good
to see that:

  1. The new strategy is more robust to small changes of code that
    could disable unboxing with the current strategy and degrade
    performance significantly (bench1, bench2). It's hard to tell
    users to add a "*. 1." at the end of the function!

  2. Even for the worst-case scenario (the boxed value is always used,
    with a different value each time), the degradation is rather small
    (bench3).

  3. When the boxed version is always needed, but its content is often
    the same as when previously accessed, results are very good and
    more predictable (bench4, bench5, bench6).

  4. Removing extra boxing can improve performance significantly
    even in loops which do "real" things (reading from memory,
    allocating data structures).

Moreover, the new strategy would let us benefit even more from
future optimizations that eliminate more boxing (though not all of
it), e.g. more aggressive inlining, or specialized calling
conventions for float arguments. With the current strategy, those
optimizations would be more fragile.

@vicuna

vicuna commented Mar 28, 2012

Comment author: atavener

I extracted some samples from my active codebase and turned them into self-contained files...

  1. noiz.ml -- This is Perlin Noise, filling a 1024x1024 array
  2. ptst.ml -- Particle simulation for 1000 iterations... birth, moving through fields, death

The code is in a very functional style (rather than imperative) and has not been optimized, so I haven't tried to cater to the compiler at all yet. I didn't include any visualization, but if desired I can add something simple using the Graphics module.

My time measurements between OCaml 3.12.0 and "more_unboxing" were:

noiz.ml: 1.5s -> 1.7s
ptst.ml: 2.7s -> 2.3s

The only real float component of noiz.ml is linear interpolation and a simple 5th-order polynomial.

The particle system is much more float heavy, mostly as 3-element vectors (in a record), and the core being the Runge-Kutta4 integrator, which is implemented in a polymorphic way, relying on a provided "multiply-accumulate" function for the type to be integrated.

@vicuna

vicuna commented Mar 29, 2012

Comment author: @alainfrisch

I've synchronized the branch with the trunk.

atavener: thanks for posting your code. On my machine (Intel Core i7, 2.80GHz, running in 64-bit mode), I get:

noiz.ml (switched from 1024x1024 to 4096x4096): 2.57 -> 2.50
ptst.ml (10000 iterations instead of 1000): 8.21 -> 8.16

As you mention, your code is written in a quite high-level functional style. Changing the final summation in noiz.ml from:

let sum = Array.(fold_left (fun s arr -> fold_left (+.) s arr) 0. tex) in

to:

let r = ref 0. in
for i = 0 to Array.length tex - 1 do
  let a = tex.(i) in
  for j = 0 to Array.length a - 1 do
    r := !r +. a.(j)
  done
done;
let sum = !r in

we get a more interesting speedup with the new strategy: 2.55 -> 2.42
(if we completely remove the summation: 2.47 -> 2.40, so the effect of the patch on the imperative version of the summation is huge).

@vicuna

vicuna commented Mar 29, 2012

Comment author: @Chris00

On my machine (x86_64 GNU/Linux, Intel(R) Core(TM) i7 CPU Q 820 @ 1.73GHz):
bench1: 5.05 -> 2.67
bench2: 1.47 -> 1.1
bench3: 1.07 -> 1.12
bench4: 0.60 -> 0.61
bench5: 1.07 -> 0.60
bench6: 1.12 -> 0.67
noiz: 0.19 -> 0.17
ptst: 0.48 -> 0.44
ptst: 9.89 -> 9.41 (with 10000 iterations)

@vicuna

vicuna commented Mar 29, 2012

Comment author: @Chris00

nbody: 14.44 -> 13.99 (but the variability is greater than the gain)

Many floating-point computations but not so many references; it approximates a real-world case.

(For comparison: gcc -O3 with similar code runs in 11.66s.)

@vicuna

vicuna commented Mar 29, 2012

Comment author: atavener

Haha, yeah. Might want to bump up those loop counts, sorry.
For reference, my machine is a little ultraportable (Intel Core2 Duo SU9300, 1.2GHz, 32-bit mode).

It's encouraging to see that the results are generally positive! Even 10% on chunks of "normal" code that make use of floats.

@vicuna

vicuna commented Mar 29, 2012

Comment author: @alainfrisch

Christophe: I can observe a comparable and quite stable speedup (about -3%) for nbody. (This speedup is probably explained by the fact that the dt argument in "advance" becomes unboxed.)

A more noticeable gain is obtained for its "energy" function. Reducing the example to only calling this function 10000000 times, I get: 1.97 -> 1.83.

If one abstracts the dist2 function:

let dist2 dx dy dz =
  dx *. dx +. dy *. dy +. dz *. dz

and calls it in the advance function, performance doesn't change (2.55 -> 2.50) because the call is inlined. Compiling with "-inline 0" gives a more noticeable speedup: 3.29 -> 3.07.

@vicuna

vicuna commented Mar 29, 2012

Comment author: @Chris00

square: 5.58 -> 5.67

A real-world case consisting of integrating a functional over a triangular mesh.

Requires http://forge.ocamlcore.org/projects/mesh/

For some reason, the program compiled with this branch is consistently slower, but not by much, so this is no big deal.

@vicuna

vicuna commented Apr 10, 2012

Comment author: @alainfrisch

Christophe: do you think it would be easy to create a standalone version of square that can be compiled with a stock OCaml distribution? (As far as I can tell, you only use mesh to produce a static data structure. Maybe just dumping some bigarrays with output_value might be enough; or generating an OCaml source file that reproduces this value.)

@vicuna

vicuna commented Apr 12, 2012

Comment author: @Chris00

I have uploaded square2.ml. It reads the file square_mesh.bin (created by square_mesh.ml, which is also added for completeness). There is a problem, however: when compiled with 3.12, square2 runs fine, but with this branch it raises Invalid_argument("index out of bounds"). It seems that the content of the bigarrays is not restored properly.

@vicuna

vicuna commented Apr 12, 2012

Comment author: @Chris00

square3.ml does not use marshalling.

The running times of the program compiled with 3.12 and this branch are about the same. When I see differences, the 3.12 version is slightly faster. Here is such a run:

$ time ./square-3.12 && time ./square-dev
#triangles: 3845, #points: 2007
5.72user 0.00system 0:05.72elapsed 99%CPU (0avgtext+0avgdata 13680maxresident)k
0inputs+0outputs (0major+909minor)pagefaults 0swaps
#triangles: 3845, #points: 2007
5.92user 0.00system 0:05.93elapsed 99%CPU (0avgtext+0avgdata 13536maxresident)k
0inputs+0outputs (0major+902minor)pagefaults 0swaps

@vicuna

vicuna commented Apr 17, 2012

Comment author: @alainfrisch

Thanks Christophe!

Here are some timings for square3 on my machine, comparing trunk (before) and more_unboxing (after). In square brackets, I also report the allocated memory (as returned by Gc.allocated_bytes):

4.96 -> 5.03 [8 GB -> 7.4 GB]

Moving the definition of f inside nonlinearity:

4.86 -> 4.95 [8 GB -> 7.4 GB]

Same, with -inline 10000:

4.46 -> 4.40 [615 MB -> 0.457 MB] (better!)

After looking at the generated cmm and assembly code, my guess is that the slight performance degradation comes from the compilation of the function "f". With the new compilation scheme, the parameters are unboxed at the entry of the function, which forces them to be saved on the stack (because, under Unix, calling cos/sin can destroy all the xmm registers). One could probably improve this by pushing the unboxing down the ulambda code to the innermost node where the parameter is actually used (or, as an easier and less general fix: disable unboxing when the parameter is used only once, not under a loop). But I can imagine cases where unboxing early is actually better for the pipeline, so I'm not sure we'd always win.
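A hypothetical shape of the problem just described (illustrative, not the actual square3 code): if u is unboxed at function entry, its unboxed value must be kept alive across the sin/cos calls, forcing a stack spill, even though u is not needed until after those calls.

let f u v =
  let s = sin v in
  let c = cos v in
  (* u's first real use is only here; pushing the unboxing down to
     this point avoids spilling it across the calls above *)
  s *. c +. u *. u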

@vicuna

vicuna commented Apr 17, 2012

Comment author: @alainfrisch

I've committed a patch to push unboxing down the tree (without duplicating it).
This reduces the difference between trunk and more_unboxing on the original version of square3. Adding parentheses around (u *. u) in the definition of f makes the difference tiny again, but in the other order (more_unboxing slightly faster than trunk): this rewriting allows pushing the unboxing for "u" further down, avoiding the unboxing that was performed before the calls to sin/cos.

The new unboxing strategy does not change the performance of square3 much. What this example really shows is that a more aggressive inliner (and/or a dedicated calling convention for functions on floats) could allow more modular numerical code without degrading performance.

@vicuna

vicuna commented Apr 17, 2012

Comment author: @Chris00

What this example really shows is that a more aggressive inliner (and/or a dedicated calling convention for functions on floats) could allow more modular numerical code without degrading performance.

That would be really nice indeed!

@vicuna

vicuna commented Apr 17, 2012

Comment author: @Chris00

To tackle the problem with unmarshalling, I uploaded square5. With 3.12 everything is fine. With this branch, I get the message "tr incorrect when i = 4", which means that the unmarshalling is done right but that somehow [tr] gets modified during the computations... This is rather strange (maybe a separate bug report needs to be opened?).

@vicuna

vicuna commented Apr 18, 2012

Comment author: @alainfrisch

Christophe: I can reproduce the bug with square5. I'll look at it.

Smaller repro case:

==========
open Bigarray

let () =
  let tr0 = Array2.of_array int fortran_layout [| [| 10 |] |] in
  Printf.printf "#triangles: %i\n%!" (Array2.dim2 tr0);
  let open Marshal in
  let tr = from_string (to_string tr0 []) 0 in
  Gc.compact ();
  Printf.printf "#triangles: %i\n%!" (Array2.dim2 tr);
  assert (tr = tr0)

@vicuna

vicuna commented Apr 18, 2012

Comment author: @alainfrisch

Mmmh, the problem disappeared after a full "make clean world opt opt.opt install". I assume we used a compilation command that referred to a different version of bigarray.a. Christophe: can you confirm?

@vicuna

vicuna commented Apr 22, 2012

Comment author: @Chris00

I confirm, but the bug disappeared only after I updated to today's version of the branch, so this was not a mere recompilation problem.

@vicuna

vicuna commented Dec 4, 2012

Comment author: @alainfrisch

I will need to confirm this with a more serious methodology, but it seems the patch produces an interesting speedup across our entire unit test suite (around 6%, with some tests allocating half as much as before).

@vicuna

vicuna commented Jul 12, 2016

Comment author: @gasche

What is the status of this PR? Has it been completely subsumed by the now-merged #336?

@vicuna

vicuna commented Jul 12, 2016

Comment author: @alainfrisch

One should rerun the various benchmarks provided here to check that their boxing behavior is as expected.

But globally, the new scheme will never unbox a function parameter. So in

let unbox x y =
  let r = ref 0. in
  for i = 0 to n do
    r := !r +. x *. y
  done

accessing the value for x or y in the loop will result in a memory read every time. The new strategy is more about removing costly "boxing" than removing repeated "unboxing".

Btw, people interested in this topic can read http://www.lexifi.com/blog/unboxed-floats-ocaml, which summarizes the recent evolutions of that strategy.

@github-actions

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

@github-actions github-actions bot added the Stale label Apr 26, 2021
@gasche gasche removed the Stale label Apr 26, 2021
@gasche

gasche commented Apr 26, 2021

I wrote a summary of the work on parameter unboxing at #5894 (comment)

