| Anonymous | Login | Signup for a new account | 2013-05-25 18:09 CEST | ![]() |
| Main | My View | View Issues | Change Log | Roadmap |
| View Issue Details [ Jump to Notes ] | [ Issue History ] [ Print ] | ||||||||||
| ID | Project | Category | View Status | Date Submitted | Last Update | ||||||
| 0005204 | OCaml | OCaml backend (code generation) | public | 2011-01-14 15:23 | 2012-12-04 18:50 | ||||||
| Reporter | frisch | ||||||||||
| Assigned To | |||||||||||
| Priority | normal | Severity | feature | Reproducibility | have not tried | ||||||
| Status | acknowledged | Resolution | open | ||||||||
| Platform | OS | OS Version | |||||||||
| Product Version | |||||||||||
| Target Version | Fixed in Version | ||||||||||
| Summary | 0005204: Unboxing parameters and more kinds of local bindings | ||||||||||
| Description | Unboxing for floats and boxed integers is currently tried when an identifier is explicitly bound by a let/match construction but not by a lambda. For instance, x and y are unboxed in the loop below: let unbox x y = let x = x +. 0. in let y = y +. 0. in let r = ref 0. in for i = 0 to n do r := !r +. x *. y done but not if the "+ 0." are removed (because the let are then simplied). A related problem is that the decision of unboxing a local binding "let x = E1 in E2" is done only by looking at E1. For instance, if E1 is an access to a float field of a record, unboxing is tried, but not if the record field is not statically a float. So x and y are unboxed in: type r = {x: float; y:float} let unbox r = let x = r.x in let y = r.y in ... but not if we define the type as: type 'a r = {x: 'a; y:'a} It would be better to try unboxing as soon as the bound expression is a float (or a boxed integer). Type information is not available, but one could find out quite easily if unboxing should be tried by looking at how the identifier is used in the body of the local binding / function. Or do people see cases where allowing unboxing in more cases is problematic? | ||||||||||
| Tags | No tags attached. | ||||||||||
| Attached Files | |||||||||||
Relationships |
|||||||||||
|
|||||||||||
Notes |
|
|
(0005775) frisch (developer) 2011-01-17 13:44 |
I've created a branch more_unboxing in the OCaml SVN. It changes the unboxing strategy. Instead of deciding whether an identifier can be unboxed by looking at the right-hand side of its definition, it considers any identifier which is really used in a unboxed float context as a candidate for unboxing. Function parameters can thus be unboxed as well. This strategy is a little bit fragile, and it might break in presence of GADTs. It would be better to propagate type information further down in the compiler to allow a cleaner strategy. |
|
(0005776) frisch (developer) 2011-01-17 13:50 |
The code below runs in 1.75 seconds on the branch vs. 2 seconds on the trunk: let f x y = let r = ref 0. in for j = 1 to 100 do for i = 1 to 10000000 do r := !r +. x *. y done done let () = f 10. 20. |
|
(0005780) Christophe Troestler (reporter) 2011-01-18 15:43 |
On my machine, it runs in 2s (more_unboxing) v.s. 2.32s (3.12 branch), so a similar kind of speedup. |
|
(0005781) Christophe Troestler (reporter) 2011-01-18 16:55 |
It is even more impressive on a program such as the one below — which shows that, with such a patch, one can factorize computations while loosing very little performance. It indeed runs in 2.01s (more_unboxing) v.s. 3.45 (3.12 branch). (The use of -inline does not really make any difference.) let g x y = let r = ref 0. in for i = 1 to 10000000 do r := !r +. x *. y done; !r let f x y = let r = ref 0. in for j = 1 to 100 do r := !r +. g x y done let () = f 10. 20. |
|
(0005782) frisch (developer) 2011-01-18 18:04 |
Christophe, thanks for your feedback on the branch! I'm not so surprised that -inline does not make any difference. It would probably make a bigger difference if you swap the number of iterations in g and f (so that there are many function calls and boxing of floats). What I don't really understand is why this last example is so much slower than the previous one with 3.12. |
|
(0005783) Christophe Troestler (reporter) 2011-01-18 23:42 |
> It would probably make a bigger difference if you swap the number of > iterations in g and f (so that there are many function calls and boxing of > floats). Strangely enough, it is not the case. |
|
(0007218) frisch (developer) 2012-03-28 18:57 |
The current branch follows the following strategy (always unbox, and box lazily): * All float local variables used somewhere in an unboxed context are kept in unboxed form. * If the variable is used in a boxed context (e.g. stored somewhere, or passed to another function), one also keeps a local boxed variable. * The local boxed variable is not updated when the variable is modified. * When accessed, if the variable is mutated somewhere in the function, the local boxed variable is refreshed: if it's content is different from the current unboxed variable, we box the unboxed variable and store the result in the boxed variable. In general, this reduces allocation as much as possible: floats are boxed only when needed, and at most once between two mutations of the variable. And as long as the boxed version is not needed, there is no boxing at all. The bad case is when the boxed version is needed quite often, with different values. In that case, the current strategy introduced a small overhead (comparing the content of the boxed and unboxed versions). (We could remove this overhead in cases where we know syntactically whether the variable has been modified or not since it was last accessed in boxed form.) Moreover, delaying boxing until it is needed might introduce some latency. I've uploaded a series of micro-benchmarks to illustrate different cases: bench1: 4.80 -> 2.60 A typical case where a local variable should really be unboxed, but it is not, because it is used (once) in unboxed form. Note that we could obtain the same speedup by manually adding some "+. 0." or "*. 1." around the boxing occurence (but we cannot tell that to real users). bench2: 1.44 -> 1.20 The boxed version is indeed used within the loop, but quite rarely. (Example of real-world case: displaying the current value every N iterations in a tight Monte-Carlo loop.) bench3: 1.16 -> 1.20 The boxed version is always needed, and always different from its previous value (worst case!). bench4: 0.65 -> 0.64 The boxed version is always needed, but most of the time, it hasn't been assigned since the previous time it was needed. bench5: 1.21 -> 0.67 The boxed version is always needed, but most of the time, it's value is the same as the previous one because it has been assigned to the same value. bench6: 1.05 -> 0.65 Same semantics as bench4, but forcing the update of the variable (i.e. an allocation with the old strategy). Of course, micro-benchmarks don't give much information, but it's good to see that: 1. The new strategy is more robust to small changes of code that could disable unboxing with the current strategy, and degrade performance significantly (bench1, bench2). It's hard to tell users to addi a "*. 1." at the end of the function! 2. Even for the worst-case scenario (boxed value is always used, with a different value each time), the degration is rather small (bench3). 3. When the boxed version is always needed, but its content is often the same as when previously accessed, results are very good and more predictible (bench 4,5,6). 4. Removing extra boxing can improve performance significantly even in loops which do "real" things (reading from memory, allocating data structures). Moreover, the new strategy would allow us to benefit even more from future optimizations that would eliminate more unboxing (but not completely), e.g. by a more aggressive inlining, or specialized calling conventions for float arguments. With the current strategy, those optimizations would be more fragile. |
|
(0007220) atavener (reporter) 2012-03-29 00:17 |
I extracted some samples from my active codebase and turned them into self-contained files... 1. noiz.ml -- This is Perlin Noise, filling a 1024x1024 array 2. ptst.ml -- Particle simulation for 1000 iterations... birth, moving through fields, death The code is in a very functional style (rather than imperative), and has not been optimized. So I haven't tried to cater to the compiler at all yet. I didn't include any visualization, but if desired I can add something simple using Graphics module. My time measurements between OCaml 3.12.0 and "more_unboxing" were: noiz.ml: 1.5s -> 1.7s ptst.ml: 2.7s -> 2.3s The only real float component of noiz.ml is linear interpolation and a simple 5th-order polynomial. The particle system is much more float heavy, mostly as 3-element vectors (in a record), and the core being the Runge-Kutta4 integrator, which is implemented in a polymorphic way, relying on a provided "multiply-accumulate" function for the type to be integrated. |
|
(0007230) frisch (developer) 2012-03-29 11:11 |
I've synchronized the branch with the trunk. atavener: thanks for posting your code. On my machine (Intel Core i7, 2.80GHz, running in 64-bit mode), I get: noiz.ml (switched from 1024x1024 to 4096x4096): 2.57 -> 2.50 ptst.ml (10000 iterations instead of 1000): 8.21 -> 8.16 As you mention, your code is written in a quite high-level functional style. Changing the final summation in noiz.ml from: let sum = Array.(fold_left (fun s arr -> fold_left (+.) s arr) 0. tex) in to: let r = ref 0. in for i = 0 to Array.length tex - 1 do let a = tex.(i) in for j = 0 to Array.length a - 1 do r := !r +. a.(j) done done; let sum = !r in we get a more interesting speedup with the new strategy: 2.55 -> 2.42 (if we completely remove the summation: 2.47 -> 2.40, so the effect of the patch on the imperative version of the summation is huge). |
|
(0007245) Christophe Troestler (reporter) 2012-03-29 17:42 edited on: 2012-03-29 18:19 |
On my machine (x86_64 GNU/Linux, Intel(R) Core(TM) i7 CPU Q 820 @ 1.73GHz): bench1: 5.05 -> 2.67 bench2: 1.47 -> 1.1 bench3: 1.07 -> 1.12 bench4: 0.60 -> 0.61 bench5: 1.07 -> 0.60 bench6: 1.12 -> 0.67 noiz: 0.19 -> 0.17 ptst: 0.48 -> 0.44 ptst: 9.89 -> 9.41 (with 10000 iterations) |
|
(0007247) Christophe Troestler (reporter) 2012-03-29 18:02 |
nbody: 14.44 -> 13.99 (but variability is greater than the gain) Many floating point computations but not so many references. Approximate a real world case. (For comparison; gcc -O3 with a similar code runs in 11.66.) |
|
(0007249) atavener (reporter) 2012-03-29 18:09 |
Haha, yeah. Might want to bump up those loop-counts, sorry. For reference, my machine is a little ultraportable (Intel Core2 Duo SU9300, 1.2Ghz, 32bit mode). It's encouraging to see that the results are generally positive! Even 10% on chunks of "normal" code which makes use of floats. |
|
(0007250) frisch (developer) 2012-03-29 20:14 |
Christophe: I can observe a comparable and quite stable speedup (about -3%) for nbody. (This speedup is probably explained by the fact that the dt argument in "advance" becomes unboxed.) A more noticeable gain is obtained for its "energy" function. Reducing the example to only calling this function 10000000 times, I get: 1.97 -> 1.83. If one abstracts the dist2 function: let dist2 dx dy dz = dx *. dx +. dy *. dy +. dz *. dz and calls it in the advance function, it doesn't change performance (2.55 -> 2.50) because of inlining. If we compile with "-inline 0", one gets a more noticeable speedup: 3.29 -> 3.07. |
|
(0007251) Christophe Troestler (reporter) 2012-03-29 22:16 |
square: 5.58 -> 5.67 Real world case consisting in integrating a functional over a triangular mesh. Requires http://forge.ocamlcore.org/projects/mesh/ [^] For some reason, the program compiled with this branch is consistently slower — but not by much, so this is no big deal. |
|
(0007305) frisch (developer) 2012-04-10 13:17 |
Christophe: do you think it would be easy to create a standalone version of square that can be compiled with a stock OCaml distribution? (As far as I can tell, you only use mesh to produce a static data structure. Maybe just output-valuing some bigarrays statically might be enough; or generating an OCaml source to reproduce this value.) |
|
(0007347) Christophe Troestler (reporter) 2012-04-12 22:51 |
I have uploaded square2.ml. It reads the file square_mesh.bin (created by square_mesh.ml which is also added for completeness). There is a problem however: when compiled with 3.12, square2 runs fine but with this branch it raises Invalid_argument("index out of bounds"). It seems that the content of the bigarrays is not restored properly. |
|
(0007348) Christophe Troestler (reporter) 2012-04-12 23:23 |
square3.ml does not use marshalling. The running times of the program compiled with 3.12 and this branch are about the same. When I see differences, the 3.12 version is slightly faster. Here is such a run: $ time ./square-3.12 && time ./square-dev #triangles: 3845, #points: 2007 5.72user 0.00system 0:05.72elapsed 99%CPU (0avgtext+0avgdata 13680maxresident)k 0inputs+0outputs (0major+909minor)pagefaults 0swaps #triangles: 3845, #points: 2007 5.92user 0.00system 0:05.93elapsed 99%CPU (0avgtext+0avgdata 13536maxresident)k 0inputs+0outputs (0major+902minor)pagefaults 0swaps |
|
(0007366) frisch (developer) 2012-04-17 11:51 |
Thanks Christophe! Here are some timings for square3 on my machine, comparing between trunk (before) and more_unboxing (after). In square brackets, I also report the allocated memory (returned by Gc.allocated_byte): 4.96 -> 5.03 [8 Gb -> 7.4 Gb] Moving the definition of f inside nonlinearity: 4.86 -> 4.95 [8 Gb -> 7.4 Gb] Same, with -inline 10000: 4.46 -> 4.40 [615 Mb -> 0.457 Mb] (better!) After looking at the generated cmm and assembly code, my guess is that the slight performance degradation comes from the compilation of the function "f". With the new compilation scheme, the parameters are unboxed at the entry of the function, which forces them to be saved on the stack (because under Unix, calling cos/sin can destroy all the xmm registers). One could probably improve this by pushing unboxing down the ulambda-code to the innermost node where the parameter is indeed used (or, as an easier and less general fix: disable unboxing when the parameter is used only once, not under a loop). But I can imagine cases where unboxing early is actually better for the pipeline, so I'm not sure we'd always win. |
|
(0007368) frisch (developer) 2012-04-17 14:33 |
I've committed a patch to push unboxing down the tree (without duplicating it). This reduces the difference between trunk and more_unboxing in the original version of square3. By adding parentheses around (u *. u) in the definition of f, the difference is again tiny, but in the other order (more_unboxing slightly faster than trunk). This is because this rewriting allows pushing the unboxing for "u" further down and avoids the problem with having this unboxing performed before the call to sin/cos. The new unboxing strategy does not change much the performance of square3. What this example really shows is that a more aggressive inliner (and/or a dedicated calling convention for functions on floats) could allow more modular numerical code without degrading performance. |
|
(0007370) Christophe Troestler (reporter) 2012-04-17 19:48 |
> What this example really shows is that a more aggressive inliner (and/or a dedicated calling convention for functions on floats) could allow more modular numerical code without degrading performance. That would be really nice indeed! |
|
(0007371) Christophe Troestler (reporter) 2012-04-17 19:56 |
To tackle the problem with unmarshalling, I uploaded square5. With 3.12 everything is fine. With this branch, I get the message “tr incorrect when i = 4” which means that the unmarshalling is done right but that somehow [tr] gets modified along the computations... This is rather strange (maybe a separate bug report needs to be opened?). |
|
(0007374) frisch (developer) 2012-04-18 11:27 edited on: 2012-04-18 11:30 |
Christophe: I can reproduce the bug with square5. I'll look at it. Smaller repro case: ========== open Bigarray let () = let tr0 = Array2.of_array int fortran_layout [| [| 10 |] |] in Printf.printf "#triangles: %i\n%!" (Array2.dim2 tr0); let open Marshal in let tr = from_string (to_string tr0 []) 0 in Gc.compact (); Printf.printf "#triangles: %i\n%!" (Array2.dim2 tr); assert(tr = tr0) ========== |
|
(0007375) frisch (developer) 2012-04-18 13:18 edited on: 2012-04-18 13:20 |
Mmmh, the problem disappeared after a full "make clean world opt opt.opt install". I assume we used a compilation command which refers to a different version of bigarray.a. Christophe: can you confirm? |
|
(0007385) Christophe Troestler (reporter) 2012-04-22 19:20 |
I confirm but the bug disappeared only after I updated to todays version of the branch. So this was not a mere recompilation problem. |
|
(0008564) frisch (developer) 2012-12-04 18:50 |
I will need to confirm that with a more serious methodology, but it seems the patch produce an interesting speedup over our entire unit test suite (around 6%, with some tests allocating twice fewer than before). |
Issue History |
|||
| Date Modified | Username | Field | Change |
| 2011-01-14 15:23 | frisch | New Issue | |
| 2011-01-17 13:44 | frisch | Note Added: 0005775 | |
| 2011-01-17 13:50 | frisch | Note Added: 0005776 | |
| 2011-01-18 15:43 | Christophe Troestler | Note Added: 0005780 | |
| 2011-01-18 16:55 | Christophe Troestler | Note Added: 0005781 | |
| 2011-01-18 18:04 | frisch | Note Added: 0005782 | |
| 2011-01-18 23:42 | Christophe Troestler | Note Added: 0005783 | |
| 2011-05-17 16:47 | doligez | Status | new => acknowledged |
| 2012-03-28 18:20 | frisch | File Added: bench1.ml | |
| 2012-03-28 18:20 | frisch | File Added: bench2.ml | |
| 2012-03-28 18:20 | frisch | File Added: bench3.ml | |
| 2012-03-28 18:21 | frisch | File Added: bench4.ml | |
| 2012-03-28 18:21 | frisch | File Added: bench5.ml | |
| 2012-03-28 18:21 | frisch | File Added: bench6.ml | |
| 2012-03-28 18:57 | frisch | Note Added: 0007218 | |
| 2012-03-29 00:17 | atavener | Note Added: 0007220 | |
| 2012-03-29 00:18 | atavener | File Added: noiz.ml | |
| 2012-03-29 00:18 | atavener | File Added: ptst.ml | |
| 2012-03-29 11:11 | frisch | Note Added: 0007230 | |
| 2012-03-29 17:42 | Christophe Troestler | Note Added: 0007245 | |
| 2012-03-29 17:43 | Christophe Troestler | Note Edited: 0007245 | View Revisions |
| 2012-03-29 17:52 | Christophe Troestler | File Added: nbody.ml | |
| 2012-03-29 18:02 | Christophe Troestler | Note Added: 0007247 | |
| 2012-03-29 18:09 | atavener | Note Added: 0007249 | |
| 2012-03-29 18:19 | Christophe Troestler | Note Edited: 0007245 | View Revisions |
| 2012-03-29 20:14 | frisch | Note Added: 0007250 | |
| 2012-03-29 21:58 | Christophe Troestler | File Added: square.ml | |
| 2012-03-29 22:16 | Christophe Troestler | Note Added: 0007251 | |
| 2012-04-10 13:17 | frisch | Note Added: 0007305 | |
| 2012-04-12 19:47 | frisch | Relationship added | parent of 0005133 |
| 2012-04-12 22:43 | Christophe Troestler | File Added: square_mesh.ml | |
| 2012-04-12 22:43 | Christophe Troestler | File Added: square_mesh.bin | |
| 2012-04-12 22:45 | Christophe Troestler | File Added: square2.ml | |
| 2012-04-12 22:51 | Christophe Troestler | Note Added: 0007347 | |
| 2012-04-12 23:20 | Christophe Troestler | File Added: square3.ml | |
| 2012-04-12 23:23 | Christophe Troestler | Note Added: 0007348 | |
| 2012-04-17 11:51 | frisch | Note Added: 0007366 | |
| 2012-04-17 14:33 | frisch | Note Added: 0007368 | |
| 2012-04-17 19:48 | Christophe Troestler | Note Added: 0007370 | |
| 2012-04-17 19:49 | Christophe Troestler | File Added: square4.ml | |
| 2012-04-17 19:53 | Christophe Troestler | File Added: square5.ml | |
| 2012-04-17 19:56 | Christophe Troestler | Note Added: 0007371 | |
| 2012-04-18 11:27 | frisch | Note Added: 0007374 | |
| 2012-04-18 11:30 | frisch | Note Edited: 0007374 | View Revisions |
| 2012-04-18 13:18 | frisch | Note Added: 0007375 | |
| 2012-04-18 13:20 | frisch | Note Edited: 0007375 | View Revisions |
| 2012-04-18 13:20 | frisch | Note Edited: 0007375 | View Revisions |
| 2012-04-22 19:20 | Christophe Troestler | Note Added: 0007385 | |
| 2012-06-20 11:19 | frisch | Category | OCaml general => OCaml backend (code generation) |
| 2012-07-04 16:57 | doligez | Severity | minor => feature |
| 2012-12-04 18:50 | frisch | Note Added: 0008564 | |
| Copyright © 2000 - 2011 MantisBT Group |