| Anonymous | Login | Signup for a new account | 2013-05-21 10:50 CEST | ![]() |
| Main | My View | View Issues | Change Log | Roadmap |
| View Issue Details [ Jump to Notes ] | [ Issue History ] [ Print ] | ||||||||||
| ID | Project | Category | View Status | Date Submitted | Last Update | ||||||
| 0005203 | OCaml | OCaml backend (code generation) | public | 2011-01-13 03:42 | 2012-06-21 23:08 | ||||||
| Reporter | Christophe Troestler | ||||||||||
| Assigned To | |||||||||||
| Priority | normal | Severity | feature | Reproducibility | always | ||||||
| Status | acknowledged | Resolution | open | ||||||||
| Platform | OS | OS Version | |||||||||
| Product Version | 3.12.1+dev | ||||||||||
| Target Version | Fixed in Version | ||||||||||
| Summary | 0005203: Faster bigarray access | ||||||||||
| Description | Bigarray access seems to be 20–30% slower than array access. For example, type vec = (float, float64_elt, c_layout) Array1.t let ba (a: vec) = let s = ref 0. in for i = 0 to n-1 do s := !s +. a.{i} done let arr (a: float array) = let s = ref 0. in for i = 0 to n-1 do s := !s +. a.(i) done ran on (big)arrays of size 10000 gives Rate ba arr ba 72519+-1524/s -- -26% arr 97826+-1351/s 35% -- on an Intel core i7 Q 820 @ 1.73GHz (x86_64) running Debian GNU/Linux. Would it be possible to make access to (float) bigarrays as fast as (float) arrays when their type is statically known? That would be very much appreciated! | ||||||||||
| Tags | No tags attached. | ||||||||||
| Attached Files | |||||||||||
Notes |
|
|
(0005766) frisch (developer) 2011-01-13 18:43 |
I'm playing with a similar test: let ba () = let a = Array1.create float64 c_layout n in for j = 0 to m-1 do for i = 0 to n-1 do a.{i} <- 3. done done let arr () = let a = Array.create n 0. in for j = 0 to m-1 do for i = 0 to n-1 do a.(i) <- 3. done done I obtain: ba 1.62s arr 1.16s In the "ba" case, the inner loop is: .L103: movq %rdi, %rdx sarq $1, %rdx movq 8(%rax), %rsi movlpd .L106(%rip), %xmm0 movlpd %xmm0, (%rsi, %rdx, 8) movq %rdi, %rsi addq $2, %rdi cmpq $19999999, %rsi jne .L103 Several things can be improved with only a minor effect on performance: - load %xmm0 once, outside the loop; - avoid right-shifting of rdx, and use a multiplier of 4 instead of 8 (and an offset -4), as for the array case; - avoid copying rdi to rsi at the end, and compare rdi directly to 19999999 (the compiler knows there is no overflow with this constant upper-bound) The real win is obtained by lifting the extra pointer indirection related to bigarrays outside the loop: movq 8(%rax), %rcx .L103: movq %rdi, %rdx sarq $1, %rdx movlpd .L106(%rip), %xmm0 movlpd %xmm0, (%rcx, %rdx, 8) movq %rdi, %rsi addq $2, %rdi cmpq $19999999, %rsi jne .L103 With only this modification, we get the same performance as for regular arrays (this is not suprising, given that the inner loop is now the same). How difficult would it be to perform this optimization automatically? Is it specified that the "data" field of the caml_ba_array structure is immutable (it should not be modified by C code)? |
|
(0005767) frisch (developer) 2011-01-14 10:55 edited on: 2011-01-14 10:55 |
As far as I can say, one of the difficulties to implement this optimization (compared to unboxing of floats, for instance) is that we need to manipulate at the CMM level words which are not GC values (the pointer to the underlying data of a bigarray value) but which implicitly keep some GC value alive (the bigarray value). |
|
(0005768) Christophe Troestler (reporter) 2011-01-14 11:54 |
> Is it specified that the "data" field of the caml_ba_array structure is immutable I do not remember reading such a thing but I do not know either any library that changes the underlying data pointer after creating the bigarray. I believe lifting this indirection as much as possible outside blocks — especially loops — would be worthwhile because bigarrays get/set are of course usually mixed with loops (whose efficiency is important) and it would be nice to be as fast as C. |
|
(0005769) frisch (developer) 2011-01-14 13:44 edited on: 2011-01-14 13:45 |
I'm attaching a patch that tries to remove the extra indirection. An expression like: let a = E1 (* some bigarray expression *) in E2 (* some expression that loads or stores elements in a *) is compiled by extracting the data pointer from the bigarray once (after a is bound) and using this pointer in E2. Some care is needed to ensure that the bigarray value itself is not collected. I'm not 100% sure this patch is sound! On my example, I get the same performance for the array and bigarray versions. |
|
(0005770) frisch (developer) 2011-01-14 14:58 |
I forgot this chunk in the attached patch: Index: asmcomp/amd64/emit.mlp =================================================================== --- asmcomp/amd64/emit.mlp (revision 10927) +++ asmcomp/amd64/emit.mlp (working copy) @@ -335,7 +335,7 @@ let emit_instr fallthrough i = match i.desc with Lend -> () - | Lop(Imove | Ispill | Ireload) -> + | Lop(Imove | Ispill | Ireload | Iignore) -> let src = i.arg.(0) and dst = i.res.(0) in if src.loc <> dst.loc then begin if src.typ = Float then |
|
(0005771) Christophe Troestler (reporter) 2011-01-14 16:16 edited on: 2012-04-25 10:42 |
Here are my experiments. The patch has indeed a good impact. “set_ba” and “set_arr” are your tests, “ba” and “arr” are mine (creating the array in the function and with a “for j = 0 to m-1” loop around the loop over the array). I set n = 10_000 and m = 1000. ocaml 3.11.2 (unpatched):
Rate set_ba ba arr set_arr
set_ba 49.6+-0.1/s -- -30% -34% -50%
ba 71.2+-0.2/s 43% -- -5% -28%
arr 74.7+-0.1/s 50% 5% -- -25%
set_arr 99.0+-0.4/s 100% 39% 33% --
ocaml 3.12.1+dev5 (2010-10-12), no patch:
Rate set_ba ba arr set_arr
set_ba 49.7+-0.1/s -- -30% -50% -50%
ba 70.9+-0.4/s 43% -- -29% -29%
arr 99.3+-0.3/s 100% 40% -- [-0%]
set_arr 99.4+-0.2/s 100% 40% [0%] --
ocaml 3.12.1+dev5 (2010-10-12), after patch:
Rate ba set_ba set_arr arr
ba 71.9+-0.2/s -- [-2%] -26% -26%
set_ba 73.2+-1.5/s [2%] -- -25% -25%
set_arr 97.0+-1.0/s 35% 32% -- [-1%]
arr 97.8+-0.3/s 36% 33% [1%] --
|
|
(0005772) Christophe Troestler (reporter) 2011-01-14 16:27 |
Once (something like) this patch is determined sound, it would be good to perform the optimization also on “fun vec -> E2” (loops over bigarrays usually happen in functions computing things). |
|
(0005784) doligez (manager) 2011-01-20 10:46 |
You should make sure that the data pointer is not considered as a root by the GC. I looked at the patch but couldn't see anything related to this problem. This is because in the (not so far) future, it will be forbidden to have pointers outside the heap as OCaml values. |
|
(0005785) frisch (developer) 2011-01-20 11:32 |
I don't think there is a way, currently, to mark values at the cmm level as being non-GC pointers. Maybe this should be added at the same time pointers outside the heap are forbidden (do you have an idea when this will happen?). Also, the patch is unsafe in the presence of GADTs, so I assume it won't be included anyway in its current form. |
Issue History |
|||
| Date Modified | Username | Field | Change |
| 2011-01-13 03:42 | Christophe Troestler | New Issue | |
| 2011-01-13 18:43 | frisch | Note Added: 0005766 | |
| 2011-01-14 10:55 | frisch | Note Added: 0005767 | |
| 2011-01-14 10:55 | frisch | Note Edited: 0005767 | |
| 2011-01-14 11:54 | Christophe Troestler | Note Added: 0005768 | |
| 2011-01-14 13:41 | frisch | File Added: patch_5203 | |
| 2011-01-14 13:44 | frisch | Note Added: 0005769 | |
| 2011-01-14 13:45 | frisch | Note Edited: 0005769 | |
| 2011-01-14 14:58 | frisch | Note Added: 0005770 | |
| 2011-01-14 16:16 | Christophe Troestler | Note Added: 0005771 | |
| 2011-01-14 16:27 | Christophe Troestler | Note Added: 0005772 | |
| 2011-01-20 10:46 | doligez | Note Added: 0005784 | |
| 2011-01-20 10:46 | doligez | Status | new => acknowledged |
| 2011-01-20 11:32 | frisch | Note Added: 0005785 | |
| 2012-04-25 10:42 | frisch | Note Edited: 0005771 | View Revisions |
| 2012-06-21 20:10 | frisch | Category | OCaml general => OCaml backend (code generation) |
| Copyright © 2000 - 2011 MantisBT Group |