Faster bigarray access #5203

Closed · vicuna opened this issue Jan 13, 2011 · 10 comments

vicuna commented Jan 13, 2011

Original bug ID: 5203
Reporter: @Chris00
Status: acknowledged (set by @damiendoligez on 2011-01-20T09:46:52Z)
Resolution: open
Priority: normal
Severity: feature
Version: 3.12.1+dev
Category: back end (clambda to assembly)
Tags: patch
Monitored by: @ygrek @hcarty @Chris00 @alainfrisch

Bug description

Bigarray access seems to be 20–30% slower than array access. For example,

open Bigarray

let n = 10_000   (* size of the (big)arrays used in the benchmark *)
type vec = (float, float64_elt, c_layout) Array1.t

let ba (a : vec) =
  let s = ref 0. in
  for i = 0 to n - 1 do s := !s +. a.{i} done

let arr (a : float array) =
  let s = ref 0. in
  for i = 0 to n - 1 do s := !s +. a.(i) done

ran on (big)arrays of size 10000 gives

           Rate            ba     arr
 ba  72519+-1524/s         --    -26%
arr  97826+-1351/s        35%      --

on an Intel Core i7 Q 820 @ 1.73 GHz (x86_64) running Debian GNU/Linux. Would it be possible to make access to (float) bigarrays as fast as (float) arrays when their type is statically known? That would be very much appreciated!
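
The report does not include the measurement harness (the rate table above looks like output from the OCaml Benchmark library). A minimal stand-in, assuming the definitions above and a hypothetical repetition count m, might look like this:

(* Hypothetical driver, not the reporter's harness: time m repeated calls
   to each summing function on arrays of size n, using Sys.time. *)
let m = 1_000

let time name f =
  let t0 = Sys.time () in
  for _j = 1 to m do f () done;
  Printf.printf "%s: %.2fs\n" name (Sys.time () -. t0)

let () =
  let v = Array1.create float64 c_layout n in
  Array1.fill v 1.;
  let a = Array.make n 1. in
  time "ba"  (fun () -> ba v);
  time "arr" (fun () -> arr a)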

File attachments


vicuna commented Jan 13, 2011

Comment author: @alainfrisch

I'm playing with a similar test:

let ba () =
  let a = Array1.create float64 c_layout n in
  for j = 0 to m - 1 do
    for i = 0 to n - 1 do a.{i} <- 3. done
  done

let arr () =
  let a = Array.create n 0. in
  for j = 0 to m - 1 do
    for i = 0 to n - 1 do a.(i) <- 3. done
  done

I obtain:

ba 1.62s
arr 1.16s

In the "ba" case, the inner loop is:

.L103:
movq %rdi, %rdx
sarq $1, %rdx
movq 8(%rax), %rsi
movlpd .L106(%rip), %xmm0
movlpd %xmm0, (%rsi, %rdx, 8)
movq %rdi, %rsi
addq $2, %rdi
cmpq $19999999, %rsi
jne .L103

Several things can be improved with only a minor effect on performance:

  • load %xmm0 once, outside the loop;
  • avoid right-shifting of rdx, and use a multiplier of 4 instead of 8 (and an offset -4), as for the array case;
  • avoid copying rdi to rsi at the end, and compare rdi directly to 19999999 (the compiler knows there is no overflow with this constant upper-bound)

The real win is obtained by lifting the extra pointer indirection related to bigarrays outside the loop:

movq	8(%rax), %rcx

.L103:
movq %rdi, %rdx
sarq $1, %rdx
movlpd .L106(%rip), %xmm0
movlpd %xmm0, (%rcx, %rdx, 8)
movq %rdi, %rsi
addq $2, %rdi
cmpq $19999999, %rsi
jne .L103

With only this modification, we get the same performance as for regular arrays (this is not surprising, given that the inner loop is now the same).

How difficult would it be to perform this optimization automatically?

Is it specified that the "data" field of the caml_ba_array structure is immutable (it should not be modified by C code)?


vicuna commented Jan 14, 2011

Comment author: @alainfrisch

As far as I can tell, one of the difficulties in implementing this optimization (compared to unboxing of floats, for instance) is that we need to manipulate, at the Cmm level, words which are not GC values (the pointer to the underlying data of a bigarray) but which must implicitly keep some GC value alive (the bigarray value itself).


vicuna commented Jan 14, 2011

Comment author: @Chris00

> Is it specified that the "data" field of the caml_ba_array structure is immutable

I do not remember reading such a thing, but neither do I know of any library that changes the underlying data pointer after creating the bigarray.

I believe lifting this indirection outside blocks as much as possible, especially loops, would be worthwhile: bigarray get/set operations are usually mixed with loops whose efficiency matters, and it would be nice to be as fast as C.
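
For context, the main source-level knob is Array1.unsafe_get / unsafe_set, which removes the bounds check but not the per-access load of the data pointer. A minimal sketch, assuming the vec type from the report:

(* Sketch: unsafe access drops the bounds check, but each element access
   still reloads the data pointer from the bigarray header, which is the
   indirection discussed above. *)
let sum_unsafe (a : vec) =
  let s = ref 0. in
  for i = 0 to Array1.dim a - 1 do
    s := !s +. Array1.unsafe_get a i
  done;
  !s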


vicuna commented Jan 14, 2011

Comment author: @alainfrisch

I'm attaching a patch that tries to remove the extra indirection. An expression
like:

let a = E1 (* some bigarray expression *) in
E2 (* some expression that loads or stores elements in a *)

is compiled by extracting the data pointer from the bigarray once (after a is bound) and using this pointer in E2. Some care is needed to ensure that the bigarray value itself is not collected.
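
A hypothetical instance of this shape (illustrative only, not taken from the patch):

(* [a] is let-bound to a bigarray expression (E1) and the body (E2) only
   reads or writes elements of [a], so the data pointer can be extracted
   once, right after the binding, and reused for every access. *)
let fill n =
  let a = Array1.create float64 c_layout n in   (* E1 *)
  for i = 0 to n - 1 do a.{i} <- 3. done;       (* E2 *)
  a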

I'm not 100% sure this patch is sound!

On my example, I get the same performance for the array and bigarray versions.


vicuna commented Jan 14, 2011

Comment author: @alainfrisch

I forgot this chunk in the attached patch:

Index: asmcomp/amd64/emit.mlp
===================================================================
--- asmcomp/amd64/emit.mlp	(revision 10927)
+++ asmcomp/amd64/emit.mlp	(working copy)
@@ -335,7 +335,7 @@
 let emit_instr fallthrough i =
   match i.desc with
     Lend -> ()
-  | Lop(Imove | Ispill | Ireload) ->
+  | Lop(Imove | Ispill | Ireload | Iignore) ->
       let src = i.arg.(0) and dst = i.res.(0) in
       if src.loc <> dst.loc then begin
         if src.typ = Float then


vicuna commented Jan 14, 2011

Comment author: @Chris00

Here are my experiments. The patch has indeed a good impact.

“set_ba” and “set_arr” are your tests, “ba” and “arr” are mine (creating the array in the function and with a “for j = 0 to m-1” loop around the loop over the array). I set n = 10_000 and m = 1000.
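
Roughly, the “ba” variant described here could look like the following (a hedged reconstruction, not the actual benchmark code):

(* Hypothetical reconstruction: the bigarray is created inside the
   function and the summing loop over its n elements is repeated m times. *)
let n = 10_000 and m = 1_000

let ba () =
  let a = Array1.create float64 c_layout n in
  Array1.fill a 1.;
  let s = ref 0. in
  for _j = 0 to m - 1 do
    for i = 0 to n - 1 do s := !s +. a.{i} done
  done;
  !s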

ocaml 3.11.2 (unpatched):

          Rate       set_ba      ba     arr set_arr
 set_ba 49.6+-0.1/s      --    -30%    -34%    -50%
     ba 71.2+-0.2/s     43%      --     -5%    -28%
    arr 74.7+-0.1/s     50%      5%      --    -25%
set_arr 99.0+-0.4/s    100%     39%     33%      --

ocaml 3.12.1+dev5 (2010-10-12), no patch:

          Rate       set_ba      ba     arr set_arr
 set_ba 49.7+-0.1/s      --    -30%    -50%    -50%
     ba 70.9+-0.4/s     43%      --    -29%    -29%
    arr 99.3+-0.3/s    100%     40%      --   [-0%]
set_arr 99.4+-0.2/s    100%     40%    [0%]      --

ocaml 3.12.1+dev5 (2010-10-12), after patch:

          Rate           ba  set_ba set_arr     arr
     ba 71.9+-0.2/s      --   [-2%]    -26%    -26%
 set_ba 73.2+-1.5/s    [2%]      --    -25%    -25%
set_arr 97.0+-1.0/s     35%     32%      --   [-1%]
    arr 97.8+-0.3/s     36%     33%    [1%]      --


vicuna commented Jan 14, 2011

Comment author: @Chris00

Once (something like) this patch is determined to be sound, it would be good to perform the optimization also on “fun vec -> E2” (loops over bigarrays usually occur in functions that compute over them).
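
A hedged illustration of that shape, assuming the vec type from the report: the bigarray is a function parameter rather than a local let binding, so the data pointer would be extracted once at function entry and reused by the loop in the body.

(* Hypothetical example: [a] arrives as a function argument; under the
   proposed extension, its data pointer could be loaded once on entry. *)
let scale (a : vec) c =
  for i = 0 to Array1.dim a - 1 do
    a.{i} <- c *. a.{i}
  done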


vicuna commented Jan 20, 2011

Comment author: @damiendoligez

You should make sure that the data pointer is not considered a root by the GC. I looked at the patch but could not see anything related to this problem. This matters because, in the (not so far) future, it will be forbidden to have pointers outside the heap as OCaml values.


vicuna commented Jan 20, 2011

Comment author: @alainfrisch

I don't think there is a way, currently, to mark values at the cmm level as being non-GC pointers. Maybe this should be added at the same time pointers outside the heap are forbidden (do you have an idea when this will happen?).

Also, the patch is unsafe in the presence of GADTs, so I assume it won't be included anyway in its current form.

@xavierleroy commented

The CSE optimization that was added in 4.02 can often factor out the extra indirection caused by bigarrays, but it cannot lift it outside loops, as this report suggests. Other code motion optimizations could be considered, but I'd rather have a generic optimization than something specific to bigarrays. So, I'll close this report.
