Faster bigarray access #5203

Closed · vicuna opened this issue Jan 13, 2011 · 10 comments

vicuna commented Jan 13, 2011

Original bug ID: 5203
Reporter: @Chris00
Status: acknowledged (set by @damiendoligez on 2011-01-20T09:46:52Z)
Resolution: open
Priority: normal
Severity: feature
Version: 3.12.1+dev
Category: back end (clambda to assembly)
Tags: patch
Monitored by: @ygrek @hcarty @Chris00 @alainfrisch

Bug description

Bigarray access seems to be 20–30% slower than array access. For example,

open Bigarray

let n = 10_000   (* size of the (big)arrays used in the benchmark *)
type vec = (float, float64_elt, c_layout) Array1.t

let ba (a : vec) =
  let s = ref 0. in
  for i = 0 to n - 1 do s := !s +. a.{i} done

let arr (a : float array) =
  let s = ref 0. in
  for i = 0 to n - 1 do s := !s +. a.(i) done

ran on (big)arrays of size 10000 gives

           Rate            ba     arr
 ba  72519+-1524/s         --    -26%
arr  97826+-1351/s        35%      --

on an Intel Core i7 Q 820 @ 1.73 GHz (x86_64) running Debian GNU/Linux. Would it be possible to make access to (float) bigarrays as fast as (float) arrays when their type is statically known? That would be very much appreciated!
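
The report does not include the measurement harness (the rate table above looks like output from the OCaml Benchmark library). A minimal stand-in, assuming the definitions above and a hypothetical repetition count m, might look like this:

(* Hypothetical driver, not the reporter's harness: time m repeated calls
   to each summing function on arrays of size n, using Sys.time. *)
let m = 1_000

let time name f =
  let t0 = Sys.time () in
  for _j = 1 to m do f () done;
  Printf.printf "%s: %.2fs\n" name (Sys.time () -. t0)

let () =
  let v = Array1.create float64 c_layout n in
  Array1.fill v 1.;
  let a = Array.make n 1. in
  time "ba"  (fun () -> ba v);
  time "arr" (fun () -> arr a)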

File attachments


vicuna commented Jan 13, 2011

Comment author: @alainfrisch

I'm playing with a similar test:

let ba () =
  let a = Array1.create float64 c_layout n in
  for j = 0 to m - 1 do
    for i = 0 to n - 1 do a.{i} <- 3. done
  done

let arr () =
  let a = Array.create n 0. in
  for j = 0 to m - 1 do
    for i = 0 to n - 1 do a.(i) <- 3. done
  done

I obtain:

ba 1.62s
arr 1.16s

In the "ba" case, the inner loop is:

.L103:
movq %rdi, %rdx
sarq $1, %rdx
movq 8(%rax), %rsi
movlpd .L106(%rip), %xmm0
movlpd %xmm0, (%rsi, %rdx, 8)
movq %rdi, %rsi
addq $2, %rdi
cmpq $19999999, %rsi
jne .L103

Several things can be improved with only a minor effect on performance:

  • load %xmm0 once, outside the loop;
  • avoid right-shifting of rdx, and use a multiplier of 4 instead of 8 (and an offset -4), as for the array case;
  • avoid copying rdi to rsi at the end, and compare rdi directly to 19999999 (the compiler knows there is no overflow with this constant upper-bound)

The real win is obtained by lifting the extra pointer indirection related to bigarrays outside the loop:

movq	8(%rax), %rcx

.L103:
movq %rdi, %rdx
sarq $1, %rdx
movlpd .L106(%rip), %xmm0
movlpd %xmm0, (%rcx, %rdx, 8)
movq %rdi, %rsi
addq $2, %rdi
cmpq $19999999, %rsi
jne .L103

With only this modification, we get the same performance as for regular arrays (this is not surprising, given that the inner loop is now the same).

How difficult would it be to perform this optimization automatically?

Is it specified that the "data" field of the caml_ba_array structure is immutable (it should not be modified by C code)?


vicuna commented Jan 14, 2011

Comment author: @alainfrisch

As far as I can tell, one of the difficulties in implementing this optimization (compared to unboxing of floats, for instance) is that we need to manipulate, at the Cmm level, words which are not GC values (the pointer to the underlying data of a bigarray) but which must implicitly keep some GC value alive (the bigarray value itself).


vicuna commented Jan 14, 2011

Comment author: @Chris00

> Is it specified that the "data" field of the caml_ba_array structure is immutable

I do not remember reading such a thing, but neither do I know of any library that changes the underlying data pointer after creating the bigarray.

I believe lifting this indirection outside blocks as much as possible, especially loops, would be worthwhile: bigarray get/set operations are usually mixed with loops whose efficiency matters, and it would be nice to be as fast as C.
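
For context, the main source-level knob is Array1.unsafe_get / unsafe_set, which removes the bounds check but not the per-access load of the data pointer. A minimal sketch, assuming the vec type from the report:

(* Sketch: unsafe access drops the bounds check, but each element access
   still reloads the data pointer from the bigarray header, which is the
   indirection discussed above. *)
let sum_unsafe (a : vec) =
  let s = ref 0. in
  for i = 0 to Array1.dim a - 1 do
    s := !s +. Array1.unsafe_get a i
  done;
  !s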


vicuna commented Jan 14, 2011

Comment author: @alainfrisch

I'm attaching a patch that tries to remove the extra indirection. An expression
like:

let a = E1 (* some bigarray expression *) in
E2 (* some expression that loads or stores elements in a *)

is compiled by extracting the data pointer from the bigarray once (after a is bound) and using this pointer in E2. Some care is needed to ensure that the bigarray value itself is not collected.
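
A hypothetical instance of this shape (illustrative only, not taken from the patch):

(* [a] is let-bound to a bigarray expression (E1) and the body (E2) only
   reads or writes elements of [a], so the data pointer can be extracted
   once, right after the binding, and reused for every access. *)
let fill n =
  let a = Array1.create float64 c_layout n in   (* E1 *)
  for i = 0 to n - 1 do a.{i} <- 3. done;       (* E2 *)
  a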

I'm not 100% sure this patch is sound!

On my example, I get the same performance for the array and bigarray versions.


vicuna commented Jan 14, 2011

Comment author: @alainfrisch

I forgot this chunk in the attached patch:

Index: asmcomp/amd64/emit.mlp
===================================================================
--- asmcomp/amd64/emit.mlp	(revision 10927)
+++ asmcomp/amd64/emit.mlp	(working copy)
@@ -335,7 +335,7 @@
 let emit_instr fallthrough i =
   match i.desc with
     Lend -> ()
-  | Lop(Imove | Ispill | Ireload) ->
+  | Lop(Imove | Ispill | Ireload | Iignore) ->
       let src = i.arg.(0) and dst = i.res.(0) in
       if src.loc <> dst.loc then begin
         if src.typ = Float then


vicuna commented Jan 14, 2011

Comment author: @Chris00

Here are my experiments. The patch has indeed a good impact.

“set_ba” and “set_arr” are your tests, “ba” and “arr” are mine (creating the array in the function and with a “for j = 0 to m-1” loop around the loop over the array). I set n = 10_000 and m = 1000.
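
Roughly, the “ba” variant described here could look like the following (a hedged reconstruction, not the actual benchmark code):

(* Hypothetical reconstruction: the bigarray is created inside the
   function and the summing loop over its n elements is repeated m times. *)
let n = 10_000 and m = 1_000

let ba () =
  let a = Array1.create float64 c_layout n in
  Array1.fill a 1.;
  let s = ref 0. in
  for _j = 0 to m - 1 do
    for i = 0 to n - 1 do s := !s +. a.{i} done
  done;
  !s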

ocaml 3.11.2 (unpatched):

          Rate       set_ba      ba     arr set_arr
 set_ba 49.6+-0.1/s      --    -30%    -34%    -50%
     ba 71.2+-0.2/s     43%      --     -5%    -28%
    arr 74.7+-0.1/s     50%      5%      --    -25%
set_arr 99.0+-0.4/s    100%     39%     33%      --

ocaml 3.12.1+dev5 (2010-10-12), no patch:

          Rate       set_ba      ba     arr set_arr
 set_ba 49.7+-0.1/s      --    -30%    -50%    -50%
     ba 70.9+-0.4/s     43%      --    -29%    -29%
    arr 99.3+-0.3/s    100%     40%      --   [-0%]
set_arr 99.4+-0.2/s    100%     40%    [0%]      --

ocaml 3.12.1+dev5 (2010-10-12), after patch:

          Rate           ba  set_ba set_arr     arr
     ba 71.9+-0.2/s      --   [-2%]    -26%    -26%
 set_ba 73.2+-1.5/s    [2%]      --    -25%    -25%
set_arr 97.0+-1.0/s     35%     32%      --   [-1%]
    arr 97.8+-0.3/s     36%     33%    [1%]      --


vicuna commented Jan 14, 2011

Comment author: @Chris00

Once (something like) this patch is determined to be sound, it would be good to perform the optimization also on “fun vec -> E2” (loops over bigarrays usually occur in functions that compute over them).
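
A hedged illustration of that shape, assuming the vec type from the report: the bigarray is a function parameter rather than a local let binding, so the data pointer would be extracted once at function entry and reused by the loop in the body.

(* Hypothetical example: [a] arrives as a function argument; under the
   proposed extension, its data pointer could be loaded once on entry. *)
let scale (a : vec) c =
  for i = 0 to Array1.dim a - 1 do
    a.{i} <- c *. a.{i}
  done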


vicuna commented Jan 20, 2011

Comment author: @damiendoligez

You should make sure that the data pointer is not considered a root by the GC. I looked at the patch but could not see anything related to this problem. This matters because, in the (not so far) future, it will be forbidden to have pointers outside the heap as OCaml values.


vicuna commented Jan 20, 2011

Comment author: @alainfrisch

I don't think there is a way, currently, to mark values at the cmm level as being non-GC pointers. Maybe this should be added at the same time pointers outside the heap are forbidden (do you have an idea when this will happen?).

Also, the patch is unsafe in the presence of GADTs, so I assume it won't be included anyway in its current form.

@xavierleroy commented

The CSE optimization that was added in 4.02 can often factor out the extra indirection caused by bigarrays, but it cannot lift it outside loops, as this report suggests. Other code motion optimizations could be considered, but I'd rather have a generic optimization than something specific to bigarrays. So, I'll close this report.
