Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0005203OCamlOCaml backend (code generation)public2011-01-13 03:422013-10-08 15:21
ReporterChristophe Troestler 
Assigned To 
PlatformOSOS Version
Product Version3.12.1+dev 
Target VersionFixed in Version 
Summary0005203: Faster bigarray access
DescriptionBigarray access seems to be 20–30% slower than array access. For example,

type vec = (float, float64_elt, c_layout) Array1.t
let ba (a: vec) =
  let s = ref 0. in
  for i = 0 to n-1 do s := !s +. a.{i} done
let arr (a: float array) =
  let s = ref 0. in
  for i = 0 to n-1 do s := !s +. a.(i) done

ran on (big)arrays of size 10000 gives

       Rate ba arr
 ba 72519+-1524/s -- -26%
arr 97826+-1351/s 35% --

on an Intel core i7 Q 820 @ 1.73GHz (x86_64) running Debian GNU/Linux. Would it be possible to make access to (float) bigarrays as fast as (float) arrays when their type is statically known? That would be very much appreciated!
Attached Files? file icon patch_5203 [^] (5,586 bytes) 2011-01-14 13:41 [Show Content]

- Relationships

-  Notes
frisch (developer)
2011-01-13 18:43

I'm playing with a similar test:

let ba () =
  let a = Array1.create float64 c_layout n in
  for j = 0 to m-1 do
    for i = 0 to n-1 do a.{i} <- 3. done

let arr () =
  let a = Array.create n 0. in
  for j = 0 to m-1 do
    for i = 0 to n-1 do a.(i) <- 3. done

I obtain:

  ba 1.62s
  arr 1.16s

In the "ba" case, the inner loop is:

    movq %rdi, %rdx
    sarq $1, %rdx
    movq 8(%rax), %rsi
    movlpd .L106(%rip), %xmm0
    movlpd %xmm0, (%rsi, %rdx, 8)
    movq %rdi, %rsi
    addq $2, %rdi
    cmpq $19999999, %rsi
    jne .L103

Several things can be improved with only a minor effect on performance:
- load %xmm0 once, outside the loop;
- avoid right-shifting of rdx, and use a multiplier of 4 instead of 8 (and an offset -4), as for the array case;
- avoid copying rdi to rsi at the end, and compare rdi directly to 19999999 (the compiler knows there is no overflow with this constant upper-bound)

The real win is obtained by lifting the extra pointer indirection related to bigarrays outside the loop:

    movq 8(%rax), %rcx
    movq %rdi, %rdx
    sarq $1, %rdx
    movlpd .L106(%rip), %xmm0
    movlpd %xmm0, (%rcx, %rdx, 8)
    movq %rdi, %rsi
    addq $2, %rdi
    cmpq $19999999, %rsi
    jne .L103

With only this modification, we get the same performance as for regular arrays (this is not suprising, given that the inner loop is now the same).

How difficult would it be to perform this optimization automatically?

Is it specified that the "data" field of the caml_ba_array structure is immutable (it should not be modified by C code)?
frisch (developer)
2011-01-14 10:55
edited on: 2011-01-14 10:55

As far as I can say, one of the difficulties to implement this optimization (compared to unboxing of floats, for instance) is that we need to manipulate
at the CMM level words which are not GC values (the pointer to the underlying data of a bigarray value) but which implicitly keep some GC value alive (the bigarray value).

Christophe Troestler (reporter)
2011-01-14 11:54

> Is it specified that the "data" field of the caml_ba_array structure is immutable

I do not remember reading such a thing but I do not know either any library that changes the underlying data pointer after creating the bigarray.

I believe lifting this indirection as much as possible outside blocks — especially loops — would be worthwhile because bigarrays get/set are of course usually mixed with loops (whose efficiency is important) and it would be nice to be as fast as C.
frisch (developer)
2011-01-14 13:44
edited on: 2011-01-14 13:45

I'm attaching a patch that tries to remove the extra indirection. An expression

  let a = E1 (* some bigarray expression *) in
  E2 (* some expression that loads or stores elements in a *)

is compiled by extracting the data pointer from the bigarray once (after a is bound) and using this pointer in E2. Some care is needed to ensure that the bigarray value itself is not collected.

I'm not 100% sure this patch is sound!

On my example, I get the same performance for the array and bigarray versions.

frisch (developer)
2011-01-14 14:58

I forgot this chunk in the attached patch:

Index: asmcomp/amd64/emit.mlp
--- asmcomp/amd64/emit.mlp (revision 10927)
+++ asmcomp/amd64/emit.mlp (working copy)
@@ -335,7 +335,7 @@
 let emit_instr fallthrough i =
     match i.desc with
       Lend -> ()
- | Lop(Imove | Ispill | Ireload) ->
+ | Lop(Imove | Ispill | Ireload | Iignore) ->
         let src = i.arg.(0) and dst = i.res.(0) in
         if src.loc <> dst.loc then begin
           if src.typ = Float then
Christophe Troestler (reporter)
2011-01-14 16:16
edited on: 2012-04-25 10:42

Here are my experiments. The patch has indeed a good impact.

“set_ba” and “set_arr” are your tests, “ba” and “arr” are mine (creating the array in the function and with a “for j = 0 to m-1” loop around the loop over the array). I set n = 10_000 and m = 1000.

ocaml 3.11.2 (unpatched):
          Rate       set_ba      ba     arr set_arr
 set_ba 49.6+-0.1/s      --    -30%    -34%    -50%
     ba 71.2+-0.2/s     43%      --     -5%    -28%
    arr 74.7+-0.1/s     50%      5%      --    -25%
set_arr 99.0+-0.4/s    100%     39%     33%      --

ocaml 3.12.1+dev5 (2010-10-12), no patch:
          Rate       set_ba      ba     arr set_arr
 set_ba 49.7+-0.1/s      --    -30%    -50%    -50%
     ba 70.9+-0.4/s     43%      --    -29%    -29%
    arr 99.3+-0.3/s    100%     40%      --   [-0%]
set_arr 99.4+-0.2/s    100%     40%    [0%]      --

ocaml 3.12.1+dev5 (2010-10-12), after patch:
          Rate           ba  set_ba set_arr     arr
     ba 71.9+-0.2/s      --   [-2%]    -26%    -26%
 set_ba 73.2+-1.5/s    [2%]      --    -25%    -25%
set_arr 97.0+-1.0/s     35%     32%      --   [-1%]
    arr 97.8+-0.3/s     36%     33%    [1%]      --

Christophe Troestler (reporter)
2011-01-14 16:27

Once (something like) this patch is determined sound, it would be good to perform the optimization also on “fun vec -> E2” (loops over bigarrays usually happen in functions computing things).
doligez (administrator)
2011-01-20 10:46

You should make sure that the data pointer is not considered as a root by the GC. I looked at the patch but couldn't see anything related to this problem. This is because in the (not so far) future, it will be forbidden to have pointers outside the heap as OCaml values.
frisch (developer)
2011-01-20 11:32

I don't think there is a way, currently, to mark values at the cmm level as being non-GC pointers. Maybe this should be added at the same time pointers outside the heap are forbidden (do you have an idea when this will happen?).

Also, the patch is unsafe in the presence of GADTs, so I assume it won't be included anyway in its current form.

- Issue History
Date Modified Username Field Change
2011-01-13 03:42 Christophe Troestler New Issue
2011-01-13 18:43 frisch Note Added: 0005766
2011-01-14 10:55 frisch Note Added: 0005767
2011-01-14 10:55 frisch Note Edited: 0005767
2011-01-14 11:54 Christophe Troestler Note Added: 0005768
2011-01-14 13:41 frisch File Added: patch_5203
2011-01-14 13:44 frisch Note Added: 0005769
2011-01-14 13:45 frisch Note Edited: 0005769
2011-01-14 14:58 frisch Note Added: 0005770
2011-01-14 16:16 Christophe Troestler Note Added: 0005771
2011-01-14 16:27 Christophe Troestler Note Added: 0005772
2011-01-20 10:46 doligez Note Added: 0005784
2011-01-20 10:46 doligez Status new => acknowledged
2011-01-20 11:32 frisch Note Added: 0005785
2012-04-25 10:42 frisch Note Edited: 0005771 View Revisions
2012-06-21 20:10 frisch Category OCaml general => OCaml backend (code generation)
2013-10-08 15:21 doligez Tag Attached: patch

Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker