"for" loop not entirely optimal? #5999

vicuna · 2013-04-24T18:54:39Z

Original bug ID: 5999
Reporter: vbrankov
Status: acknowledged (set by @xavierleroy on 2013-06-14T12:00:34Z)
Resolution: open
Priority: normal
Severity: feature
Version: 4.00.1
Target version: undecided
Category: back end (clambda to assembly)
Tags: patch
Child of: #2819
Monitored by: @gasche

Bug description

Why does the "for" loop create extra variable in cmmgen.ml? I simplified it and got a measurable speedup. I used binomial options pricing as benchmark and the speedup was between 0.3 and 3% depending on the version of Intel Xeon that I've tested it on.

Here's the patch:

Index: asmcomp/cmmgen.ml

--- asmcomp/cmmgen.ml (revision 13585)
+++ asmcomp/cmmgen.ml (working copy)
@@ -1243,7 +1243,6 @@
let tst = match dir with Upto -> Cgt | Downto -> Clt in
let inc = match dir with Upto -> Caddi | Downto -> Csubi in
let raise_num = next_raise_count () in

 let id_prev = Ident.rename id in
 return_unit
   (Clet
      (id, transl low,

@@ -1255,13 +1254,12 @@
Cloop
(Csequence
(remove_unit(transl body),

                    Clet(id_prev, Cvar id,
                     Csequence
                       (Cassign(id,
                          Cop(inc, [Cvar id; Cconst_int 2])),
                        Cifthenelse

                          (Cop(Ccmpi Ceq, [Cvar id_prev; high]),

                           Cexit (raise_num,[]), Ctuple [])))))),

                          (Cop(Ccmpi tst, [Cvar id; high]),

                           Cexit (raise_num,[]), Ctuple []))))),
            Ctuple []))))

| Uassign(id, exp) ->
return_unit(Cassign(id, transl exp))

The text was updated successfully, but these errors were encountered:

vicuna · 2013-04-25T14:27:28Z

Comment author: @alainfrisch

"svn annotate" is your friend:

http://caml.inria.fr/cgi-bin/viewvc.cgi?view=revision&revision=5277

vicuna · 2013-04-25T14:31:27Z

Comment author: vbrankov

Ups, sorry.

vicuna · 2013-04-25T14:58:14Z

Comment author: @xavierleroy

See #415. I suspect your patch reintroduces this bug, namely

for i = ... to max_int do ... done

fails to terminate. Could you check whether it is the case?

vicuna · 2013-04-25T15:30:01Z

Comment author: vbrankov

Yes, it does. How about this?

However, with this patch, although I think it generates simpler cmm code, I cannot reproduce the speedup. I'm checking whether I'm doing anything wrong.

Index: asmcomp/cmmgen.ml

--- asmcomp/cmmgen.ml (revision 13585)
+++ asmcomp/cmmgen.ml (working copy)
@@ -1243,7 +1243,6 @@
let tst = match dir with Upto -> Cgt | Downto -> Clt in
let inc = match dir with Upto -> Caddi | Downto -> Csubi in
let raise_num = next_raise_count () in

 let id_prev = Ident.rename id in
 return_unit
   (Clet
      (id, transl low,

@@ -1255,13 +1254,11 @@
Cloop
(Csequence
(remove_unit(transl body),

                    Clet(id_prev, Cvar id,

```
                     Csequence
```
```
                       (Cassign(id,
```

                          Cop(inc, [Cvar id; Cconst_int 2])),

```
                        Cifthenelse
```

                          (Cop(Ccmpi Ceq, [Cvar id_prev; high]),

                           Cexit (raise_num,[]), Ctuple [])))))),

```
                       Cifthenelse
```

                          (Cop(Ccmpi Ceq, [Cvar id; high]),

                           Cexit (raise_num,[]),

```
                          Cassign(id,
```

                           Cop(inc, [Cvar id; Cconst_int 2])))))),
            Ctuple []))))

| Uassign(id, exp) ->
return_unit(Cassign(id, transl exp))

vicuna · 2013-04-25T15:52:55Z

Comment author: vbrankov

Or we can change the "if" condition to Cne so that Cexit is in the second leg. I'm not sure whether that would have any benefits.

vicuna · 2013-04-26T16:39:52Z

Comment author: @xavierleroy

If I read your patch #2 correctly, you end up generating a loop with a conditional test in the middle and a jump at the end:

Lhead:
...
branch-if-finished Lafter
...
branch Lhead
Lafter:

instead of a loop with the conditional test at the bottom:

Lhead:
...
branch-if-not-finished Lhead

This doesn't look good to me.

What I had in mind was, rather, to precompute "max loop value plus 1 / minus 1" and compare the loop index after incrementation / decrementation against this precomputed value. I doubt this would make much of a performance difference wrt the code currently generated, though.

Concerning 3% speedups, see the recent discussion on caml-list: minute changes in code placement alone can result in much bigger slowdowns or speedups.

vicuna · 2013-04-26T17:35:17Z

Comment author: @alainfrisch

Concerning 3% speedups, see the recent discussion on caml-list: minute changes in code placement alone can result in much bigger slowdowns or speedups.

Gabriel Scherer collects a set of realistic benchmarks, which will hopefully make it easier to assess the benefits of such optimizations. While I agree that a 3% speedup on one example does not give any useful information, if we can observe a 3% speedup on a reasonable number of benchmarks and no degradation on the others, it is definitely a good idea to consider the optimization as a good candidate. Do you agree? (Then other criteria apply, of course, such as complexity of the compiler, possible interactions with other optimizations, etc)

vicuna · 2013-04-29T22:06:55Z

Comment author: vbrankov

I apologize for my somewhat messy and hastily written input. I agree with Xavier that I reversed the "if" conditions, and what I wondered in my previous comment is why after fixing it the code was not faster. I think I understand the reason and my new change produced a measurable speedup.

I validated the following change against this code:

let () =
for x = max_int - 3 to max_int do
Printf.printf "%d\n" x
done;
for x = min_int to min_int + 3 do
Printf.printf "%d\n" x
done;
for x = max_int downto max_int - 3 do
Printf.printf "%d\n" x
done;
for x = min_int + 3 downto min_int do
Printf.printf "%d\n" x
done;

I also validated the loop going from min_int to max_int and vice versa.

The change follows. I agree that it's not too much simpler and I understand the complications arising around max_int and min_int. It might have been simpler if "for" loop didn't include the last value :)

Index: asmcomp/cmmgen.ml

--- asmcomp/cmmgen.ml (revision 13585)
+++ asmcomp/cmmgen.ml (working copy)
@@ -1242,8 +1242,8 @@
| Ufor(id, low, high, dir, body) ->
let tst = match dir with Upto -> Cgt | Downto -> Clt in
let inc = match dir with Upto -> Caddi | Downto -> Csubi in

 let dec = match dir with Upto -> Csubi | Downto -> Caddi in
 let raise_num = next_raise_count () in

 let id_prev = Ident.rename id in
 return_unit
   (Clet
      (id, transl low,

@@ -1252,16 +1252,17 @@
(raise_num, [],
Cifthenelse
(Cop(Ccmpi tst, [Cvar id; high]), Cexit (raise_num, []),

```
               Cloop
```
```
                 (Csequence
```

                    (remove_unit(transl body),

                    Clet(id_prev, Cvar id,

```
                     Csequence
```

```
               Csequence
```

                 (Cassign(id, Cop(dec, [Cvar id; Cconst_int 2])),

```
                  Cloop
```

                    (Csequence
                       (Cassign(id,
                          Cop(inc, [Cvar id; Cconst_int 2])),

```
                        Cifthenelse
```

                          (Cop(Ccmpi Ceq, [Cvar id_prev; high]),

                           Cexit (raise_num,[]), Ctuple [])))))),

```
                        Csequence
```

                          (remove_unit(transl body),

```
                           Cifthenelse
```

                             (Cop(Ccmpi Ceq, [Cvar id; high]),

                              Cexit (raise_num,[]), Ctuple [])))))),
            Ctuple []))))

| Uassign(id, exp) ->
return_unit(Cassign(id, transl exp))

I tried these simple benchmarks and the results show a speedup. You can decide whether it's worth trying on your benchmarks which I'm sure are much better.

for1.ml:

let () =
for j = 1 to 16 do
let t = Unix.gettimeofday () in
let s = ref 0 in
for i = 0 to 256 * 1024 * 1024 do
s := !s + i
done;
Printf.printf "%f\n%!" (Unix.gettimeofday () -. t)
done

for2.ml:

let () =
for j = 1 to 16 do
let t = Unix.gettimeofday () in
for i = 0 to 256 * 1024 * 1024 do
let _ = () in ()
done;
Printf.printf "%f\n%!" (Unix.gettimeofday () -. t)
done

binomial.ml:

let print x =
Printf.printf "%f\n" x

let binomial =
let n = 31 in
let up_pow = Array.make (2 * n + 1) 0. in
let p = Array.make (n + 1) 0. in
fun t s k r sigma q ->
let delta_t = t /. (float_of_int n) in
let up = exp (sigma *. (sqrt delta_t)) in
let e = exp ((q -. r) *. delta_t) in
let p0 = (up -. e) /. (up *. up -. 1.) in
let p1 = e -. p0 in

let upp = ref 1. in
for i = 0 to n do
  up_pow.(n - i) <- 1. /. !upp;
  up_pow.(n + i) <- !upp;
  upp := !upp *. up
done;

for i = 0 to n do
  p.(i) <- max (k -. s *. up_pow.(2 * i)) 0.
done;

for j = n - 1 downto 0 do
  for i = 0 to j do
    p.(i) <- max (p0 *. p.(i + 1) +. p1 *. p.(i))
                 (k -. s *. up_pow.(2 * i - j + n))
  done
done;

p.(0)

let () =
for j = 1 to 16 do
let t = Unix.gettimeofday () in
for i = 1 to 16 * 1024 do
let _ = binomial 2. 51. 52. 0.05 0.30 0. in ()
done;
Printf.printf "%f\n%!" (Unix.gettimeofday () -. t)
done

Sandy Bridge E5-2678W:

$ ./for1o.exe
0.260819
0.260948
0.260801
0.260866
0.260827
0.260850
0.260839
0.260838
0.260805
0.260932
0.260837
0.260867
0.260853
0.260815
0.260880
0.260835
$ ./for1n.exe
0.260984
0.260826
0.260842
0.260806
0.260820
0.260875
0.260818
0.260833
0.260931
0.260856
0.260883
0.260831
0.260840
0.260823
0.260818
0.260916
$ ./for2o.exe
0.086952
0.086954
0.087016
0.086974
0.086926
0.086920
0.086960
0.086930
0.086975
0.086945
0.086949
0.086933
0.086921
0.086993
0.087036
0.086943
$ ./for2n.exe
0.163828
0.163704
0.163343
0.163945
0.163630
0.163942
0.163502
0.163694
0.163925
0.163866
0.163601
0.163771
0.164038
0.163980
0.163916
0.163751
$ ./binomialo.exe
0.207450
0.206657
0.206742
0.206671
0.206671
0.206698
0.206661
0.206738
0.206699
0.206717
0.206709
0.206715
0.206755
0.206704
0.206725
0.206675
$ ./binomialn.exe
0.213871
0.213428
0.213373
0.213341
0.213416
0.213362
0.213369
0.213388
0.213402
0.213394
0.213379
0.213426
0.213356
0.213453
0.213383
0.213424

Results on Nehalem W5580:

$ ./for1n.exe
0.174947
0.174892
0.174715
0.174037
0.174908
0.174869
0.175077
0.174716
0.174957
0.174934
0.173908
0.174368
0.174649
0.173923
0.174939
0.174617
$ ./for1o.exe
0.155822
0.156458
0.155934
0.155775
0.155727
0.155996
0.155818
0.155730
0.155991
0.155727
0.155741
0.155731
0.155782
0.155853
0.155822
0.156264
$ ./for2n.exe
0.156385
0.155981
0.156028
0.156326
0.156335
0.159161
0.155775
0.155884
0.155790
0.155858
0.155726
0.155789
0.155730
0.155765
0.155715
0.156170
$ ./for2o.exe
0.156091
0.156449
0.155923
0.155901
0.155962
0.155920
0.155927
0.155922
0.156145
0.156003
0.155985
0.155826
0.155922
0.155877
0.156098
0.155893
$ ./binomialn.exe
0.195871
0.197264
0.197959
0.198080
0.197767
0.197309
0.197652
0.197516
0.197584
0.197424
0.197679
0.197659
0.198057
0.197310
0.197322
0.197594
$ ./binomialo.exe
0.197047
0.194971
0.195267
0.197578
0.197261
0.197287
0.197569
0.197292
0.197625
0.197437
0.197564
0.197575
0.198512
0.197781
0.197244
0.197192

vicuna · 2013-06-13T16:49:54Z

Comment author: vbrankov

I'm sorry, I see that this issue stalled. Do I need to do something about it?

vicuna · 2013-06-14T12:00:34Z

Comment author: @xavierleroy

Do I need to do something about it?

No, thanks for asking. We're in code freeze for release 4.01, so this will have to wait for later.

github-actions · 2020-05-15T04:20:19Z

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

vicuna added the back-end label Mar 14, 2019

vicuna mentioned this issue Apr 25, 2013

for i = 0 to max_int do () done loop ? #2819

Closed

vicuna added the feature-wish label Mar 20, 2019

github-actions bot added the Stale label May 15, 2020

github-actions bot closed this as completed Jun 19, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

"for" loop not entirely optimal? #5999

"for" loop not entirely optimal? #5999

vicuna commented Apr 24, 2013

vicuna commented Apr 25, 2013

vicuna commented Apr 25, 2013

vicuna commented Apr 25, 2013

vicuna commented Apr 25, 2013

vicuna commented Apr 25, 2013

vicuna commented Apr 26, 2013

vicuna commented Apr 26, 2013

vicuna commented Apr 29, 2013

vicuna commented Jun 13, 2013

vicuna commented Jun 14, 2013

github-actions bot commented May 15, 2020