Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"for" loop not entirely optimal? #5999

Closed
vicuna opened this issue Apr 24, 2013 · 11 comments
Closed

"for" loop not entirely optimal? #5999

vicuna opened this issue Apr 24, 2013 · 11 comments

Comments

@vicuna
Copy link

vicuna commented Apr 24, 2013

Original bug ID: 5999
Reporter: vbrankov
Status: acknowledged (set by @xavierleroy on 2013-06-14T12:00:34Z)
Resolution: open
Priority: normal
Severity: feature
Version: 4.00.1
Target version: undecided
Category: back end (clambda to assembly)
Tags: patch
Child of: #2819
Monitored by: @gasche

Bug description

Why does the "for" loop create extra variable in cmmgen.ml? I simplified it and got a measurable speedup. I used binomial options pricing as benchmark and the speedup was between 0.3 and 3% depending on the version of Intel Xeon that I've tested it on.

Here's the patch:

Index: asmcomp/cmmgen.ml

--- asmcomp/cmmgen.ml (revision 13585)
+++ asmcomp/cmmgen.ml (working copy)
@@ -1243,7 +1243,6 @@
let tst = match dir with Upto -> Cgt | Downto -> Clt in
let inc = match dir with Upto -> Caddi | Downto -> Csubi in
let raise_num = next_raise_count () in

  •  let id_prev = Ident.rename id in
     return_unit
       (Clet
          (id, transl low,
    

@@ -1255,13 +1254,12 @@
Cloop
(Csequence
(remove_unit(transl body),

  •                     Clet(id_prev, Cvar id,
                         Csequence
                           (Cassign(id,
                              Cop(inc, [Cvar id; Cconst_int 2])),
                            Cifthenelse
    
  •                           (Cop(Ccmpi Ceq, [Cvar id_prev; high]),
    
  •                            Cexit (raise_num,[]), Ctuple [])))))),
    
  •                           (Cop(Ccmpi tst, [Cvar id; high]),
    
  •                            Cexit (raise_num,[]), Ctuple []))))),
                Ctuple []))))
    
    | Uassign(id, exp) ->
    return_unit(Cassign(id, transl exp))
@vicuna
Copy link
Author

vicuna commented Apr 25, 2013

Comment author: @alainfrisch

"svn annotate" is your friend:

http://caml.inria.fr/cgi-bin/viewvc.cgi?view=revision&revision=5277

@vicuna
Copy link
Author

vicuna commented Apr 25, 2013

Comment author: vbrankov

Ups, sorry.

@vicuna
Copy link
Author

vicuna commented Apr 25, 2013

Comment author: @xavierleroy

See #415. I suspect your patch reintroduces this bug, namely

for i = ... to max_int do ... done

fails to terminate. Could you check whether it is the case?

@vicuna
Copy link
Author

vicuna commented Apr 25, 2013

Comment author: vbrankov

Yes, it does. How about this?

However, with this patch, although I think it generates simpler cmm code, I cannot reproduce the speedup. I'm checking whether I'm doing anything wrong.

Index: asmcomp/cmmgen.ml

--- asmcomp/cmmgen.ml (revision 13585)
+++ asmcomp/cmmgen.ml (working copy)
@@ -1243,7 +1243,6 @@
let tst = match dir with Upto -> Cgt | Downto -> Clt in
let inc = match dir with Upto -> Caddi | Downto -> Csubi in
let raise_num = next_raise_count () in

  •  let id_prev = Ident.rename id in
     return_unit
       (Clet
          (id, transl low,
    

@@ -1255,13 +1254,11 @@
Cloop
(Csequence
(remove_unit(transl body),

  •                     Clet(id_prev, Cvar id,
    
  •                      Csequence
    
  •                        (Cassign(id,
    
  •                           Cop(inc, [Cvar id; Cconst_int 2])),
    
  •                         Cifthenelse
    
  •                           (Cop(Ccmpi Ceq, [Cvar id_prev; high]),
    
  •                            Cexit (raise_num,[]), Ctuple [])))))),
    
  •                        Cifthenelse
    
  •                           (Cop(Ccmpi Ceq, [Cvar id; high]),
    
  •                            Cexit (raise_num,[]),
    
  •                           Cassign(id,
    
  •                            Cop(inc, [Cvar id; Cconst_int 2])))))),
                Ctuple []))))
    
    | Uassign(id, exp) ->
    return_unit(Cassign(id, transl exp))

@vicuna
Copy link
Author

vicuna commented Apr 25, 2013

Comment author: vbrankov

Or we can change the "if" condition to Cne so that Cexit is in the second leg. I'm not sure whether that would have any benefits.

@vicuna
Copy link
Author

vicuna commented Apr 26, 2013

Comment author: @xavierleroy

If I read your patch #2 correctly, you end up generating a loop with a conditional test in the middle and a jump at the end:

Lhead:
...
branch-if-finished Lafter
...
branch Lhead
Lafter:

instead of a loop with the conditional test at the bottom:

Lhead:
...
branch-if-not-finished Lhead

This doesn't look good to me.

What I had in mind was, rather, to precompute "max loop value plus 1 / minus 1" and compare the loop index after incrementation / decrementation against this precomputed value. I doubt this would make much of a performance difference wrt the code currently generated, though.

Concerning 3% speedups, see the recent discussion on caml-list: minute changes in code placement alone can result in much bigger slowdowns or speedups.

@vicuna
Copy link
Author

vicuna commented Apr 26, 2013

Comment author: @alainfrisch

Concerning 3% speedups, see the recent discussion on caml-list: minute changes in code placement alone can result in much bigger slowdowns or speedups.

Gabriel Scherer collects a set of realistic benchmarks, which will hopefully make it easier to assess the benefits of such optimizations. While I agree that a 3% speedup on one example does not give any useful information, if we can observe a 3% speedup on a reasonable number of benchmarks and no degradation on the others, it is definitely a good idea to consider the optimization as a good candidate. Do you agree? (Then other criteria apply, of course, such as complexity of the compiler, possible interactions with other optimizations, etc)

@vicuna
Copy link
Author

vicuna commented Apr 29, 2013

Comment author: vbrankov

I apologize for my somewhat messy and hastily written input. I agree with Xavier that I reversed the "if" conditions, and what I wondered in my previous comment is why after fixing it the code was not faster. I think I understand the reason and my new change produced a measurable speedup.

I validated the following change against this code:

let () =
for x = max_int - 3 to max_int do
Printf.printf "%d\n" x
done;
for x = min_int to min_int + 3 do
Printf.printf "%d\n" x
done;
for x = max_int downto max_int - 3 do
Printf.printf "%d\n" x
done;
for x = min_int + 3 downto min_int do
Printf.printf "%d\n" x
done;

I also validated the loop going from min_int to max_int and vice versa.

The change follows. I agree that it's not too much simpler and I understand the complications arising around max_int and min_int. It might have been simpler if "for" loop didn't include the last value :)

Index: asmcomp/cmmgen.ml

--- asmcomp/cmmgen.ml (revision 13585)
+++ asmcomp/cmmgen.ml (working copy)
@@ -1242,8 +1242,8 @@
| Ufor(id, low, high, dir, body) ->
let tst = match dir with Upto -> Cgt | Downto -> Clt in
let inc = match dir with Upto -> Caddi | Downto -> Csubi in

  •  let dec = match dir with Upto -> Csubi | Downto -> Caddi in
     let raise_num = next_raise_count () in
    
  •  let id_prev = Ident.rename id in
     return_unit
       (Clet
          (id, transl low,
    

@@ -1252,16 +1252,17 @@
(raise_num, [],
Cifthenelse
(Cop(Ccmpi tst, [Cvar id; high]), Cexit (raise_num, []),

  •                Cloop
    
  •                  (Csequence
    
  •                     (remove_unit(transl body),
    
  •                     Clet(id_prev, Cvar id,
    
  •                      Csequence
    
  •                Csequence
    
  •                  (Cassign(id, Cop(dec, [Cvar id; Cconst_int 2])),
    
  •                   Cloop
    
  •                     (Csequence
                           (Cassign(id,
                              Cop(inc, [Cvar id; Cconst_int 2])),
    
  •                         Cifthenelse
    
  •                           (Cop(Ccmpi Ceq, [Cvar id_prev; high]),
    
  •                            Cexit (raise_num,[]), Ctuple [])))))),
    
  •                         Csequence
    
  •                           (remove_unit(transl body),
    
  •                            Cifthenelse
    
  •                              (Cop(Ccmpi Ceq, [Cvar id; high]),
    
  •                               Cexit (raise_num,[]), Ctuple [])))))),
                Ctuple []))))
    
    | Uassign(id, exp) ->
    return_unit(Cassign(id, transl exp))

I tried these simple benchmarks and the results show a speedup. You can decide whether it's worth trying on your benchmarks which I'm sure are much better.

for1.ml:

let () =
for j = 1 to 16 do
let t = Unix.gettimeofday () in
let s = ref 0 in
for i = 0 to 256 * 1024 * 1024 do
s := !s + i
done;
Printf.printf "%f\n%!" (Unix.gettimeofday () -. t)
done

for2.ml:

let () =
for j = 1 to 16 do
let t = Unix.gettimeofday () in
for i = 0 to 256 * 1024 * 1024 do
let _ = () in ()
done;
Printf.printf "%f\n%!" (Unix.gettimeofday () -. t)
done

binomial.ml:

let print x =
Printf.printf "%f\n" x

let binomial =
let n = 31 in
let up_pow = Array.make (2 * n + 1) 0. in
let p = Array.make (n + 1) 0. in
fun t s k r sigma q ->
let delta_t = t /. (float_of_int n) in
let up = exp (sigma *. (sqrt delta_t)) in
let e = exp ((q -. r) *. delta_t) in
let p0 = (up -. e) /. (up *. up -. 1.) in
let p1 = e -. p0 in

let upp = ref 1. in
for i = 0 to n do
  up_pow.(n - i) <- 1. /. !upp;
  up_pow.(n + i) <- !upp;
  upp := !upp *. up
done;

for i = 0 to n do
  p.(i) <- max (k -. s *. up_pow.(2 * i)) 0.
done;

for j = n - 1 downto 0 do
  for i = 0 to j do
    p.(i) <- max (p0 *. p.(i + 1) +. p1 *. p.(i))
                 (k -. s *. up_pow.(2 * i - j + n))
  done
done;

p.(0)

let () =
for j = 1 to 16 do
let t = Unix.gettimeofday () in
for i = 1 to 16 * 1024 do
let _ = binomial 2. 51. 52. 0.05 0.30 0. in ()
done;
Printf.printf "%f\n%!" (Unix.gettimeofday () -. t)
done

Sandy Bridge E5-2678W:

$ ./for1o.exe
0.260819
0.260948
0.260801
0.260866
0.260827
0.260850
0.260839
0.260838
0.260805
0.260932
0.260837
0.260867
0.260853
0.260815
0.260880
0.260835
$ ./for1n.exe
0.260984
0.260826
0.260842
0.260806
0.260820
0.260875
0.260818
0.260833
0.260931
0.260856
0.260883
0.260831
0.260840
0.260823
0.260818
0.260916
$ ./for2o.exe
0.086952
0.086954
0.087016
0.086974
0.086926
0.086920
0.086960
0.086930
0.086975
0.086945
0.086949
0.086933
0.086921
0.086993
0.087036
0.086943
$ ./for2n.exe
0.163828
0.163704
0.163343
0.163945
0.163630
0.163942
0.163502
0.163694
0.163925
0.163866
0.163601
0.163771
0.164038
0.163980
0.163916
0.163751
$ ./binomialo.exe
0.207450
0.206657
0.206742
0.206671
0.206671
0.206698
0.206661
0.206738
0.206699
0.206717
0.206709
0.206715
0.206755
0.206704
0.206725
0.206675
$ ./binomialn.exe
0.213871
0.213428
0.213373
0.213341
0.213416
0.213362
0.213369
0.213388
0.213402
0.213394
0.213379
0.213426
0.213356
0.213453
0.213383
0.213424

Results on Nehalem W5580:

$ ./for1n.exe
0.174947
0.174892
0.174715
0.174037
0.174908
0.174869
0.175077
0.174716
0.174957
0.174934
0.173908
0.174368
0.174649
0.173923
0.174939
0.174617
$ ./for1o.exe
0.155822
0.156458
0.155934
0.155775
0.155727
0.155996
0.155818
0.155730
0.155991
0.155727
0.155741
0.155731
0.155782
0.155853
0.155822
0.156264
$ ./for2n.exe
0.156385
0.155981
0.156028
0.156326
0.156335
0.159161
0.155775
0.155884
0.155790
0.155858
0.155726
0.155789
0.155730
0.155765
0.155715
0.156170
$ ./for2o.exe
0.156091
0.156449
0.155923
0.155901
0.155962
0.155920
0.155927
0.155922
0.156145
0.156003
0.155985
0.155826
0.155922
0.155877
0.156098
0.155893
$ ./binomialn.exe
0.195871
0.197264
0.197959
0.198080
0.197767
0.197309
0.197652
0.197516
0.197584
0.197424
0.197679
0.197659
0.198057
0.197310
0.197322
0.197594
$ ./binomialo.exe
0.197047
0.194971
0.195267
0.197578
0.197261
0.197287
0.197569
0.197292
0.197625
0.197437
0.197564
0.197575
0.198512
0.197781
0.197244
0.197192

@vicuna
Copy link
Author

vicuna commented Jun 13, 2013

Comment author: vbrankov

I'm sorry, I see that this issue stalled. Do I need to do something about it?

@vicuna
Copy link
Author

vicuna commented Jun 14, 2013

Comment author: @xavierleroy

Do I need to do something about it?

No, thanks for asking. We're in code freeze for release 4.01, so this will have to wait for later.

@github-actions
Copy link

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant