
View Issue Details
ID:               0006125
Project:          OCaml
Category:         Ocaml optimization
View Status:      public
Date Submitted:   2013-08-06 20:03
Last Update:      2015-12-03 13:50
Reporter:         vbrankov
Assigned To:      (none)
Priority:         normal
Severity:         minor
Reproducibility:  always
Status:           acknowledged
Resolution:       open
Platform / OS / OS Version: (not specified)
Product Version:  4.00.1
Target Version:   later
Fixed in Version: (none)
Summary: 0006125: Slow LEA on Sandy Bridge hurts(?) OCaml's integer arithmetic
Description:

From Intel's manuals:

"Some instructions have an increased latency in IntelĀ® microarchitecture code name Sandy Bridge. Some LEA instructions, most notably three-operand LEA instructions, have increased latency and reduced dispatch port choices compared to other LEAs."

That's exactly what OCaml uses for integer arithmetic. Let's benchmark:

1. for j = 0 to len - 1 do
s := !s + a.(j);
s := !s + a.(j);
s := !s + a.(j);
s := !s + a.(j);
done

2. for j = 0 to len - 1 do
s := Int64.add !s a.(j);
s := Int64.add !s a.(j);
s := Int64.add !s a.(j);
s := Int64.add !s a.(j);
done

3. for j = 0 to len - 1 do
s := !s + a.x;
s := !s + a.x;
s := !s + a.x;
s := !s + a.x;
done

4. for j = 0 to len - 1 do
s := Int64.add !s a.y;
s := Int64.add !s a.y;
s := Int64.add !s a.y;
s := Int64.add !s a.y;
done
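The four variants above were presumably driven by a timing harness along these lines (a sketch only; the actual benchmark is the attached test.ml, whose contents may differ):

```ocaml
(* Hypothetical harness in the spirit of the attached test.ml: times a
   thunk with Sys.time and prints elapsed seconds. *)
let time label f =
  let t0 = Sys.time () in
  f ();
  Printf.printf "%s. %f\n" label (Sys.time () -. t0)

let () =
  let len = 1_000_000 in
  let a = Array.make len 1 in
  let s = ref 0 in
  time "1" (fun () ->
    for j = 0 to len - 1 do
      s := !s + a.(j);
      s := !s + a.(j);
      s := !s + a.(j);
      s := !s + a.(j)
    done);
  (* each iteration adds 4, so the sum is 4 * len *)
  assert (!s = 4 * len)
```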

Nehalem:

1. 0.001237
2. 0.002479
3. 0.001237
4. 0.002452

vs. Sandy Bridge:

1. 0.004084
2. 0.001523
3. 0.004060
4. 0.001564

Oops: the 63-bit tagged-integer code just got three times slower, while the 64-bit Int64 code got faster.

When we hack the OCaml compiler to stop using LEA for integer arithmetic, the situation improves on Sandy Bridge.

Nehalem:

1. 0.002458
2. 0.002495
3. 0.002447
4. 0.002477

Sandy Bridge:

1. 0.002737
2. 0.001474
3. 0.002726
4. 0.001545

The only ideas for tackling this, apart from switching to 64-bit integers, are to optionally stop emitting three-operand LEA for integer arithmetic, or to replace it with a two-operand LEA plus an integer instruction.
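For context on why OCaml's `+` turns into a three-operand LEA at all: a native OCaml int n is represented as the tagged machine word 2n+1, so adding two tagged words and re-tagging is x + y - 1, which fits exactly one `lea -1(x, y), dst`. A minimal sketch of the encoding (the helper names are illustrative only; the compiler does this implicitly):

```ocaml
(* OCaml represents an int n as the machine word 2*n + 1 (low bit is the
   tag).  These helpers simulate the encoding; they are not compiler code. *)
let tag n = 2 * n + 1
let untag w = w asr 1

(* Adding two tagged words: (2a+1) + (2b+1) - 1 = 2(a+b) + 1, i.e. one
   three-operand lea: lea -1(x, y), dst *)
let tagged_add x y = x + y - 1

let () =
  assert (tagged_add (tag 3) (tag 4) = tag 7);
  assert (untag (tagged_add (tag 20) (tag 22)) = 42)
```

Int64.t values, by contrast, are boxed 64-bit machine words, so `Int64.add` compiles to a plain `add` with no tag correction, which is why variants 2 and 4 dodge the slow LEA.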
Tags: patch
Attached Files:
  test.ml (1,546 bytes) 2013-08-06 20:03
  lea-patch.txt (1,317 bytes) 2013-08-14 15:50
  lea-bench.ml (1,008 bytes) 2013-08-14 15:51
  lea-patch-2.txt (1,316 bytes) 2013-08-14 17:29

Notes

chambart (developer) 2013-08-07 21:59
This may be the case for this generation of processors (it also applies to current AMD), but it may be completely different in future iterations. Notice that gcc and llvm continue to use lea (even with the right -march option).

xleroy (administrator) 2013-08-12 17:19
Thanks for the info concerning Sandy Bridge. To reduce interference from other instructions generated by ocamlopt, I wrote similar tests directly in assembly. The results are below. N is a Nehalem-generation Xeon and SB a Sandy Bridge Xeon. The timings (brutally rounded) are normalized so that a single "addq" is 1.

    N  SB
    1  1   addq y, x
    1  1   lea (x, y), x
    1  3   lea -1(x, y), x
    2  2   addq y, x; decq x             (simulates lea -1(x, y), x)
    2  2   addq y, x; subq $1, x         (ditto)
    2  2   lea (x, y), z; decq z         (simulates lea -1(x, y), z)
    3  3   addq y, x; decq x; mov x, z   (ditto)

So, yes, 3-operand lea is definitely slow on Sandy Bridge. Breaking the lea into two instructions is beneficial for SB, but not for Nehalem. Also, this could increase instruction decoding costs -- they don't show up in these tests because they are tight loops. I'm unsure what to do about this, especially since Intel's next generation might very well restore a fast lea. What about AMD chips?

vbrankov (reporter) 2013-08-13 16:52
If there are just two operands, you may want to consider this, which takes about one cycle on both Nehalem and Sandy Bridge (and also might be easy to implement in the compiler):

    movl %ecx, %edx
    xor  $1, %edx
    addl %edx, %eax

vbrankov (reporter) 2013-08-13 17:00
Embarrassing -- I used 32-bit registers. The speed is the same with 64-bit ones.

vbrankov (reporter) 2013-08-14 16:12 (edited 2013-08-14 16:17)
I've patched the compiler, and it looks like this solution produces fast integer addition on both Sandy Bridge and Nehalem. As a nice bonus, it almost doubles the speed of integer subtraction.

1.  let rec loop x y =
      let x = x - y in
      let x = x - y in
      let x = x - y in
      let x = x - y in
      let x = x - y in
      let x = x - y in
      let x = x - y in
      let x = x - y in
      if x < 1000000000 then
        loop x y
    in
    loop 1 (-1);

2.  let rec loop x y =
      let x = x + y in
      let x = x + y in
      let x = x + y in
      let x = x + y in
      let x = x + y in
      let x = x + y in
      let x = x + y in
      let x = x + y in
      if x < 1000000000 then
        loop x y
    in
    loop 1 1;

Results (benchmark 1, benchmark 2):

    Nehalem no patch:         0.591611  0.290498
    Nehalem with patch:       0.314693  0.314470
    Sandy Bridge no patch:    0.650072  0.971930
    Sandy Bridge with patch:  0.351132  0.351506

I've uploaded the proposed patch and the benchmark.

vbrankov (reporter) 2013-08-14 17:31
Please use lea-patch-2.txt instead of lea-patch.txt, because it's more general. To be precise, subtraction is faster because a register dependency is broken, and it's faster only when the dependency is on the first operand, which, I guess, happens more often.

xleroy (administrator) 2013-08-14 17:51
Watch out! In your first patch, the transformations x + y - 1 -> x + (y ^ 1) and x - y + 1 -> x - (y ^ 1) are not correct in general, but only when y is odd. (Consider x = y = 0.) So you can't apply them blindly in selection.ml. cmmgen.ml might elect to use them when it knows that the arguments are 2n+1-encoded ints, but as we discussed previously it's not always a win, depending on the target architecture. (Besides, you're also using one extra temporary register compared to the lea code pattern, which can cause more moves or spills. You won't see this negative effect on microbenchmarks.) Care to explain what your second patch does?

vbrankov (reporter) 2013-08-14 18:53
I'm sorry -- I realized this but relied on the detailed scrutiny of code reviewers. Let me take more time and try to explain this patch.

In its simplest form, the patch performs the following Cmm replacement:

    (+ (+ x y) const) -> (+ x (+ y const))
    (+ (- x y) const) -> (- x (- y const))

Since it just reorders operations, I suppose it's correct. Overflows should not be a problem with addition; I can't tell about subtraction. The assembly of a two-variable addition without and with the patch is roughly:

    #1: lea i(a, b), b

    #2: mov a, d
        add i, d
        add d, b

"mov" should not be a problem for the instruction scheduler, since it can do 6 instructions per cycle. The instruction decoder can do 4 simple instructions per cycle, so it may also not create additional load. The additional register's lifetime is very short, and spilling may happen rarely enough not to outweigh the benefits. A set of benchmarks would definitely help get a better feeling of whether I'm wrong.

The instruction "add i, d" can execute before "b" is computed; hence, in the benchmark, the whole block has a throughput of 1. LEA has a throughput of 3 because it cannot start executing until all of its operands are computed. This also explains the subtraction performance. Still, it may cause a decrease in performance on Nehalem when there's a dependency on "d". I don't have a definitive answer to this; maybe an explicit option will be necessary.

vbrankov (reporter) 2013-08-15 14:46 (edited 2013-08-15 14:57)
I thought a little more about the patch, and for LEA it's not a clear win on anything except Sandy Bridge. The part that deals with subtraction looks better on newer CPUs, but I'll move it to a separate thread when I get to it.

vbrankov (reporter) 2013-10-28 14:33
Somewhere along this thread there was a doubt about whether it's worth working on this, because the slow LEA might not persist in Intel CPUs after Sandy Bridge. This document shows that the slow LEA exists in both subsequent generations of CPUs, Ivy Bridge and Haswell: http://www.agner.org/optimize/instruction_tables.pdf
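The reassociation described in the notes above can be sketched on a toy expression type (a hypothetical mini-AST for illustration, not the compiler's actual Cmm representation):

```ocaml
(* Toy expression type standing in for Cmm terms (hypothetical). *)
type expr =
  | Const of int
  | Var of string
  | Add of expr * expr
  | Sub of expr * expr

(* The patch's rewrite:
     (+ (+ x y) const) -> (+ x (+ y const))
     (+ (- x y) const) -> (- x (- y const))
   Folding the constant into the second operand first lets the
   "add i, d" start before the other addend is available. *)
let reassoc = function
  | Add (Add (x, y), Const c) -> Add (x, Add (y, Const c))
  | Add (Sub (x, y), Const c) -> Sub (x, Sub (y, Const c))
  | e -> e

(* A tiny evaluator to check that the rewrite preserves meaning
   (ignoring overflow, as the note does for addition). *)
let rec eval env = function
  | Const c -> c
  | Var v -> List.assoc v env
  | Add (a, b) -> eval env a + eval env b
  | Sub (a, b) -> eval env a - eval env b

let () =
  let env = ["x", 10; "y", 3] in
  let e = Add (Sub (Var "x", Var "y"), Const (-1)) in
  assert (eval env e = eval env (reassoc e))
```

Note that both sides compute the same value by associativity; what changes is only the dependency structure of the generated instructions.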

Issue History

    2013-08-06 20:03  vbrankov  New Issue
    2013-08-06 20:03  vbrankov  File Added: test.ml
    2013-08-07 21:59  chambart  Note Added: 0010148
    2013-08-12 17:19  xleroy    Note Added: 0010162
    2013-08-12 17:19  xleroy    Status: new => acknowledged
    2013-08-13 16:52  vbrankov  Note Added: 0010170
    2013-08-13 17:00  vbrankov  Note Added: 0010172
    2013-08-14 15:50  vbrankov  File Added: lea-patch.txt
    2013-08-14 15:51  vbrankov  File Added: lea-bench.ml
    2013-08-14 16:12  vbrankov  Note Added: 0010177
    2013-08-14 16:17  vbrankov  Note Edited: 0010177
    2013-08-14 16:17  vbrankov  Note Edited: 0010177
    2013-08-14 17:29  vbrankov  File Added: lea-patch-2.txt
    2013-08-14 17:31  vbrankov  Note Added: 0010179
    2013-08-14 17:51  xleroy    Note Added: 0010180
    2013-08-14 18:53  vbrankov  Note Added: 0010181
    2013-08-15 14:46  vbrankov  Note Added: 0010183
    2013-08-15 14:50  vbrankov  Note Edited: 0010183
    2013-08-15 14:57  vbrankov  Note Edited: 0010183
    2013-08-19 19:11  doligez   Target Version => 4.01.1+dev
    2013-10-28 14:33  vbrankov  Note Added: 0010533
    2014-01-17 17:47  doligez   Tag Attached: patch
    2014-05-25 20:20  doligez   Target Version: 4.01.1+dev => 4.02.0+dev
    2014-07-24 21:56  doligez   Target Version: 4.02.0+dev => 4.02.1+dev
    2014-09-04 00:25  doligez   Target Version: 4.02.1+dev => undecided
    2014-09-24 00:01  doligez   Category: OCaml backend (code generation) => Ocaml optimization
    2014-09-24 00:01  doligez   Target Version: undecided => 4.02.2+dev / +rc1
    2015-01-16 23:50  doligez   Target Version: 4.02.2+dev / +rc1 => 4.03.0+dev / +beta1
    2015-12-03 13:50  frisch    Target Version: 4.03.0+dev / +beta1 => later
