Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0006125OCamlOCaml backend (code generation)public2013-08-06 20:032014-07-24 21:56
Reportervbrankov 
Assigned To 
PrioritynormalSeverityminorReproducibilityalways
StatusacknowledgedResolutionopen 
PlatformOSOS Version
Product Version4.00.1 
Target Versionafter-4.02.0Fixed in Version 
Summary0006125: Slow LEA on Sandy Bridge hurts(?) OCaml's integer arithmetic
DescriptionFrom Intel's manuals:

"Some instructions have an increased latency in IntelĀ® microarchitecture code name Sandy Bridge. Some LEA instructions, most notably three-operand LEA instructions, have increased latency and reduced dispatch port choices compared to other LEAs."

That's exactly what OCaml uses for integer arithmetic. Let's benchmark:

1. for j = 0 to len - 1 do
      s := !s + a.(j);
      s := !s + a.(j);
      s := !s + a.(j);
      s := !s + a.(j);
   done

2. for j = 0 to len - 1 do
      s := Int64.add !s a.(j);
      s := Int64.add !s a.(j);
      s := Int64.add !s a.(j);
      s := Int64.add !s a.(j);
   done

3. for j = 0 to len - 1 do
      s := !s + a.x;
      s := !s + a.x;
      s := !s + a.x;
      s := !s + a.x;
   done

4. for j = 0 to len - 1 do
      s := Int64.add !s a.y;
      s := Int64.add !s a.y;integer
      s := Int64.add !s a.y;
      s := Int64.add !s a.y;
   done

Nehalem:

1. 0.001237
2. 0.002479
3. 0.001237
4. 0.002452

vs. Sandy Bridge:

1. 0.004084
2. 0.001523
3. 0.004060
4. 0.001564

Ups, 63-bit code just got three times slower. 64-bit int got faster.

When we hack OCaml compiler to turn off LEA for integer arithmetic the situation gets better on Sandy Bridge.

Nehalem:

1. 0.002458
2. 0.002495
3. 0.002447
4. 0.002477

Sandy Bridge:

1. 0.002737
2. 0.001474
3. 0.002726
4. 0.001545

The only idea to tackle this, apart from switching to 64-bit integers, is to optionally turn off three operand LEA for integer arithmetic or to replace it with two operand LEA + int operation.
Tagspatch
Attached Files? file icon test.ml [^] (1,546 bytes) 2013-08-06 20:03 [Show Content]
txt file icon lea-patch.txt [^] (1,317 bytes) 2013-08-14 15:50 [Show Content]
? file icon lea-bench.ml [^] (1,008 bytes) 2013-08-14 15:51 [Show Content]
txt file icon lea-patch-2.txt [^] (1,316 bytes) 2013-08-14 17:29 [Show Content]

- Relationships

-  Notes
(0010148)
chambart (developer)
2013-08-07 21:59

This may be the case for this version of processors (it applies also to current AMD), but it may be completely different in the future iteration. Notice that gcc and llvm continue to use lea (even with the right -march option)
(0010162)
xleroy (administrator)
2013-08-12 17:19

Thanks for the info concerning Sandy Bridge. To reduce interference from other instructions generated by ocamlopt, I wrote similar tests directly in assembly. The results are below. N is a Nehalem-generation Xeon and SB a Sandy Bridge Xeon. The timings (brutally rounded) are normalized so that a single "addq" is 1.

N SB
1 1 addq y, x
1 1 lea (x, y), x
1 3 lea -1(x, y), x
2 2 addq y, x; decq x (simulates lea -1(x, y), x)
2 2 addq y, x; subq $1, x (ditto)
2 2 lea (x, y), z; decq z (simulates lea -1(x, y), z)
3 3 addq y, x; decq x; mov x, z (ditto)

So, yes, 3-operand lea is definitely slow on Sandy Bridge. Breaking the lea in two instructions is beneficial for SB, but not for Nehalem. Also, this could increase instruction decoding costs -- they don't show up in these tests because they are tight loops.

I'm unsure what to do about this. Especially since Intel's next generation might very well restore a fast lea. What about AMD chips?

(0010170)
vbrankov (reporter)
2013-08-13 16:52

If there are just two operands, you may want to consider this, which takes about one cycle on both Nehalem and Sandy Bridge (and also might be easy to implement in the compiler):

        movl %ecx, %edx
        xor $1, %edx
        addl %edx, %eax
(0010172)
vbrankov (reporter)
2013-08-13 17:00

Embarrasing, I used 32bit registers. The speed is the same with 64 bit ones.
(0010177)
vbrankov (reporter)
2013-08-14 16:12
edited on: 2013-08-14 16:17

I've patched the compiler and looks like this solution produces fast integer addition on both Sandy Bridge and Nehalem. As a nice addition, it almost doubles the speed of integer subtraction.

1:
  let rec loop x y =
    let x = x - y in
    let x = x - y in
    let x = x - y in
    let x = x - y in
    let x = x - y in
    let x = x - y in
    let x = x - y in
    let x = x - y in
    if x < 1000000000 then
      loop x y
  in
  loop 1 (-1);

2:
  let rec loop x y =
    let x = x + y in
    let x = x + y in
    let x = x + y in
    let x = x + y in
    let x = x + y in
    let x = x + y in
    let x = x + y in
    let x = x + y in

    if x < 1000000000 then
      loop x y
  in
  loop 1 1;

Nehalem no patch:
0.591611
0.290498

Nehalem with patch:
0.314693
0.314470

Sandy Bridge no patch:
0.650072
0.971930

Sandy Bridge with patch:
0.351132
0.351506

I've uploaded the proposed patch and the benchmark.

(0010179)
vbrankov (reporter)
2013-08-14 17:31

Please use lea-patch-2.txt instead of lea-patch.txt because it's more general.

To be precise, subtraction is faster because register dependency is broken and it's faster only when the dependency is on the first operand, which my guess happens more often.
(0010180)
xleroy (administrator)
2013-08-14 17:51

Watch out! In your first patch, the transformations x + y - 1 -> x + (y ^ 1) and x - y + 1 -> x - (y ^ 1) are not correct in general, but only when y is odd. (Consider x = y = 0.) So, you can't apply them blindly in selection.ml. cmmgen.ml might elect to use them when it knows that the arguments are 2n+1-encoded ints, but as we discussed previously it's not always a win depending on the target architecture.

(Besides, you're also using one extra temporary register than the lea code pattern, which can cause more moves or spills. You won't see this negative effect on microbenchmarks.)

Care to explain what your second patch does?
(0010181)
vbrankov (reporter)
2013-08-14 18:53

I'm sorry, I realized this but relied on detailed scrutiny of code reviewers. Let me take more time and try to explain this patch.

In its simplest form the patch does the following cmm replacement:

(+ (+ x y) const) -> (+ x (+ y const))
(+ (- x y) const) -> (- x (- y const))

Since it just reorders operations, I suppose it's correct. Overflows should not be a problem with addition, I can't tell about subtraction.

The assembly of a two variable addition without and with the patch is roughly:

#1: lea i(a, b), b
#2: mov a, d
    add i, d
    add d, b

"mov" should not be a problem for the instruction scheduler since it can do 6 instructions per cycle. The instruction decoder can do 4 simple instructions per cycle, so it may also not create additional load. The additional register's lifetime is very short and spilling may happen rarely enough not to outweigh the benefits. A set of benchmarks would definitely help get a better feeling of whether I'm wrong.

The instruction "add i, d" can execute before "b" is computed. Hence, in the benchmark the whole block has throughput of 1. LEA has throughput of 3 because it cannot start executing until all the operands are computed. This is also the explanation for the performance with the subtraction. Still, it may cause a decrease of performance on Nehalem when there's a dependency on "d". I don't have a definitive answer to this, maybe an explicit option be necessary.
(0010183)
vbrankov (reporter)
2013-08-15 14:46
edited on: 2013-08-15 14:57

I thought a little bit about the patch and for LEA it's not a clear win on anything except Sandy Bridge.

The part that deals with subtraction looks better on newer CPUs, but I'll move it to a separate thread when I get to it.

(0010533)
vbrankov (reporter)
2013-10-28 14:33

Somewhere along this thread there was a doubt whether it's worth working on this because the slow LEA may not persist in Intel CPUs after Sandy Bridge. This document shows that slow LEA exists in both subsequent generations of CPUs, Ivy Bridge and Haswell.

http://www.agner.org/optimize/instruction_tables.pdf [^]

- Issue History
Date Modified Username Field Change
2013-08-06 20:03 vbrankov New Issue
2013-08-06 20:03 vbrankov File Added: test.ml
2013-08-07 21:59 chambart Note Added: 0010148
2013-08-12 17:19 xleroy Note Added: 0010162
2013-08-12 17:19 xleroy Status new => acknowledged
2013-08-13 16:52 vbrankov Note Added: 0010170
2013-08-13 17:00 vbrankov Note Added: 0010172
2013-08-14 15:50 vbrankov File Added: lea-patch.txt
2013-08-14 15:51 vbrankov File Added: lea-bench.ml
2013-08-14 16:12 vbrankov Note Added: 0010177
2013-08-14 16:17 vbrankov Note Edited: 0010177 View Revisions
2013-08-14 16:17 vbrankov Note Edited: 0010177 View Revisions
2013-08-14 17:29 vbrankov File Added: lea-patch-2.txt
2013-08-14 17:31 vbrankov Note Added: 0010179
2013-08-14 17:51 xleroy Note Added: 0010180
2013-08-14 18:53 vbrankov Note Added: 0010181
2013-08-15 14:46 vbrankov Note Added: 0010183
2013-08-15 14:50 vbrankov Note Edited: 0010183 View Revisions
2013-08-15 14:57 vbrankov Note Edited: 0010183 View Revisions
2013-08-19 19:11 doligez Target Version => 4.01.1+dev
2013-10-28 14:33 vbrankov Note Added: 0010533
2014-01-17 17:47 doligez Tag Attached: patch
2014-05-25 20:20 doligez Target Version 4.01.1+dev => 4.02.0+dev
2014-07-24 21:56 doligez Target Version 4.02.0+dev => after-4.02.0


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker