Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Slow LEA on Sandy Bridge hurts(?) OCaml's integer arithmetic #6125

Closed
vicuna opened this issue Aug 6, 2013 · 15 comments
Closed

Slow LEA on Sandy Bridge hurts(?) OCaml's integer arithmetic #6125

vicuna opened this issue Aug 6, 2013 · 15 comments

Comments

@vicuna
Copy link

vicuna commented Aug 6, 2013

Original bug ID: 6125
Reporter: vbrankov
Status: acknowledged (set by @xavierleroy on 2013-08-12T15:19:39Z)
Resolution: open
Priority: normal
Severity: feature
Version: 4.00.1
Category: back end (clambda to assembly)
Tags: optimization, patch
Monitored by: @gasche @diml @ygrek @yakobowski @xavierleroy

Bug description

From Intel's manuals:

"Some instructions have an increased latency in Intel® microarchitecture code name Sandy Bridge. Some LEA instructions, most notably three-operand LEA instructions, have increased latency and reduced dispatch port choices compared to other LEAs."

That's exactly what OCaml uses for integer arithmetic. Let's benchmark:

  1. for j = 0 to len - 1 do
    s := !s + a.(j);
    s := !s + a.(j);
    s := !s + a.(j);
    s := !s + a.(j);
    done

  2. for j = 0 to len - 1 do
    s := Int64.add !s a.(j);
    s := Int64.add !s a.(j);
    s := Int64.add !s a.(j);
    s := Int64.add !s a.(j);
    done

  3. for j = 0 to len - 1 do
    s := !s + a.x;
    s := !s + a.x;
    s := !s + a.x;
    s := !s + a.x;
    done

  4. for j = 0 to len - 1 do
    s := Int64.add !s a.y;
    s := Int64.add !s a.y;integer
    s := Int64.add !s a.y;
    s := Int64.add !s a.y;
    done

Nehalem:

  1. 0.001237
  2. 0.002479
  3. 0.001237
  4. 0.002452

vs. Sandy Bridge:

  1. 0.004084
  2. 0.001523
  3. 0.004060
  4. 0.001564

Ups, 63-bit code just got three times slower. 64-bit int got faster.

When we hack OCaml compiler to turn off LEA for integer arithmetic the situation gets better on Sandy Bridge.

Nehalem:

  1. 0.002458
  2. 0.002495
  3. 0.002447
  4. 0.002477

Sandy Bridge:

  1. 0.002737
  2. 0.001474
  3. 0.002726
  4. 0.001545

The only idea to tackle this, apart from switching to 64-bit integers, is to optionally turn off three operand LEA for integer arithmetic or to replace it with two operand LEA + int operation.

File attachments

@vicuna
Copy link
Author

vicuna commented Aug 7, 2013

Comment author: @chambart

This may be the case for this version of processors (it applies also to current AMD), but it may be completely different in the future iteration. Notice that gcc and llvm continue to use lea (even with the right -march option)

@vicuna
Copy link
Author

vicuna commented Aug 12, 2013

Comment author: @xavierleroy

Thanks for the info concerning Sandy Bridge. To reduce interference from other instructions generated by ocamlopt, I wrote similar tests directly in assembly. The results are below. N is a Nehalem-generation Xeon and SB a Sandy Bridge Xeon. The timings (brutally rounded) are normalized so that a single "addq" is 1.

N SB
1 1 addq y, x
1 1 lea (x, y), x
1 3 lea -1(x, y), x
2 2 addq y, x; decq x (simulates lea -1(x, y), x)
2 2 addq y, x; subq $1, x (ditto)
2 2 lea (x, y), z; decq z (simulates lea -1(x, y), z)
3 3 addq y, x; decq x; mov x, z (ditto)

So, yes, 3-operand lea is definitely slow on Sandy Bridge. Breaking the lea in two instructions is beneficial for SB, but not for Nehalem. Also, this could increase instruction decoding costs -- they don't show up in these tests because they are tight loops.

I'm unsure what to do about this. Especially since Intel's next generation might very well restore a fast lea. What about AMD chips?

@vicuna
Copy link
Author

vicuna commented Aug 13, 2013

Comment author: vbrankov

If there are just two operands, you may want to consider this, which takes about one cycle on both Nehalem and Sandy Bridge (and also might be easy to implement in the compiler):

    movl    %ecx, %edx
    xor     $1, %edx
    addl    %edx, %eax

@vicuna
Copy link
Author

vicuna commented Aug 13, 2013

Comment author: vbrankov

Embarrasing, I used 32bit registers. The speed is the same with 64 bit ones.

@vicuna
Copy link
Author

vicuna commented Aug 14, 2013

Comment author: vbrankov

I've patched the compiler and looks like this solution produces fast integer addition on both Sandy Bridge and Nehalem. As a nice addition, it almost doubles the speed of integer subtraction.

1:
let rec loop x y =
let x = x - y in
let x = x - y in
let x = x - y in
let x = x - y in
let x = x - y in
let x = x - y in
let x = x - y in
let x = x - y in
if x < 1000000000 then
loop x y
in
loop 1 (-1);

2:
let rec loop x y =
let x = x + y in
let x = x + y in
let x = x + y in
let x = x + y in
let x = x + y in
let x = x + y in
let x = x + y in
let x = x + y in

if x < 1000000000 then
  loop x y

in
loop 1 1;

Nehalem no patch:
0.591611
0.290498

Nehalem with patch:
0.314693
0.314470

Sandy Bridge no patch:
0.650072
0.971930

Sandy Bridge with patch:
0.351132
0.351506

I've uploaded the proposed patch and the benchmark.

@vicuna
Copy link
Author

vicuna commented Aug 14, 2013

Comment author: vbrankov

Please use lea-patch-2.txt instead of lea-patch.txt because it's more general.

To be precise, subtraction is faster because register dependency is broken and it's faster only when the dependency is on the first operand, which my guess happens more often.

@vicuna
Copy link
Author

vicuna commented Aug 14, 2013

Comment author: @xavierleroy

Watch out! In your first patch, the transformations x + y - 1 -> x + (y ^ 1) and x - y + 1 -> x - (y ^ 1) are not correct in general, but only when y is odd. (Consider x = y = 0.) So, you can't apply them blindly in selection.ml. cmmgen.ml might elect to use them when it knows that the arguments are 2n+1-encoded ints, but as we discussed previously it's not always a win depending on the target architecture.

(Besides, you're also using one extra temporary register than the lea code pattern, which can cause more moves or spills. You won't see this negative effect on microbenchmarks.)

Care to explain what your second patch does?

@vicuna
Copy link
Author

vicuna commented Aug 14, 2013

Comment author: vbrankov

I'm sorry, I realized this but relied on detailed scrutiny of code reviewers. Let me take more time and try to explain this patch.

In its simplest form the patch does the following cmm replacement:

(+ (+ x y) const) -> (+ x (+ y const))
(+ (- x y) const) -> (- x (- y const))

Since it just reorders operations, I suppose it's correct. Overflows should not be a problem with addition, I can't tell about subtraction.

The assembly of a two variable addition without and with the patch is roughly:

#1: lea i(a, b), b
#2: mov a, d
add i, d
add d, b

"mov" should not be a problem for the instruction scheduler since it can do 6 instructions per cycle. The instruction decoder can do 4 simple instructions per cycle, so it may also not create additional load. The additional register's lifetime is very short and spilling may happen rarely enough not to outweigh the benefits. A set of benchmarks would definitely help get a better feeling of whether I'm wrong.

The instruction "add i, d" can execute before "b" is computed. Hence, in the benchmark the whole block has throughput of 1. LEA has throughput of 3 because it cannot start executing until all the operands are computed. This is also the explanation for the performance with the subtraction. Still, it may cause a decrease of performance on Nehalem when there's a dependency on "d". I don't have a definitive answer to this, maybe an explicit option be necessary.

@vicuna
Copy link
Author

vicuna commented Aug 15, 2013

Comment author: vbrankov

I thought a little bit about the patch and for LEA it's not a clear win on anything except Sandy Bridge.

The part that deals with subtraction looks better on newer CPUs, but I'll move it to a separate thread when I get to it.

@vicuna
Copy link
Author

vicuna commented Oct 28, 2013

Comment author: vbrankov

Somewhere along this thread there was a doubt whether it's worth working on this because the slow LEA may not persist in Intel CPUs after Sandy Bridge. This document shows that slow LEA exists in both subsequent generations of CPUs, Ivy Bridge and Haswell.

http://www.agner.org/optimize/instruction_tables.pdf

@vicuna
Copy link
Author

vicuna commented Dec 8, 2016

Comment author: @mshinwell

@xLeroy Do you have an opinion on this given that the slow LEA appears to be persisting (according to the reference above, at least)?

@vicuna
Copy link
Author

vicuna commented Feb 24, 2017

Comment author: @damiendoligez

La discussion se termine par une question pour toi.

@vicuna
Copy link
Author

vicuna commented Feb 24, 2017

Comment author: @xavierleroy

I re-did my measurements on more recent Intel processors, see below. 3-argument LEA remains slower. But I'm still not convinced that going from 3-cycle LEA to a 2-cycle larger instruction sequence is a win.

NH SB IB SK
1 1 1 1 addq y, x
1 1 1 1 lea (x, y), x
1 3 3 3 lea -1(x, y), x
2 2 2 2 addq y, x; decq x (simulates lea -1(x, y), x)
2 2 2 2 addq y, x; subq $1, x (ditto)
2 2 2 2 lea (x, y), z; decq z (simulates lea -1(x, y), z)
3 3 2 2 addq y, x; decq x; mov x, z (ditto)

NH = Nehalem Xeon X3430
SB = Sandy Bridge Xeon E5-1620
IB = Ivy Bridge Core i7-3770
SK = Skylake Xeon E3-1270 v5

@github-actions
Copy link

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

@github-actions github-actions bot added the Stale label May 15, 2020
@stedolan stedolan removed the Stale label May 20, 2020
@github-actions
Copy link

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants