Slow LEA on Sandy Bridge hurts(?) OCaml's integer arithmetic #6125

vicuna · 2013-08-06T18:03:13Z

Original bug ID: 6125
Reporter: vbrankov
Status: acknowledged (set by @xavierleroy on 2013-08-12T15:19:39Z)
Resolution: open
Priority: normal
Severity: feature
Version: 4.00.1
Category: back end (clambda to assembly)
Tags: optimization, patch
Monitored by: @gasche @diml @ygrek @yakobowski @xavierleroy

Bug description

From Intel's manuals:

"Some instructions have an increased latency in Intel® microarchitecture code name Sandy Bridge. Some LEA instructions, most notably three-operand LEA instructions, have increased latency and reduced dispatch port choices compared to other LEAs."

That's exactly what OCaml uses for integer arithmetic. Let's benchmark:

for j = 0 to len - 1 do
s := !s + a.(j);
s := !s + a.(j);
s := !s + a.(j);
s := !s + a.(j);
done
for j = 0 to len - 1 do
s := Int64.add !s a.(j);
s := Int64.add !s a.(j);
s := Int64.add !s a.(j);
s := Int64.add !s a.(j);
done
for j = 0 to len - 1 do
s := !s + a.x;
s := !s + a.x;
s := !s + a.x;
s := !s + a.x;
done
for j = 0 to len - 1 do
s := Int64.add !s a.y;
s := Int64.add !s a.y;integer
s := Int64.add !s a.y;
s := Int64.add !s a.y;
done

Nehalem:

0.001237
0.002479
0.001237
0.002452

vs. Sandy Bridge:

0.004084
0.001523
0.004060
0.001564

Ups, 63-bit code just got three times slower. 64-bit int got faster.

When we hack OCaml compiler to turn off LEA for integer arithmetic the situation gets better on Sandy Bridge.

Nehalem:

0.002458
0.002495
0.002447
0.002477

Sandy Bridge:

0.002737
0.001474
0.002726
0.001545

The only idea to tackle this, apart from switching to 64-bit integers, is to optionally turn off three operand LEA for integer arithmetic or to replace it with two operand LEA + int operation.

File attachments

vicuna · 2013-08-07T19:59:43Z

Comment author: @chambart

This may be the case for this version of processors (it applies also to current AMD), but it may be completely different in the future iteration. Notice that gcc and llvm continue to use lea (even with the right -march option)

vicuna · 2013-08-12T15:19:39Z

Comment author: @xavierleroy

Thanks for the info concerning Sandy Bridge. To reduce interference from other instructions generated by ocamlopt, I wrote similar tests directly in assembly. The results are below. N is a Nehalem-generation Xeon and SB a Sandy Bridge Xeon. The timings (brutally rounded) are normalized so that a single "addq" is 1.

N SB
1 1 addq y, x
1 1 lea (x, y), x
1 3 lea -1(x, y), x
2 2 addq y, x; decq x (simulates lea -1(x, y), x)
2 2 addq y, x; subq $1, x (ditto)
2 2 lea (x, y), z; decq z (simulates lea -1(x, y), z)
3 3 addq y, x; decq x; mov x, z (ditto)

So, yes, 3-operand lea is definitely slow on Sandy Bridge. Breaking the lea in two instructions is beneficial for SB, but not for Nehalem. Also, this could increase instruction decoding costs -- they don't show up in these tests because they are tight loops.

I'm unsure what to do about this. Especially since Intel's next generation might very well restore a fast lea. What about AMD chips?

vicuna · 2013-08-13T14:52:06Z

Comment author: vbrankov

If there are just two operands, you may want to consider this, which takes about one cycle on both Nehalem and Sandy Bridge (and also might be easy to implement in the compiler):

    movl    %ecx, %edx
    xor     $1, %edx
    addl    %edx, %eax

vicuna · 2013-08-13T15:00:03Z

Comment author: vbrankov

Embarrasing, I used 32bit registers. The speed is the same with 64 bit ones.

vicuna · 2013-08-14T14:12:33Z

Comment author: vbrankov

I've patched the compiler and looks like this solution produces fast integer addition on both Sandy Bridge and Nehalem. As a nice addition, it almost doubles the speed of integer subtraction.

1:
let rec loop x y =
let x = x - y in
let x = x - y in
let x = x - y in
let x = x - y in
let x = x - y in
let x = x - y in
let x = x - y in
let x = x - y in
if x < 1000000000 then
loop x y
in
loop 1 (-1);

2:
let rec loop x y =
let x = x + y in
let x = x + y in
let x = x + y in
let x = x + y in
let x = x + y in
let x = x + y in
let x = x + y in
let x = x + y in

if x < 1000000000 then
  loop x y

in
loop 1 1;

Nehalem no patch:
0.591611
0.290498

Nehalem with patch:
0.314693
0.314470

Sandy Bridge no patch:
0.650072
0.971930

Sandy Bridge with patch:
0.351132
0.351506

I've uploaded the proposed patch and the benchmark.

vicuna · 2013-08-14T15:31:57Z

Comment author: vbrankov

Please use lea-patch-2.txt instead of lea-patch.txt because it's more general.

To be precise, subtraction is faster because register dependency is broken and it's faster only when the dependency is on the first operand, which my guess happens more often.

vicuna · 2013-08-14T15:51:21Z

Comment author: @xavierleroy

Watch out! In your first patch, the transformations x + y - 1 -> x + (y ^ 1) and x - y + 1 -> x - (y ^ 1) are not correct in general, but only when y is odd. (Consider x = y = 0.) So, you can't apply them blindly in selection.ml. cmmgen.ml might elect to use them when it knows that the arguments are 2n+1-encoded ints, but as we discussed previously it's not always a win depending on the target architecture.

(Besides, you're also using one extra temporary register than the lea code pattern, which can cause more moves or spills. You won't see this negative effect on microbenchmarks.)

Care to explain what your second patch does?

vicuna · 2013-08-14T16:53:08Z

Comment author: vbrankov

I'm sorry, I realized this but relied on detailed scrutiny of code reviewers. Let me take more time and try to explain this patch.

In its simplest form the patch does the following cmm replacement:

(+ (+ x y) const) -> (+ x (+ y const))
(+ (- x y) const) -> (- x (- y const))

Since it just reorders operations, I suppose it's correct. Overflows should not be a problem with addition, I can't tell about subtraction.

The assembly of a two variable addition without and with the patch is roughly:

#1: lea i(a, b), b
#2: mov a, d
add i, d
add d, b

"mov" should not be a problem for the instruction scheduler since it can do 6 instructions per cycle. The instruction decoder can do 4 simple instructions per cycle, so it may also not create additional load. The additional register's lifetime is very short and spilling may happen rarely enough not to outweigh the benefits. A set of benchmarks would definitely help get a better feeling of whether I'm wrong.

The instruction "add i, d" can execute before "b" is computed. Hence, in the benchmark the whole block has throughput of 1. LEA has throughput of 3 because it cannot start executing until all the operands are computed. This is also the explanation for the performance with the subtraction. Still, it may cause a decrease of performance on Nehalem when there's a dependency on "d". I don't have a definitive answer to this, maybe an explicit option be necessary.

vicuna · 2013-08-15T12:46:51Z

Comment author: vbrankov

I thought a little bit about the patch and for LEA it's not a clear win on anything except Sandy Bridge.

The part that deals with subtraction looks better on newer CPUs, but I'll move it to a separate thread when I get to it.

vicuna · 2013-10-28T13:33:28Z

Comment author: vbrankov

Somewhere along this thread there was a doubt whether it's worth working on this because the slow LEA may not persist in Intel CPUs after Sandy Bridge. This document shows that slow LEA exists in both subsequent generations of CPUs, Ivy Bridge and Haswell.

http://www.agner.org/optimize/instruction_tables.pdf

vicuna · 2016-12-08T14:42:08Z

Comment author: @mshinwell

@xLeroy Do you have an opinion on this given that the slow LEA appears to be persisting (according to the reference above, at least)?

vicuna · 2017-02-24T15:13:34Z

Comment author: @damiendoligez

La discussion se termine par une question pour toi.

vicuna · 2017-02-24T17:46:41Z

Comment author: @xavierleroy

I re-did my measurements on more recent Intel processors, see below. 3-argument LEA remains slower. But I'm still not convinced that going from 3-cycle LEA to a 2-cycle larger instruction sequence is a win.

NH SB IB SK
1 1 1 1 addq y, x
1 1 1 1 lea (x, y), x
1 3 3 3 lea -1(x, y), x
2 2 2 2 addq y, x; decq x (simulates lea -1(x, y), x)
2 2 2 2 addq y, x; subq $1, x (ditto)
2 2 2 2 lea (x, y), z; decq z (simulates lea -1(x, y), z)
3 3 2 2 addq y, x; decq x; mov x, z (ditto)

NH = Nehalem Xeon X3430
SB = Sandy Bridge Xeon E5-1620
IB = Ivy Bridge Core i7-3770
SK = Skylake Xeon E3-1270 v5

github-actions · 2020-05-15T04:20:06Z

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

github-actions · 2021-07-21T04:26:11Z

This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc.

vicuna added back-end feature-wish labels Mar 14, 2019

gretay-js mentioned this issue Mar 20, 2019

Do not emit three-operand LEA (amd64) #8531

Closed

github-actions bot added the Stale label May 15, 2020

stedolan removed the Stale label May 20, 2020

github-actions bot added the Stale label Jul 21, 2021

github-actions bot closed this as completed Aug 23, 2021

gasche mentioned this issue Aug 24, 2021

Tagging integers on Intel64 processors #9850

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slow LEA on Sandy Bridge hurts(?) OCaml's integer arithmetic #6125

Slow LEA on Sandy Bridge hurts(?) OCaml's integer arithmetic #6125

vicuna commented Aug 6, 2013

vicuna commented Aug 7, 2013

vicuna commented Aug 12, 2013

vicuna commented Aug 13, 2013

vicuna commented Aug 13, 2013

vicuna commented Aug 14, 2013

vicuna commented Aug 14, 2013

vicuna commented Aug 14, 2013

vicuna commented Aug 14, 2013

vicuna commented Aug 15, 2013

vicuna commented Oct 28, 2013

vicuna commented Dec 8, 2016

vicuna commented Feb 24, 2017

vicuna commented Feb 24, 2017

github-actions bot commented May 15, 2020

github-actions bot commented Jul 21, 2021