New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Slow LEA on Sandy Bridge hurts(?) OCaml's integer arithmetic #6125
Comments
Comment author: @chambart This may be the case for this version of processors (it applies also to current AMD), but it may be completely different in the future iteration. Notice that gcc and llvm continue to use lea (even with the right -march option) |
Comment author: @xavierleroy Thanks for the info concerning Sandy Bridge. To reduce interference from other instructions generated by ocamlopt, I wrote similar tests directly in assembly. The results are below. N is a Nehalem-generation Xeon and SB a Sandy Bridge Xeon. The timings (brutally rounded) are normalized so that a single "addq" is 1. N SB So, yes, 3-operand lea is definitely slow on Sandy Bridge. Breaking the lea in two instructions is beneficial for SB, but not for Nehalem. Also, this could increase instruction decoding costs -- they don't show up in these tests because they are tight loops. I'm unsure what to do about this. Especially since Intel's next generation might very well restore a fast lea. What about AMD chips? |
Comment author: vbrankov If there are just two operands, you may want to consider this, which takes about one cycle on both Nehalem and Sandy Bridge (and also might be easy to implement in the compiler):
|
Comment author: vbrankov Embarrasing, I used 32bit registers. The speed is the same with 64 bit ones. |
Comment author: vbrankov I've patched the compiler and looks like this solution produces fast integer addition on both Sandy Bridge and Nehalem. As a nice addition, it almost doubles the speed of integer subtraction. 1: 2:
in Nehalem no patch: Nehalem with patch: Sandy Bridge no patch: Sandy Bridge with patch: I've uploaded the proposed patch and the benchmark. |
Comment author: vbrankov Please use lea-patch-2.txt instead of lea-patch.txt because it's more general. To be precise, subtraction is faster because register dependency is broken and it's faster only when the dependency is on the first operand, which my guess happens more often. |
Comment author: @xavierleroy Watch out! In your first patch, the transformations x + y - 1 -> x + (y ^ 1) and x - y + 1 -> x - (y ^ 1) are not correct in general, but only when y is odd. (Consider x = y = 0.) So, you can't apply them blindly in selection.ml. cmmgen.ml might elect to use them when it knows that the arguments are 2n+1-encoded ints, but as we discussed previously it's not always a win depending on the target architecture. (Besides, you're also using one extra temporary register than the lea code pattern, which can cause more moves or spills. You won't see this negative effect on microbenchmarks.) Care to explain what your second patch does? |
Comment author: vbrankov I'm sorry, I realized this but relied on detailed scrutiny of code reviewers. Let me take more time and try to explain this patch. In its simplest form the patch does the following cmm replacement: (+ (+ x y) const) -> (+ x (+ y const)) Since it just reorders operations, I suppose it's correct. Overflows should not be a problem with addition, I can't tell about subtraction. The assembly of a two variable addition without and with the patch is roughly: #1: lea i(a, b), b "mov" should not be a problem for the instruction scheduler since it can do 6 instructions per cycle. The instruction decoder can do 4 simple instructions per cycle, so it may also not create additional load. The additional register's lifetime is very short and spilling may happen rarely enough not to outweigh the benefits. A set of benchmarks would definitely help get a better feeling of whether I'm wrong. The instruction "add i, d" can execute before "b" is computed. Hence, in the benchmark the whole block has throughput of 1. LEA has throughput of 3 because it cannot start executing until all the operands are computed. This is also the explanation for the performance with the subtraction. Still, it may cause a decrease of performance on Nehalem when there's a dependency on "d". I don't have a definitive answer to this, maybe an explicit option be necessary. |
Comment author: vbrankov I thought a little bit about the patch and for LEA it's not a clear win on anything except Sandy Bridge. The part that deals with subtraction looks better on newer CPUs, but I'll move it to a separate thread when I get to it. |
Comment author: vbrankov Somewhere along this thread there was a doubt whether it's worth working on this because the slow LEA may not persist in Intel CPUs after Sandy Bridge. This document shows that slow LEA exists in both subsequent generations of CPUs, Ivy Bridge and Haswell. |
Comment author: @mshinwell @xLeroy Do you have an opinion on this given that the slow LEA appears to be persisting (according to the reference above, at least)? |
Comment author: @damiendoligez La discussion se termine par une question pour toi. |
Comment author: @xavierleroy I re-did my measurements on more recent Intel processors, see below. 3-argument LEA remains slower. But I'm still not convinced that going from 3-cycle LEA to a 2-cycle larger instruction sequence is a win. NH SB IB SK NH = Nehalem Xeon X3430 |
This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc. |
This issue has been open one year with no activity. Consequently, it is being marked with the "stale" label. What this means is that the issue will be automatically closed in 30 days unless more comments are added or the "stale" label is removed. Comments that provide new information on the issue are especially welcome: is it still reproducible? did it appear in other contexts? how critical is it? etc. |
Original bug ID: 6125
Reporter: vbrankov
Status: acknowledged (set by @xavierleroy on 2013-08-12T15:19:39Z)
Resolution: open
Priority: normal
Severity: feature
Version: 4.00.1
Category: back end (clambda to assembly)
Tags: optimization, patch
Monitored by: @gasche @diml @ygrek @yakobowski @xavierleroy
Bug description
From Intel's manuals:
"Some instructions have an increased latency in Intel® microarchitecture code name Sandy Bridge. Some LEA instructions, most notably three-operand LEA instructions, have increased latency and reduced dispatch port choices compared to other LEAs."
That's exactly what OCaml uses for integer arithmetic. Let's benchmark:
for j = 0 to len - 1 do
s := !s + a.(j);
s := !s + a.(j);
s := !s + a.(j);
s := !s + a.(j);
done
for j = 0 to len - 1 do
s := Int64.add !s a.(j);
s := Int64.add !s a.(j);
s := Int64.add !s a.(j);
s := Int64.add !s a.(j);
done
for j = 0 to len - 1 do
s := !s + a.x;
s := !s + a.x;
s := !s + a.x;
s := !s + a.x;
done
for j = 0 to len - 1 do
s := Int64.add !s a.y;
s := Int64.add !s a.y;integer
s := Int64.add !s a.y;
s := Int64.add !s a.y;
done
Nehalem:
vs. Sandy Bridge:
Ups, 63-bit code just got three times slower. 64-bit int got faster.
When we hack OCaml compiler to turn off LEA for integer arithmetic the situation gets better on Sandy Bridge.
Nehalem:
Sandy Bridge:
The only idea to tackle this, apart from switching to 64-bit integers, is to optionally turn off three operand LEA for integer arithmetic or to replace it with two operand LEA + int operation.
File attachments
The text was updated successfully, but these errors were encountered: