Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integer division by constants #6042

Closed
vicuna opened this issue Jun 18, 2013 · 14 comments
Closed

Integer division by constants #6042

vicuna opened this issue Jun 18, 2013 · 14 comments

Comments

@vicuna
Copy link

vicuna commented Jun 18, 2013

Original bug ID: 6042
Reporter: pdenys
Status: closed (set by @xavierleroy on 2016-12-07T10:47:03Z)
Resolution: fixed
Priority: normal
Severity: tweak
Version: 4.00.1
Target version: 4.02.0+dev
Fixed in version: 4.02.0+dev
Category: back end (clambda to assembly)
Tags: patch
Related to: #6749
Monitored by: pdenys bnigito meurer @lefessan @diml @jmeber yminsky gerd @alainfrisch

Bug description

Integer division by constants other than positive powers of two is slow and can be sped up by changing from using the idiv instruction to using a imul and then a shift. Also mod falls into the same problem.

Steps to reproduce

Divide by any non positive power of 2 constant and the generated assembly will include an idiv instruction, which is a very slow instruction. To put the speed difference in perspective, on a Nehalem idiv has a latency of 37-100 cycles whereas imul has a latency of 3.

Additional information

I have finished implementing the algorithm which produces the "magic number" which is used to multiply the numerator and also a different number to shift by. I have completed the assembly code generation in the backend, and am currently testing various cases, and also changing mod.

File attachments

@vicuna
Copy link
Author

vicuna commented Jun 18, 2013

Comment author: @alainfrisch

This is very interesting. Could you please provide a diff, to help review your patches?

@vicuna
Copy link
Author

vicuna commented Jun 18, 2013

Comment author: pdenys

See attached files, for changes (amd64). I added a new Iintop type Iintop_pow2 which uses immediate divisions that only require rdx and rax to compute (mainly power of 2). Iintop_imm is now used for all constant divisions, and the algorithm that is used is in emit.ml, along with the code generation. The algorithm can be found in "Hacker's Deliglht" by Henry S. Warren, Jr.

Also, I am unable to change the argument register for Iintop_imm(Idiv|Imod, _) from rdx to rcx (or any other register for that matter). I get the following:

Fatal error: exception Invalid_argument("String.create")
Raised at file "buffer.ml", line 27, characters 9-24
Called from file "format.ml", line 916, characters 13-30

This would allow me to get rid of an unneeded movq instruction.

Preliminary tests of this optimization for both division and mod (by constants that aren't powers of 2) have about a 3.9 times improvement in speed. Also, the changes to division by a power of 2 (3 shifts and an add vs 1 shift 1 test 1 cond mov and 1 add) shows marginal increase of about 1.1 times faster.

@vicuna
Copy link
Author

vicuna commented Jun 18, 2013

Comment author: @jhjourdan

This could be a good thing if the ocamlopt backend could do this optimization. Do you have some practical example (not just a micro benchmark) where this is a performance bottleneck ?

However, I have some comments concerning your code:

  • Could you please provide a file usable with the Unix patch utility (typically generated with diff -u) ?
  • emit.ml is a generated file. Please edit emit.mlp instead.
  • You have to give some information to the register allocation algorithm to avoid emitting the moves you are speaking about. You have to modify destroyed_at_oper in proc.ml, and pseudoregs_for_operation in selection.ml
  • Is there a good reason to perform this optimization while emitting the final code ? I think it should be preferable to do this during instruction selection (in selectgen.ml, typically). This would help implementing it for other architectures. You would have to add to Mach a multiplication operation that preserves the higher order bits.

@vicuna
Copy link
Author

vicuna commented Jun 19, 2013

Comment author: pdenys

I just added the diff which has everything refactored so that emit.mlp is edited (instead of emit.ml) and the numbers are generated in selectgen.ml and selection.ml and then passed along in two two types that are in mach.ml.

A practical example is in converting a time to its individual parts (hours, mins, sec, etc). This function is used in converting times to strings (so is called in any sort of timestamping)

let to_parts_64 t =
let t = to_float t *. 1.E6 in
let sign = Float.sign t in
let t =
match sign with
| Float.Sign.Neg -> Float.neg t
| Float.Sign.Pos | Float.Sign.Zero -> t
in
let t = Float.iround_exn ~dir:`Nearest t in
let sec = t / 1_000_000 in
let min = sec / 60 in
let sec = sec mod 60 in
let hr = min / 60 in
let min = min mod 60 in
let us = t mod 1_000_000 in
let ms = us / 1_000 in
let us = us mod 1_000 in
{Parts.
sign = sign;
hr = hr;
min = min;
sec = sec;
ms = ms;
us = us;
}

Some other practical places where this optimization helps is in the generation of random numbers. (from random.ml)

(* Returns 30 random bits as an integer 0 <= x < 1073741824 )
let bits s =
s.idx <- (s.idx + 1) mod 55;
let curval = s.st.(s.idx) in
let newval = s.st.((s.idx + 24) mod 55)
+ (curval lxor ((curval lsr 25) land 0x1F)) in
let newval30 = newval land 0x3FFFFFFF in (
#5575 *)
s.st.(s.idx) <- newval30;
newval30

And converting strings with escape chars to escape sequences. (from string.ml)

let escaped s =
let n = ref 0 in
for i = 0 to length s - 1 do
n := !n +
(match unsafe_get s i with
| '"' | '\' | '\n' | '\t' | '\r' | '\b' -> 2
| c -> if is_printable c then 1 else 4)
done;
if !n = length s then s else begin
let s' = create !n in
n := 0;
for i = 0 to length s - 1 do
begin
match unsafe_get s i with
| ('"' | '\') as c ->
unsafe_set s' !n '\'; incr n; unsafe_set s' !n c
| '\n' ->
unsafe_set s' !n '\'; incr n; unsafe_set s' !n 'n'
| '\t' ->
unsafe_set s' !n '\'; incr n; unsafe_set s' !n 't'
| '\r' ->
unsafe_set s' !n '\'; incr n; unsafe_set s' !n 'r'
| '\b' ->
unsafe_set s' !n '\'; incr n; unsafe_set s' !n 'b'
| c ->
if is_printable c then
unsafe_set s' !n c
else begin
let a = char_code c in
unsafe_set s' !n '\';
incr n;
unsafe_set s' !n (char_chr (48 + a / 100));
incr n;
unsafe_set s' !n (char_chr (48 + (a / 10) mod 10));
incr n;
unsafe_set s' !n (char_chr (48 + a mod 10))
end
end;
incr n
done;
s'
end

@vicuna
Copy link
Author

vicuna commented Jun 19, 2013

Comment author: @alainfrisch

A practical example

Did you run some benchmarks on these functions?

@vicuna
Copy link
Author

vicuna commented Jun 19, 2013

Comment author: @xavierleroy

Just for my personal curiosity: which algorithm did you use for this proposed optimization? Granlund & Montgomery (PLDI 1994)? The one from Hacker's Delight? Another?

@vicuna
Copy link
Author

vicuna commented Jun 19, 2013

Comment author: pdenys

Hacker's Delight

@vicuna
Copy link
Author

vicuna commented Jun 19, 2013

Comment author: pdenys

For to_span

Test source code:
open Core.Std

let t = sec (Float.of_string "10000000000")

let rec loop x =
match x with
| 0 -> ()
| _ ->
let _y = Time.Span.to_parts t in
loop (x - 1)

let () = loop 1_000_000_000

Results:
Performance counter stats for './to_parts_test_new.exe':

  34381.751394 task-clock                #    0.993 CPUs utilized          
        34,658 context-switches          #    0.001 M/sec                  
            15 CPU-migrations            #    0.000 K/sec                  
         2,795 page-faults               #    0.081 K/sec                  

129,822,104,281 cycles # 3.776 GHz [83.33%]
45,499,919,714 stalled-cycles-frontend # 35.05% frontend cycles idle [83.33%]
10,091,678,100 stalled-cycles-backend # 7.77% backend cycles idle [66.67%]
264,285,872,193 instructions # 2.04 insns per cycle
# 0.17 stalled cycles per insn [83.33%]
39,061,458,226 branches # 1136.110 M/sec [83.33%]
730,374 branch-misses # 0.00% of all branches [83.34%]

  34.626771771 seconds time elapsed

Performance counter stats for './to_parts_test_old.exe':

  88195.275392 task-clock                #    0.993 CPUs utilized          
        88,845 context-switches          #    0.001 M/sec                  
            15 CPU-migrations            #    0.000 K/sec                  
         2,795 page-faults               #    0.032 K/sec                  

333,087,940,246 cycles # 3.777 GHz [83.33%]
139,574,224,776 stalled-cycles-frontend # 41.90% frontend cycles idle [83.33%]
29,436,913,862 stalled-cycles-backend # 8.84% backend cycles idle [66.67%]
198,707,058,218 instructions # 0.60 insns per cycle
# 0.70 stalled cycles per insn [83.33%]
35,133,894,591 branches # 398.365 M/sec [83.33%]
1,752,742 branch-misses # 0.00% of all branches [83.33%]

  88.800081569 seconds time elapsed

r0114 corresponds to Arith.Cylces_div_busy (see http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf)
[14:39:49 pdenys@nyc-tec-int07 ~]$ perf stat -e r0114 ./to_parts_test_old.exe

Performance counter stats for './to_parts_test_old.exe':

58,000,870,691 r0114                                                       

  88.652933165 seconds time elapsed

[14:46:52 pdenys@nyc-tec-int07 ~]$ perf stat -e r0114 ./to_parts_test_new.exe

Performance counter stats for './to_parts_test_new.exe':

       136,457 r0114                                                       

  35.676162458 seconds time elapsed

@vicuna
Copy link
Author

vicuna commented Jun 19, 2013

Comment author: pdenys

For bits in random.ml

Test source code:
let rec loop n =
match n with
| 0 -> ()
| _ ->
let _x = Random.bits () in
loop (n - 1)

let () =
Random.self_init ();
loop 1_000_000_000

Results:
[14:54:15 pdenys@nyc-tec-int07 ~]$ perf stat ./bits_test_new.exe

Performance counter stats for './bits_test_new.exe':

   8255.203034 task-clock                #    0.994 CPUs utilized          
         8,313 context-switches          #    0.001 M/sec                  
            16 CPU-migrations            #    0.002 K/sec                  
           237 page-faults               #    0.029 K/sec                  
31,137,786,184 cycles                    #    3.772 GHz                     [83.34%]
 9,342,118,830 stalled-cycles-frontend   #   30.00% frontend cycles idle    [83.34%]
   503,248,875 stalled-cycles-backend    #    1.62% backend  cycles idle    [66.67%]
77,049,643,886 instructions              #    2.47  insns per cycle        
                                         #    0.12  stalled cycles per insn [83.34%]
 7,011,666,780 branches                  #  849.363 M/sec                   [83.33%]
       147,939 branch-misses             #    0.00% of all branches         [83.32%]

   8.308312351 seconds time elapsed

[15:47:08 pdenys@nyc-tec-int07 ~]$ perf stat ./bits_test_old.exe

Performance counter stats for './bits_test_old.exe':

  24636.132929 task-clock                #    0.994 CPUs utilized          
        24,814 context-switches          #    0.001 M/sec                  
            11 CPU-migrations            #    0.000 K/sec                  
           237 page-faults               #    0.010 K/sec                  
93,009,844,555 cycles                    #    3.775 GHz                     [83.33%]
48,788,733,066 stalled-cycles-frontend   #   52.46% frontend cycles idle    [83.34%]
 9,853,964,618 stalled-cycles-backend    #   10.59% backend  cycles idle    [66.67%]
54,179,222,112 instructions              #    0.58  insns per cycle        
                                         #    0.90  stalled cycles per insn [83.33%]
 7,033,806,794 branches                  #  285.508 M/sec                   [83.33%]
       391,327 branch-misses             #    0.01% of all branches         [83.33%]

  24.789352439 seconds time elapsed

r0114 results:
[15:49:27 pdenys@nyc-tec-int07 ~]$ perf stat -e r0114 ./bits_test_old.exe

Performance counter stats for './bits_test_old.exe':

 9,872,765,350 r0114                                                       

  24.613076521 seconds time elapsed

[15:50:01 pdenys@nyc-tec-int07 ~]$ perf stat -e r0114 ./bits_test_new.exe

Performance counter stats for './bits_test_new.exe':

         7,023 r0114                                                       

   8.252859398 seconds time elapsed

@vicuna
Copy link
Author

vicuna commented Nov 1, 2013

Comment author: @xavierleroy

Optimization (of integer division and modulus when the divisor is a constant) implemented in SVN trunk, commits 14254-14256. Will be in OCaml 4.02. Currently supported for amd64 and i386 targets. Several simplifications compared with the proposed patch, e.g. negative divisors are not optimized (they never occur in practice), and there is no need to add new operators to the Mach intermediate language.

@vicuna
Copy link
Author

vicuna commented Nov 2, 2013

Comment author: meurer

I added support for ARM (BTW: shouldn't idivimm_parameter be in Emitaux?).

Is there any reason we don't do this optimization in Cmmgen? This way we would not have to implement it separately for every backend. We'd just need one additional Mach instruction Imulh for this.

@vicuna
Copy link
Author

vicuna commented Nov 4, 2013

Comment author: @alainfrisch

Fixed the MSVC 32-bit port (commit 14262).

@vicuna
Copy link
Author

vicuna commented Nov 4, 2013

Comment author: @alainfrisch

Re-opening for MSVC 64-bit.

@vicuna
Copy link
Author

vicuna commented Nov 5, 2013

Comment author: @alainfrisch

Fixed the MSVC 64-bit port (commit 14268).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant