Mantis Bug Tracker

View Issue Details Jump to Notes ] Issue History ] Print ]
IDProjectCategoryView StatusDate SubmittedLast Update
0006042OCamlOCaml backend (code generation)public2013-06-18 16:112013-11-05 16:34
Reporterpdenys 
Assigned To 
PrioritynormalSeveritytweakReproducibilityN/A
StatusacknowledgedResolutionfixed 
PlatformOSOS Version
Product Version4.00.1 
Target Version4.02.0+devFixed in Version4.02.0+dev 
Summary0006042: Integer division by constants
DescriptionInteger division by constants other than positive powers of two is slow and can be sped up by changing from using the idiv instruction to using a imul and then a shift. Also mod falls into the same problem.
Steps To ReproduceDivide by any non positive power of 2 constant and the generated assembly will include an idiv instruction, which is a very slow instruction. To put the speed difference in perspective, on a Nehalem idiv has a latency of 37-100 cycles whereas imul has a latency of 3.
Additional InformationI have finished implementing the algorithm which produces the "magic number" which is used to multiply the numerator and also a different number to shift by. I have completed the assembly code generation in the backend, and am currently testing various cases, and also changing mod.
Tagspatch
Attached Files? file icon emit.ml [^] (43,561 bytes) 2013-06-18 22:26 [Show Content]
? file icon mach.ml [^] (3,979 bytes) 2013-06-18 22:26 [Show Content]
? file icon mach.mli [^] (2,983 bytes) 2013-06-18 22:26 [Show Content]
? file icon printmach.ml [^] (7,450 bytes) 2013-06-18 22:26 [Show Content]
? file icon proc.ml [^] (9,685 bytes) 2013-06-18 22:27 [Show Content]
? file icon reload.ml [^] (4,987 bytes) 2013-06-18 22:27 [Show Content]
? file icon selection.ml [^] (8,560 bytes) 2013-06-18 22:27 [Show Content]
? file icon diff1 [^] (30,053 bytes) 2013-06-19 18:52 [Show Content]
? file icon diff2 [^] (30,053 bytes) 2013-06-19 19:32 [Show Content]

- Relationships

-  Notes
(0009550)
frisch (developer)
2013-06-18 22:33

This is very interesting. Could you please provide a diff, to help review your patches?
(0009551)
pdenys (reporter)
2013-06-18 22:35

See attached files, for changes (amd64). I added a new Iintop type Iintop_pow2 which uses immediate divisions that only require rdx and rax to compute (mainly power of 2). Iintop_imm is now used for all constant divisions, and the algorithm that is used is in emit.ml, along with the code generation. The algorithm can be found in "Hacker's Deliglht" by Henry S. Warren, Jr.

Also, I am unable to change the argument register for Iintop_imm(Idiv|Imod, _) from rdx to rcx (or any other register for that matter). I get the following:

Fatal error: exception Invalid_argument("String.create")
Raised at file "buffer.ml", line 27, characters 9-24
Called from file "format.ml", line 916, characters 13-30

This would allow me to get rid of an unneeded movq instruction.

Preliminary tests of this optimization for both division and mod (by constants that aren't powers of 2) have about a 3.9 times improvement in speed. Also, the changes to division by a power of 2 (3 shifts and an add vs 1 shift 1 test 1 cond mov and 1 add) shows marginal increase of about 1.1 times faster.
(0009555)
jacques-henri.jourdan (manager)
2013-06-19 01:46

This could be a good thing if the ocamlopt backend could do this optimization. Do you have some practical example (not just a micro benchmark) where this is a performance bottleneck ?

However, I have some comments concerning your code:
- Could you please provide a file usable with the Unix patch utility (typically generated with diff -u) ?
- emit.ml is a generated file. Please edit emit.mlp instead.
- You have to give some information to the register allocation algorithm to avoid emitting the moves you are speaking about. You have to modify destroyed_at_oper in proc.ml, and pseudoregs_for_operation in selection.ml
- Is there a good reason to perform this optimization while emitting the final code ? I think it should be preferable to do this during instruction selection (in selectgen.ml, typically). This would help implementing it for other architectures. You would have to add to Mach a multiplication operation that preserves the higher order bits.
(0009570)
pdenys (reporter)
2013-06-19 19:16
edited on: 2013-06-19 19:25

I just added the diff which has everything refactored so that emit.mlp is edited (instead of emit.ml) and the numbers are generated in selectgen.ml and selection.ml and then passed along in two two types that are in mach.ml.

A practical example is in converting a time to its individual parts (hours, mins, sec, etc). This function is used in converting times to strings (so is called in any sort of timestamping)

let to_parts_64 t =
        let t = to_float t *. 1.E6 in
        let sign = Float.sign t in
        let t =
          match sign with
          | Float.Sign.Neg -> Float.neg t
          | Float.Sign.Pos | Float.Sign.Zero -> t
        in
        let t = Float.iround_exn ~dir:`Nearest t in
        let sec = t / 1_000_000 in
        let min = sec / 60 in
        let sec = sec mod 60 in
        let hr = min / 60 in
        let min = min mod 60 in
        let us = t mod 1_000_000 in
        let ms = us / 1_000 in
        let us = us mod 1_000 in
        {Parts.
         sign = sign;
         hr = hr;
         min = min;
         sec = sec;
         ms = ms;
         us = us;
        }

Some other practical places where this optimization helps is in the generation of random numbers. (from random.ml)

(* Returns 30 random bits as an integer 0 <= x < 1073741824 *)
  let bits s =
    s.idx <- (s.idx + 1) mod 55;
    let curval = s.st.(s.idx) in
    let newval = s.st.((s.idx + 24) mod 55)
                 + (curval lxor ((curval lsr 25) land 0x1F)) in
    let newval30 = newval land 0x3FFFFFFF in (* PR#5575 *)
    s.st.(s.idx) <- newval30;
    newval30

And converting strings with escape chars to escape sequences. (from string.ml)

let escaped s =
  let n = ref 0 in
    for i = 0 to length s - 1 do
      n := !n +
        (match unsafe_get s i with
         | '"' | '\\' | '\n' | '\t' | '\r' | '\b' -> 2
         | c -> if is_printable c then 1 else 4)
    done;
    if !n = length s then s else begin
      let s' = create !n in
        n := 0;
        for i = 0 to length s - 1 do
          begin
            match unsafe_get s i with
            | ('"' | '\\') as c ->
                unsafe_set s' !n '\\'; incr n; unsafe_set s' !n c
            | '\n' ->
                unsafe_set s' !n '\\'; incr n; unsafe_set s' !n 'n'
            | '\t' ->
                unsafe_set s' !n '\\'; incr n; unsafe_set s' !n 't'
            | '\r' ->
                unsafe_set s' !n '\\'; incr n; unsafe_set s' !n 'r'
            | '\b' ->
                unsafe_set s' !n '\\'; incr n; unsafe_set s' !n 'b'
            | c ->
                if is_printable c then
                  unsafe_set s' !n c
                else begin
                  let a = char_code c in
                  unsafe_set s' !n '\\';
                  incr n;
                  unsafe_set s' !n (char_chr (48 + a / 100));
                  incr n;
                  unsafe_set s' !n (char_chr (48 + (a / 10) mod 10));
                  incr n;
                  unsafe_set s' !n (char_chr (48 + a mod 10))
                end
          end;
          incr n
        done;
        s'
      end

(0009573)
frisch (developer)
2013-06-19 19:45

> A practical example

Did you run some benchmarks on these functions?
(0009574)
xleroy (administrator)
2013-06-19 19:54

Just for my personal curiosity: which algorithm did you use for this proposed optimization? Granlund & Montgomery (PLDI 1994)? The one from Hacker's Delight? Another?
(0009575)
pdenys (reporter)
2013-06-19 19:58

Hacker's Delight
(0009578)
pdenys (reporter)
2013-06-19 20:44
edited on: 2013-06-19 21:06

For to_span

Test source code:
open Core.Std

let t = sec (Float.of_string "10000000000")

let rec loop x =
  match x with
  | 0 -> ()
  | _ ->
    let _y = Time.Span.to_parts t in
    loop (x - 1)

let () = loop 1_000_000_000

Results:
 Performance counter stats for './to_parts_test_new.exe':

      34381.751394 task-clock # 0.993 CPUs utilized
            34,658 context-switches # 0.001 M/sec
                15 CPU-migrations # 0.000 K/sec
             2,795 page-faults # 0.081 K/sec
   129,822,104,281 cycles # 3.776 GHz [83.33%]
    45,499,919,714 stalled-cycles-frontend # 35.05% frontend cycles idle [83.33%]
    10,091,678,100 stalled-cycles-backend # 7.77% backend cycles idle [66.67%]
   264,285,872,193 instructions # 2.04 insns per cycle
                                             # 0.17 stalled cycles per insn [83.33%]
    39,061,458,226 branches # 1136.110 M/sec [83.33%]
           730,374 branch-misses # 0.00% of all branches [83.34%]

      34.626771771 seconds time elapsed

Performance counter stats for './to_parts_test_old.exe':

      88195.275392 task-clock # 0.993 CPUs utilized
            88,845 context-switches # 0.001 M/sec
                15 CPU-migrations # 0.000 K/sec
             2,795 page-faults # 0.032 K/sec
   333,087,940,246 cycles # 3.777 GHz [83.33%]
   139,574,224,776 stalled-cycles-frontend # 41.90% frontend cycles idle [83.33%]
    29,436,913,862 stalled-cycles-backend # 8.84% backend cycles idle [66.67%]
   198,707,058,218 instructions # 0.60 insns per cycle
                                             # 0.70 stalled cycles per insn [83.33%]
    35,133,894,591 branches # 398.365 M/sec [83.33%]
         1,752,742 branch-misses # 0.00% of all branches [83.33%]

      88.800081569 seconds time elapsed

r0114 corresponds to Arith.Cylces_div_busy (see http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf [^])
[14:39:49 pdenys@nyc-tec-int07 ~]$ perf stat -e r0114 ./to_parts_test_old.exe

 Performance counter stats for './to_parts_test_old.exe':

    58,000,870,691 r0114

      88.652933165 seconds time elapsed

[14:46:52 pdenys@nyc-tec-int07 ~]$ perf stat -e r0114 ./to_parts_test_new.exe

 Performance counter stats for './to_parts_test_new.exe':

           136,457 r0114

      35.676162458 seconds time elapsed

(0009579)
pdenys (reporter)
2013-06-19 21:50

For bits in random.ml

Test source code:
let rec loop n =
  match n with
  | 0 -> ()
  | _ ->
    let _x = Random.bits () in
    loop (n - 1)

let () =
  Random.self_init ();
  loop 1_000_000_000

Results:
[14:54:15 pdenys@nyc-tec-int07 ~]$ perf stat ./bits_test_new.exe

 Performance counter stats for './bits_test_new.exe':

       8255.203034 task-clock # 0.994 CPUs utilized
             8,313 context-switches # 0.001 M/sec
                16 CPU-migrations # 0.002 K/sec
               237 page-faults # 0.029 K/sec
    31,137,786,184 cycles # 3.772 GHz [83.34%]
     9,342,118,830 stalled-cycles-frontend # 30.00% frontend cycles idle [83.34%]
       503,248,875 stalled-cycles-backend # 1.62% backend cycles idle [66.67%]
    77,049,643,886 instructions # 2.47 insns per cycle
                                             # 0.12 stalled cycles per insn [83.34%]
     7,011,666,780 branches # 849.363 M/sec [83.33%]
           147,939 branch-misses # 0.00% of all branches [83.32%]

       8.308312351 seconds time elapsed

[15:47:08 pdenys@nyc-tec-int07 ~]$ perf stat ./bits_test_old.exe

 Performance counter stats for './bits_test_old.exe':

      24636.132929 task-clock # 0.994 CPUs utilized
            24,814 context-switches # 0.001 M/sec
                11 CPU-migrations # 0.000 K/sec
               237 page-faults # 0.010 K/sec
    93,009,844,555 cycles # 3.775 GHz [83.33%]
    48,788,733,066 stalled-cycles-frontend # 52.46% frontend cycles idle [83.34%]
     9,853,964,618 stalled-cycles-backend # 10.59% backend cycles idle [66.67%]
    54,179,222,112 instructions # 0.58 insns per cycle
                                             # 0.90 stalled cycles per insn [83.33%]
     7,033,806,794 branches # 285.508 M/sec [83.33%]
           391,327 branch-misses # 0.01% of all branches [83.33%]

      24.789352439 seconds time elapsed

r0114 results:
[15:49:27 pdenys@nyc-tec-int07 ~]$ perf stat -e r0114 ./bits_test_old.exe

 Performance counter stats for './bits_test_old.exe':

     9,872,765,350 r0114

      24.613076521 seconds time elapsed

[15:50:01 pdenys@nyc-tec-int07 ~]$ perf stat -e r0114 ./bits_test_new.exe

 Performance counter stats for './bits_test_new.exe':

             7,023 r0114

       8.252859398 seconds time elapsed
(0010558)
xleroy (administrator)
2013-11-01 18:54

Optimization (of integer division and modulus when the divisor is a constant) implemented in SVN trunk, commits 14254-14256. Will be in OCaml 4.02. Currently supported for amd64 and i386 targets. Several simplifications compared with the proposed patch, e.g. negative divisors are not optimized (they never occur in practice), and there is no need to add new operators to the Mach intermediate language.
(0010561)
meurer (developer)
2013-11-02 16:24

I added support for ARM (BTW: shouldn't idivimm_parameter be in Emitaux?).

Is there any reason we don't do this optimization in Cmmgen? This way we would not have to implement it separately for every backend. We'd just need one additional Mach instruction Imulh for this.
(0010580)
frisch (developer)
2013-11-04 15:30

Fixed the MSVC 32-bit port (commit 14262).
(0010584)
frisch (developer)
2013-11-04 19:54

Re-opening for MSVC 64-bit.
(0010592)
frisch (developer)
2013-11-05 16:34

Fixed the MSVC 64-bit port (commit 14268).

- Issue History
Date Modified Username Field Change
2013-06-18 16:11 pdenys New Issue
2013-06-18 22:26 pdenys File Added: emit.ml
2013-06-18 22:26 pdenys File Added: mach.ml
2013-06-18 22:26 pdenys File Added: mach.mli
2013-06-18 22:26 pdenys File Added: printmach.ml
2013-06-18 22:27 pdenys File Added: proc.ml
2013-06-18 22:27 pdenys File Added: reload.ml
2013-06-18 22:27 pdenys File Added: selection.ml
2013-06-18 22:33 frisch Note Added: 0009550
2013-06-18 22:35 pdenys Note Added: 0009551
2013-06-18 22:38 pdenys Note Added: 0009552
2013-06-18 22:38 pdenys Note Deleted: 0009552
2013-06-18 22:39 pdenys Note Added: 0009553
2013-06-18 22:40 pdenys Note Added: 0009554
2013-06-18 22:40 pdenys Note Edited: 0009554 View Revisions
2013-06-18 22:40 pdenys Note Edited: 0009553 View Revisions
2013-06-19 01:46 jacques-henri.jourdan Note Added: 0009555
2013-06-19 11:54 doligez Status new => feedback
2013-06-19 11:54 doligez Target Version => 4.02.0+dev
2013-06-19 15:22 pdenys Note View State: 0009553: private
2013-06-19 15:22 pdenys Note View State: 0009554: private
2013-06-19 15:23 pdenys Note Deleted: 0009554
2013-06-19 15:23 pdenys Note Deleted: 0009553
2013-06-19 15:23 pdenys Note Added: 0009564
2013-06-19 15:23 pdenys Status feedback => new
2013-06-19 15:23 pdenys Note View State: 0009564: private
2013-06-19 15:46 frisch View Status private => public
2013-06-19 15:47 pdenys Note Deleted: 0009564
2013-06-19 18:52 pdenys File Added: diff1
2013-06-19 19:16 pdenys Note Added: 0009570
2013-06-19 19:17 pdenys Note Edited: 0009570 View Revisions
2013-06-19 19:25 pdenys Note Edited: 0009570 View Revisions
2013-06-19 19:32 pdenys File Added: diff2
2013-06-19 19:45 frisch Note Added: 0009573
2013-06-19 19:54 xleroy Note Added: 0009574
2013-06-19 19:58 pdenys Note Added: 0009575
2013-06-19 20:44 pdenys Note Added: 0009578
2013-06-19 20:46 pdenys Note Edited: 0009578 View Revisions
2013-06-19 21:06 pdenys Note Edited: 0009578 View Revisions
2013-06-19 21:50 pdenys Note Added: 0009579
2013-07-02 14:08 doligez Status new => confirmed
2013-07-12 18:15 doligez Target Version 4.02.0+dev => 4.01.1+dev
2013-10-10 11:21 xleroy Target Version 4.01.1+dev => 4.02.0+dev
2013-10-10 11:31 doligez Tag Attached: patch
2013-11-01 18:54 xleroy Note Added: 0010558
2013-11-01 18:54 xleroy Status confirmed => resolved
2013-11-01 18:54 xleroy Resolution open => fixed
2013-11-01 18:54 xleroy Fixed in Version => 4.02.0+dev
2013-11-02 16:24 meurer Note Added: 0010561
2013-11-04 15:30 frisch Note Added: 0010580
2013-11-04 19:54 frisch Note Added: 0010584
2013-11-04 19:54 frisch Status resolved => acknowledged
2013-11-05 16:34 frisch Note Added: 0010592


Copyright © 2000 - 2011 MantisBT Group
Powered by Mantis Bugtracker