New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Integer division by constants #6042
Comments
Comment author: @alainfrisch This is very interesting. Could you please provide a diff, to help review your patches? |
Comment author: pdenys See attached files, for changes (amd64). I added a new Iintop type Iintop_pow2 which uses immediate divisions that only require rdx and rax to compute (mainly power of 2). Iintop_imm is now used for all constant divisions, and the algorithm that is used is in emit.ml, along with the code generation. The algorithm can be found in "Hacker's Deliglht" by Henry S. Warren, Jr. Also, I am unable to change the argument register for Iintop_imm(Idiv|Imod, _) from rdx to rcx (or any other register for that matter). I get the following: Fatal error: exception Invalid_argument("String.create") This would allow me to get rid of an unneeded movq instruction. Preliminary tests of this optimization for both division and mod (by constants that aren't powers of 2) have about a 3.9 times improvement in speed. Also, the changes to division by a power of 2 (3 shifts and an add vs 1 shift 1 test 1 cond mov and 1 add) shows marginal increase of about 1.1 times faster. |
Comment author: @jhjourdan This could be a good thing if the ocamlopt backend could do this optimization. Do you have some practical example (not just a micro benchmark) where this is a performance bottleneck ? However, I have some comments concerning your code:
|
Comment author: pdenys I just added the diff which has everything refactored so that emit.mlp is edited (instead of emit.ml) and the numbers are generated in selectgen.ml and selection.ml and then passed along in two two types that are in mach.ml. A practical example is in converting a time to its individual parts (hours, mins, sec, etc). This function is used in converting times to strings (so is called in any sort of timestamping) let to_parts_64 t = Some other practical places where this optimization helps is in the generation of random numbers. (from random.ml) (* Returns 30 random bits as an integer 0 <= x < 1073741824 ) And converting strings with escape chars to escape sequences. (from string.ml) let escaped s = |
Comment author: @alainfrisch
Did you run some benchmarks on these functions? |
Comment author: @xavierleroy Just for my personal curiosity: which algorithm did you use for this proposed optimization? Granlund & Montgomery (PLDI 1994)? The one from Hacker's Delight? Another? |
Comment author: pdenys Hacker's Delight |
Comment author: pdenys For to_span Test source code: let t = sec (Float.of_string "10000000000") let rec loop x = let () = loop 1_000_000_000 Results:
129,822,104,281 cycles # 3.776 GHz [83.33%]
Performance counter stats for './to_parts_test_old.exe':
333,087,940,246 cycles # 3.777 GHz [83.33%]
r0114 corresponds to Arith.Cylces_div_busy (see http://software.intel.com/sites/products/collateral/hpc/vtune/performance_analysis_guide.pdf) Performance counter stats for './to_parts_test_old.exe':
[14:46:52 pdenys@nyc-tec-int07 ~]$ perf stat -e r0114 ./to_parts_test_new.exe Performance counter stats for './to_parts_test_new.exe':
|
Comment author: pdenys For bits in random.ml Test source code: let () = Results: Performance counter stats for './bits_test_new.exe':
[15:47:08 pdenys@nyc-tec-int07 ~]$ perf stat ./bits_test_old.exe Performance counter stats for './bits_test_old.exe':
r0114 results: Performance counter stats for './bits_test_old.exe':
[15:50:01 pdenys@nyc-tec-int07 ~]$ perf stat -e r0114 ./bits_test_new.exe Performance counter stats for './bits_test_new.exe':
|
Comment author: @xavierleroy Optimization (of integer division and modulus when the divisor is a constant) implemented in SVN trunk, commits 14254-14256. Will be in OCaml 4.02. Currently supported for amd64 and i386 targets. Several simplifications compared with the proposed patch, e.g. negative divisors are not optimized (they never occur in practice), and there is no need to add new operators to the Mach intermediate language. |
Comment author: meurer I added support for ARM (BTW: shouldn't idivimm_parameter be in Emitaux?). Is there any reason we don't do this optimization in Cmmgen? This way we would not have to implement it separately for every backend. We'd just need one additional Mach instruction Imulh for this. |
Comment author: @alainfrisch Fixed the MSVC 32-bit port (commit 14262). |
Comment author: @alainfrisch Re-opening for MSVC 64-bit. |
Comment author: @alainfrisch Fixed the MSVC 64-bit port (commit 14268). |
Original bug ID: 6042
Reporter: pdenys
Status: closed (set by @xavierleroy on 2016-12-07T10:47:03Z)
Resolution: fixed
Priority: normal
Severity: tweak
Version: 4.00.1
Target version: 4.02.0+dev
Fixed in version: 4.02.0+dev
Category: back end (clambda to assembly)
Tags: patch
Related to: #6749
Monitored by: pdenys bnigito meurer @lefessan @diml @jmeber yminsky gerd @alainfrisch
Bug description
Integer division by constants other than positive powers of two is slow and can be sped up by changing from using the idiv instruction to using a imul and then a shift. Also mod falls into the same problem.
Steps to reproduce
Divide by any non positive power of 2 constant and the generated assembly will include an idiv instruction, which is a very slow instruction. To put the speed difference in perspective, on a Nehalem idiv has a latency of 37-100 cycles whereas imul has a latency of 3.
Additional information
I have finished implementing the algorithm which produces the "magic number" which is used to multiply the numerator and also a different number to shift by. I have completed the assembly code generation in the backend, and am currently testing various cases, and also changing mod.
File attachments
The text was updated successfully, but these errors were encountered: