For 64-bit targets, array bound checking assembly code could be more compact #5360

vicuna · 2011-09-25T12:56:53Z

Original bug ID: 5360
Reporter: Pascal Cuoq
Status: closed (set by @xavierleroy on 2012-09-25T18:07:21Z)
Resolution: fixed
Priority: normal
Severity: tweak
Version: 3.12.1
Category: ~DO NOT USE (was: OCaml general)
Related to: #5345
Monitored by: @protz mehdi @ygrek

Bug description

Consider the following module:

type t

let f (a: t array) =
a.(7)

Here is a diff between the code generated by OCaml 3.12.1 and the code that could be expected for bounds checking, when values have the same size as double-precision floating-point numbers :

--- t.orig.s 2011-09-25 14:43:16.000000000 +0200
+++ t.s 2011-09-25 14:44:05.000000000 +0200
@@ -28,20 +28,17 @@
movq %rax, %rdi
movq -8(%rdi), %rax
movq %rax, %rbx

andq $255, %rbx
cmpq $254, %rbx
je .L100
shrq $9, %rax
cmpq $15, %rax

shrq $9, %rbx
cmpq $15, %rbx
jbe .L102
andq $255, %rax
cmpq $254, %rax
je .L100
movq 56(%rdi), %rax
addq $8, %rsp
ret
.align 4
.L100:

shrq $9, %rax
cmpq $15, %rax
jbe .L102
.L103: subq $16, %r15
movq caml_young_limit@GOTPCREL(%rip), %rax
cmpq (%rax), %r15

The minimally tested attached patch implements the new version.

File attachments

vicuna · 2011-12-21T10:24:12Z

Comment author: @gasche

Mandatory hard-facts question -- I hope xleroy didn't patent it: do you have an example of real application (Frama-C?) where this induced a noticeable performance improvement?

vicuna · 2011-12-22T00:30:21Z

Comment author: pascal_cuoq

This is not a performance optimization, it is a code size optimization. In realistic function f above, the same three-instruction sequence appears at the beginning of each of two branches, making the code of this function three instructions longer than necessary. This is stupid and can be fixed with the same "if wordsize_shift = numfloat_shift" test that's already used elsewhere in cmmgen.ml.

I have no doubt that this is entirely absorbed by other inefficiencies in the compiler, if that's what you are getting at. The only hard facts I have at hand are that the Frama-C core (excluding most functionality) compiles to a 10MB binary and that compiling with -compact makes it smaller and faster, hinting that code bloat does influence performance.

vicuna · 2011-12-22T08:53:14Z

Comment author: @gasche

I compiled frama-C without -compact on my trunk, and here are the size of toplevel.opt:

10476K (10697637) without any patch
10472K (10691109) with your patch (reduces Parrayrefs)
10468K (10689669) with the patch I just uploaded, that reduces both Parrayrefs and Parraysets (I haven't checked it carefully though)

The difference is rather small.

That said, in the process I discovered that this optimization is already performed in the Parraylength case, so applying it in the Parrayref and Parrayset cases does not seem preposterous.

PS: Oh, and Frama-C doesn't compile as is against trunk, because you have some Datatype.Hashtbl functor with input signature Hashtbl.S, to which you pass your Inthash module, but the signature of stdlib's Hashtbl changed in trunk (a new "stats" function is available). It is not future-proof to depend contravariantly on a stdlib interface.

vicuna · 2011-12-22T09:20:27Z

Comment author: @xavierleroy

Patch #2 applied in SVN trunk (commit 11932).

I agreee more compact code is better, and optimizations specific to 64-bit targets are welcome since these targets are getting dominant.

For Parraylength, the 64-bit case saves one run-time test, while for Parrayrefs and Parraysets we have the same number of tests executed (2), but produce one fewer. So, the benefits will be much smaller.

vicuna · 2011-12-22T17:13:04Z

Comment author: pascal_cuoq

Performance:

All together, the patches here and from entry #5345 (including the assembly-level patch 0002 that was rejected) applied to OCaml 3.12.1 appear to make the sha1.c benchmark from http://blog.frama-c.com/index.php?post/2011/08/29/CompCert-gets-a-safe-interpreter-mode 1.5% faster (same protocol as in the blog post, sha1.c modified to run the same computations 5 times). This is pretty cool even if it does not nearly bridge the gap with CompCert's interpreter.

Other Frama-C use cases would use a smaller proportion of array operations, but yet other use cases would use more. I think the sha1.c benchmark is average in this respect.

vicuna closed this as completed Sep 25, 2012

vicuna mentioned this issue Sep 25, 2012

Optimize bounds checking #5345

Closed

vicuna added the bug label Mar 20, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

For 64-bit targets, array bound checking assembly code could be more compact #5360

For 64-bit targets, array bound checking assembly code could be more compact #5360

vicuna commented Sep 25, 2011

vicuna commented Dec 21, 2011

vicuna commented Dec 22, 2011

vicuna commented Dec 22, 2011

vicuna commented Dec 22, 2011

vicuna commented Dec 22, 2011

For 64-bit targets, array bound checking assembly code could be more compact #5360

For 64-bit targets, array bound checking assembly code could be more compact #5360

Comments

vicuna commented Sep 25, 2011

Bug description

File attachments

vicuna commented Dec 21, 2011

vicuna commented Dec 22, 2011

vicuna commented Dec 22, 2011

vicuna commented Dec 22, 2011

vicuna commented Dec 22, 2011