testers wanted for experimental SSE2 back-end
Date: -- (:)
From: Dmitry Bely <dmitry.bely@g...>
Subject: Re: [Caml-list] testers wanted for experimental SSE2 back-end
On Tue, Mar 9, 2010 at 7:33 PM, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:
> Hello list,
>
> This is a call for testers concerning an experimental OCaml compiler
> back-end that uses SSE2 instructions for floating-point arithmetic.
> This code generation strategy was discussed before on this list, and I
> include below a summary in Q&A style.
>
> The new back-end is being considered for inclusion in the next major
> release (3.12), but performance testing done so far at INRIA and by
> Caml Consortium members is not conclusive.  Additional results
> from members of this list would therefore be very welcome.
>
> We're not terribly interested in small (< 50 LOC), Shootout-style
> benchmarks, since their performance is very sensitive to code and data
> placement.  However, if some of you have a sizeable (> 500 LOC) body
> of float-intensive Caml code, we'd be very interested to hear about
> the compared speed of the SSE2 back-end and the old back-end on your
> code.

I cannot provide any benchmarks yet, but even leaving aside the better
register organization, there are at least two areas where SSE2 can
significantly outperform x87.

1. Float to integer conversion
This is quite inefficient on x87 because you have to explicitly set and
restore the rounding mode. A typical definition such as

let round x = truncate (x +. 0.5)

translates to

_camlT__round_58:
	sub	esp, 8
L100:
	fld	L101
	fadd	REAL8 PTR [eax]
	sub	esp, 8
	fnstcw	[esp+4]
	mov	ax, [esp+4]
	mov	ah, 12
	mov	[esp], ax
	fldcw	[esp]
	fistp	DWORD PTR [esp]
	mov	eax, [esp]
	fldcw	[esp+4]
	add	esp, 8
	lea	eax, DWORD PTR [eax+eax+1]
	add	esp, 8
	ret

but just to

_camlT__round_58:
L100:
	movlpd	xmm0, L101
	addsd	xmm0, REAL8 PTR [eax]
	cvttsd2si	eax, xmm0
	lea	eax, DWORD PTR [eax+eax+1]
	ret

with SSE2.
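
The reason the x87 version has to touch the control word is that OCaml's
truncate rounds toward zero, whereas the x87 default rounding mode is
round-to-nearest; cvttsd2si always truncates, so no mode switch is needed.
A minimal sketch of what round computes (the sample values are mine, chosen
only for illustration):

```ocaml
(* Same definition as in the message above. *)
let round x = truncate (x +. 0.5)

let () =
  (* truncate drops the fractional part, i.e. rounds toward zero. *)
  assert (truncate 2.9 = 2);
  assert (truncate (-2.9) = -2);
  (* For non-negative inputs this gives round-half-up. *)
  assert (round 2.4 = 2);
  assert (round 2.5 = 3);
  (* For negative inputs it does not round to nearest:
     -2.7 +. 0.5 is about -2.2, which truncates to -2. *)
  assert (round (-2.7) = -2)
```

Both assembly listings implement exactly this semantics; only the cost of
getting a truncating conversion differs.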

2. Float compare
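The x87 compare used here (fcompp) does not set the CPU flags, so the
status word has to be fetched with fnstsw and tested by hand. Thus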

let fmin (x:float) y = if x < y then x else y

ends up with

_camlT__fmin_58:
	sub	esp, 8
L101:
	mov	ecx, eax
	fld	REAL8 PTR [ebx]
	fld	REAL8 PTR [ecx]
	fcompp
	fnstsw	ax
	and	ah, 69
	cmp	ah, 1
	jne	L100
	mov	eax, ecx
	add	esp, 8
	ret
L100:
	mov	eax, ebx
	add	esp, 8
	ret

while on SSE2 you just have

_camlT__fmin_58:
L101:
	movlpd	xmm1, REAL8 PTR [ebx]
	movlpd	xmm0, REAL8 PTR [eax]
	comisd	xmm1, xmm0
	jbe	L100
	ret
L100:
	mov	eax, ebx
	ret
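
For reference, a small sketch of the source-level behaviour both listings
must preserve, including the NaN case (the test values are mine). Any
comparison involving nan is false, so the else branch returns y; in the
SSE2 listing this corresponds to jbe also being taken when comisd reports
an unordered result.

```ocaml
(* Same definition as in the message above. *)
let fmin (x : float) y = if x < y then x else y

let () =
  assert (fmin 1.0 2.0 = 1.0);
  assert (fmin 2.0 1.0 = 1.0);
  (* nan < 1.0 is false, so y is returned. *)
  assert (fmin nan 1.0 = 1.0);
  (* 1.0 < nan is also false, so y (nan) is returned. *)
  assert (classify_float (fmin 1.0 nan) = FP_nan)
```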

As for the SSE2 back-end presented here, I have some thoughts regarding
the code (implementing the fast math functions via x87 is questionable,
floating-point compares could be optimized, etc.). Where should these be
discussed: here on the list, or is there an entry in Mantis?

- Dmitry Bely