testers wanted for experimental SSE2 back-end
From: Dmitry Bely <dmitry.bely@g...>
Subject: Re: [Caml-list] testers wanted for experimental SSE2 back-end
On Tue, Mar 9, 2010 at 7:33 PM, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:
> Hello list,
>
> This is a call for testers concerning an experimental OCaml compiler
> back-end that uses SSE2 instructions for floating-point arithmetic.
> This code generation strategy was discussed before on this list, and I
> include below a summary in Q&A style.
>
> The new back-end is being considered for inclusion in the next major
> release (3.12), but performance testing done so far at INRIA and by
> Caml Consortium members is not conclusive. Additional results
> from members of this list would therefore be very welcome.
>
> We're not terribly interested in small (< 50 LOC), Shootout-style
> benchmarks, since their performance is very sensitive to code and data
> placement. However, if some of you have a sizeable (> 500 LOC) body
> of float-intensive Caml code, we'd be very interested to hear about
> the compared speed of the SSE2 back-end and the old back-end on your
> code.

I cannot provide any benchmarks yet, but even without taking the better
register organization into account, there are at least two areas where
SSE2 can significantly outperform x87.

1. Float-to-integer conversion

This is quite inefficient on x87 because the FPU rounding mode has to be
explicitly saved, switched to truncation, and restored around every
conversion. The typical

    let round x = truncate (x +. 0.5)

translates to

    _camlT__round_58:
            sub     esp, 8
    L100:
            fld     L101
            fadd    REAL8 PTR [eax]
            sub     esp, 8
            fnstcw  [esp+4]
            mov     ax, [esp+4]
            mov     ah, 12
            mov     [esp], ax
            fldcw   [esp]
            fistp   DWORD PTR [esp]
            mov     eax, [esp]
            fldcw   [esp+4]
            add     esp, 8
            lea     eax, DWORD PTR [eax+eax+1]
            add     esp, 8
            ret

but with SSE2 it becomes just

    _camlT__round_58:
    L100:
            movlpd  xmm0, L101
            addsd   xmm0, REAL8 PTR [eax]
            cvttsd2si eax, xmm0
            lea     eax, DWORD PTR [eax+eax+1]
            ret

2. Float comparison

The x87 comparison emitted here (fcompp) does not set the CPU flags, so
the FPU status word has to be copied to ax and tested by hand. As a
result,

    let fmin (x:float) y = if x < y then x else y

ends up as

    _camlT__fmin_58:
            sub     esp, 8
    L101:
            mov     ecx, eax
            fld     REAL8 PTR [ebx]
            fld     REAL8 PTR [ecx]
            fcompp
            fnstsw  ax
            and     ah, 69
            cmp     ah, 1
            jne     L100
            mov     eax, ecx
            add     esp, 8
            ret
    L100:
            mov     eax, ebx
            add     esp, 8
            ret

whereas on SSE2 you just have

    _camlT__fmin_58:
    L101:
            movlpd  xmm1, REAL8 PTR [ebx]
            movlpd  xmm0, REAL8 PTR [eax]
            comisd  xmm1, xmm0
            jbe     L100
            ret
    L100:
            mov     eax, ebx
            ret

As for the SSE2 back-end presented, I have some thoughts regarding the
code (fast math functions via x87 are questionable, optimization of
floating-point comparisons, etc.). Where should I discuss that: here, or
is there an entry in Mantis for it?

- Dmitry Bely
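For readers who want to check the source-level behaviour that both code sequences must preserve, here is a small hypothetical demo (not part of the original message) built from the two functions quoted above. It shows that OCaml's `truncate` rounds toward zero, which is why the x87 code has to switch the rounding mode while `cvttsd2si` truncates unconditionally; note that, as a consequence, `truncate (x +. 0.5)` does not round to nearest for negative inputs. It also shows that `x < y` is false when either operand is NaN, so `fmin` returns `y` in that case; the SSE2 `jbe` branch (taken on "below, equal, or unordered") returns `y` as well.

```ocaml
(* Hypothetical demo, not from the original message. *)

(* Same definition as in the message. [truncate] rounds toward zero,
   which is the semantics the x87 code obtains by temporarily setting
   the rounding-control bits; SSE2's cvttsd2si truncates directly. *)
let round x = truncate (x +. 0.5)

(* Same definition as in the message; the [float] annotation lets the
   compiler specialise [<] to a float comparison. *)
let fmin (x : float) y = if x < y then x else y

let () =
  (* Truncation toward zero: the sign of the input matters. *)
  Printf.printf "truncate 1.9 = %d\n" (truncate 1.9);        (* 1 *)
  Printf.printf "truncate (-1.9) = %d\n" (truncate (-1.9));  (* -1 *)
  Printf.printf "round 2.5 = %d\n" (round 2.5);              (* 3 *)
  (* -2.6 +. 0.5 = -2.1, truncated toward zero gives -2, not -3. *)
  Printf.printf "round (-2.6) = %d\n" (round (-2.6));        (* -2 *)
  (* [x < y] is false when either operand is NaN, so fmin returns y. *)
  Printf.printf "fmin 1.0 2.0 = %g\n" (fmin 1.0 2.0);        (* 1 *)
  Printf.printf "fmin nan 1.0 = %g\n" (fmin nan 1.0);        (* 1 *)
  Printf.printf "fmin 1.0 nan is NaN: %b\n"
    (Float.is_nan (fmin 1.0 nan))                            (* true *)
```

Any faster comparison sequence in a new back-end has to keep the NaN case returning `y`, exactly as both listings above do.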