Version française
Home     About     Download     Resources     Contact us    
Browse thread
testers wanted for experimental SSE2 back-end
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Dmitry Bely <dmitry.bely@g...>
Subject: Re: [Caml-list] testers wanted for experimental SSE2 back-end
On Mon, Mar 29, 2010 at 8:49 PM, Xavier Leroy <Xavier.Leroy@inria.fr> wrote:

>>> This is a call for testers concerning an experimental OCaml compiler
>>> back-end that uses SSE2 instructions for floating-point arithmetic.[...]
>>
>> I cannot provide any benchmark yet
>
> Too bad :-( I got very little feedback to my call: just one data point
> (thanks Gaetan).  Perhaps most OCaml users interested in numerical
> computations have switched to x86-64bits already?  At any rate, given
> such a lack of interest, this x86-32/SSE2 port isn't going to make it
> into the OCaml distribution.

It's a pity. Probably even my (future) benchmarks won't help...

>> but even not taking into account
>> the better register organization there are at least two areas where
>> SSE2 can outperform x87 significantly.
>>
>> 1. Float to integer conversion
>> Is quite inefficient on x87 because you have to explicitly set and
>> restore rounding mode.
>
> Right.  The mode change makes the conversion about 10x slower on x87
> than on SSE2.  Apparently, float->int conversion is uncommon is
> numerical code, otherwise we'd observe bigger speedups on real
> applications...
>
>> 2. Float compare
>> Does not set flags on x87 so
>
> The SSE2 code is prettier than the x87 code, but this doesn't seem to
> translate into a significant performance gain, in my limited testing.
>
>> As for SSE2 backend presented I have some thoughts regarding the code
>> (fast math functions via x87 are questionable,
>
> Most x86-32bits C libraries implement sin(), cos(), etc with the x87
> instructions, so I'm curious to know what you find objectionable here.

Microsoft's implementation for P4 and above is SSE2-based. And Intel
itself recommends to do so:

[quote]
What Is AM Library?
===================
Ever missed a sine or arctangent instruction among Intel Streaming
SIMD Extensions? Ever wished there were a way to calculate logarithm
or exponent in about a dozen cycles? Here is a new release of

Approximate Math Library (AM Library) -- a set of fast routines to
calculate math functions using Intel(R) Streaming SIMD Extensions
(SSE) and Streaming SIMD Extensions 2 (SSE2). The Library offers
trigonometric, reverse trigonometric, logarithmic, and exponential
functions for packed and scalar arguments. The processing speed is
many times faster than that of x87 instructions and even of table
lookups. The accuracy of AM Library routines can be adequate for many
applications. It is comparable with that of reciprocal SSE
instructions, and is hundreds times better than what is achievable
with lookup tables.

The AM Library is provided along with the full source code and a usage sample.
[end of quote]

http://www.intel.com/design/pentiumiii/devtools/AMaths.zip

Another interesting reading:

http://users.ece.utexas.edu/~adnan/comm/fast-trigonometric-functions-using.pdf

>> optimization of floating compare etc.) Where to discuss that - just
>> here or there is some entry in Mantis?
>
> Why not start on this list?  We'll move to private e-mail if the
> discussion becomes too heated :-)

OK

1. My variant of emit_float_test (in many cases eliminates extra jump).

let emit_float_test cmp neg arg lbl =
  let opcode_jp cmp =
    match (cmp, neg) with
      (Ceq, false) -> ("je", true)
    | (Ceq, true)  -> ("jne", true)
    | (Cne, false) -> ("jne", true)
    | (Cne, true)  -> ("je", true)
    | (Clt, false) -> ("jb", true)
    | (Clt, true)  -> ("jae", true)
    | (Cle, false) -> ("jbe", true)
    | (Cle, true)  -> ("ja", true)
    | (Cgt, false) -> ("ja", false)
    | (Cgt, true)  -> ("jbe", false)
    | (Cge, false) -> ("jae", true)
    | (Cge, true)  -> ("jb", false) in
  let branch_opcode, need_jp = opcode_jp cmp in
  let branch_opcode, arg0, arg1, need_jp  =
    match arg.(1).loc with
      Reg _ when need_jp ->
      (* swap args if it excludes jmp *)
      let (branch_opcode_swap, need_jp_swap) =
        opcode_jp
          (match cmp with
            Ceq -> Ceq
          | Cne -> Cne
          | Clt -> Cgt
          | Cle -> Cge
          | Cgt -> Clt
          | Cge -> Cle) in
      if need_jp_swap
      then (branch_opcode, arg.(0), arg.(1), true)
      else (branch_opcode_swap, arg.(1), arg.(0), false)
    | _ ->
      (branch_opcode, arg.(0), arg.(1), need_jp)
  in
  begin match cmp with
  | Ceq | Cne -> `	ucomisd	`
  | _         -> `	comisd	`
  end;
  `{emit_reg arg0}, {emit_reg arg1}\n`;
  let branch_if_not_comparable =
    if cmp = Cne then not neg else neg in
  if need_jp then
    if branch_if_not_comparable then begin
      `	jp	{emit_label lbl}\n`;
      `	{emit_string branch_opcode}	{emit_label lbl}\n`
    end else begin
      let next = new_label() in
      `	jp	{emit_label next}\n`;
      `	{emit_string branch_opcode}	{emit_label lbl}\n`;
      `{emit_label next}:\n`
    end
  else begin
    `	{emit_string branch_opcode}	{emit_label lbl}\n`
  end

2. My variant of fast math functions (see explanation above)

let emit_floatspecial = function
    "sqrt"  -> `	sqrtsd	`
  | _ -> assert false

3. Loading st(0) can be two instructions shorter :)

              `	sub	esp, 8\n`;
              `	fstp	REAL8 PTR [esp]\n`;
              `	movsd	{emit_reg dst}, REAL8 PTR [esp]\n`;
              `	add	esp, 8\n`

can be written as

              `	fstp	REAL8 PTR [esp-8]\n`;
              `	movlpd	{emit_reg dst}, REAL8 PTR [esp-8]\n`;

4. Unnecessary instruction in Lop(Iload(Single, addr))

            `	movss	{emit_reg dest}, REAL4 PTR {emit_addressing addr
i.arg 0}\n`;
            `	cvtss2sd {emit_reg dest}, {emit_reg dest}\n`

can be written as

            `	cvtss2sd {emit_reg dest}, REAL4 PTR {emit_addressing
addr i.arg 0}\n`

- Dmitry Bely