[Caml-list] matrix-matrix multiply - O'Caml is 6 times slower than C
From: malc <malc@p...>
Subject: Re: [Caml-list] Re: float boxing (was: matrix-matrix multiply)
On Wed, 23 Oct 2002, malc wrote:

> > A few questions in view of this.  First, on my machine (AMD Athlon
> > 1GHz running GNU/Linux), the timings give a preference to ref.ml
> > 
> > time ./ref 100000000
> > real    0m1.279s user    0m1.280s sys     0m0.000s
> > time ./ref2 100000000
> > real    0m1.411s user    0m1.380s sys     0m0.000s
> > 
> > What could be a reason for that?
> 
> I think the reason is simple, both are more or less nop operations,
> x or x.f is not used anywhere, hence no need to allocate the float.
> This short example highlights the difference:
> 
> let useref n = 
>   let x = ref 1.0 in
>   for i = 1 to n do x := !x +. 1.0 done;
>   !x
> 
> type t = { mutable f:float };;
> let userec n = 
>   let x = { f = 1.0 } in
>   for i = 1 to n do x.f <- x.f +. 1.0 done;
>   x.f
> 
> let _ =
>   let n = int_of_string Sys.argv.(2) in
>   Printf.printf "%f\n"
>   (if Sys.argv.(1) = "ref" then
>     useref n
>   else
>     userec n)
> 
> ref# time ./refrec rec 100000000
> 100000001.000000
> 
> real    0m2.283s
> user    0m2.280s
> sys     0m0.000s
> ref# time ./refrec ref 100000000
> 100000001.000000
> 
> real    0m1.916s
> user    0m1.910s
> sys     0m0.010s
> 
> More or less same machine here.

I believe I should add something here. Let's look at the inner loops.

Mutable version:
.L107:
	fld1
	faddl	(%ecx)
	fstpl	(%ecx)
	addl	$2, %eax
	cmpl	%ebx, %eax
	jle	.L107

Suboptimal but ok...

Reference version:
.L101:
.L103:	movl	young_ptr, %eax
	subl	$12, %eax
	movl	%eax, young_ptr
	cmpl	young_limit, %eax
	jb	.L104
	leal	4(%eax), %ecx
	movl	$2301, -4(%ecx)
	fld1
	faddl	(%esi)
	fstpl	(%ecx)
	movl	%ecx, %esi
	addl	$2, %ebx
	cmpl	%edx, %ebx
	jle	.L101

Lots of instructions plus boxing, and yet it's faster than the mutable one.
Wonders of modern CPUs.

My first take at the simplest asm code doing the same:
    mov eax, n
    fld1
    fld1
  LL:
    fadd st, st(1)
    dec eax
    jnz LL

    fstp result
    fstp st

ref# time ./c 100000000
100000001.000000

real    0m0.394s
user    0m0.390s
sys     0m0.000s

(It turned out that both gcc and icc produce similar code, give or take.)
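
The C source behind the ./c timing is not included in the message. For
readers who want to reproduce the comparison, here is a minimal sketch of
what such a loop might look like. The function name count_up and the
choice of a plain local double accumulator are assumptions, not taken from
the original post; the point is only that the accumulator is a local
variable, so the compiler can keep it in a register the way the
hand-written asm keeps it on the FPU stack.

```c
#include <assert.h>

/* Hypothetical helper, not from the original message.
   Starts at 1.0 and adds 1.0 to it n times, mirroring what useref and
   userec compute in the OCaml version. The accumulator is a local
   variable, so nothing is allocated and nothing has to be stored to
   memory inside the loop. */
double count_up(long n)
{
    assert(n >= 0);            /* assumed precondition: non-negative count */

    double x = 1.0;
    for (long i = 0; i < n; i++) {
        x += 1.0;              /* one floating-point add per iteration */
    }
    return x;
}
```

Calling count_up from a main that parses the iteration count from the
command line and prints the result with "%f\n" would give the same output
format as the runs above. Actual timings will depend on the compiler,
flags, and CPU.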

P.S. It would be interesting to see timings produced by P3/P4.

-- 
mailto:malc@pulsesoft.com
