Version française
Home     About     Download     Resources     Contact us    
Browse thread
Performance of threaded interpreter on hyper-threaded CPU
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Michel Schinz <Michel.Schinz@e...>
Subject: Re: Performance of threaded interpreter on hyper-threaded CPU
Xavier Leroy <Xavier.Leroy@inria.fr> writes:

>  > When the ratio given in the last column is greater than 1, then
>  > threaded code is faster than the switch-based solution. As you can
>  > see, this is only true in my case for non-hyper-threaded
>  > architectures.
>
> Which version(s) of gcc do you use for compiling the bytecode
> interpreter?  Is it the same version on all machines?

No, unfortunately not. Here are the various versions used (I realise
this variety is annoying, but I have no control over what software
runs on these machines):

1.25 GHz PPC G4
  powerpc-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1
   (Apple Computer, Inc. build 5247)
1.70 GHz P4
  gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)
3.0 GHz hyper-threaded P4
  gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)
dual 3.0 GHz hyper-threaded Xeon
  gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)

I'm aware of the problem due to gcc's cross-jumping "optimisation"
(described as you mention by Ertl in [1]). For the record, I tried
disabling it with -fno-crossjumping, but as Ertl mention, this didn't
change anything. However, judging by the versions of gcc I'm using,
cross-jumping should also be performed on the second machine, for
which threaded code provides a noticable gain...

However, your remark motivated me to measure the performance of a
single ocamlrun executable running on the various Pentium 4 I have at
hand, and the results are interesting...

Using the executable produced by gcc 3.2.2, I obtain the following
timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   6.34 |     4.82 |  1.3154 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.62 |     3.46 | 0.75723 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.36 |     2.59 |  1.2973 |

while using the executable produced by gcc 3.4.4, I obtain the
following timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   6.26 |     6.70 | 0.93433 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.51 |     6.15 | 0.40813 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.32 |     3.58 | 0.92737 |

Finally, I noticed that gcc 4.0.0 was also available on the second
machine, so I gave it a try, and obtained the following timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   7.27 |     6.62 |  1.0982 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.37 |     4.75 | 0.49895 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.91 |     3.56 |  1.0983 |

So the threaded code version of the OCaml VM is always slower on the
hyper-threaded P4, albeit not always by the same amount.

Michel.

[1] http://www.complang.tuwien.ac.at/forth/threading/