Performance of threaded interpreter on hyper-threaded CPU
Date: | 2006-04-18 (10:27) |
From: | Michel Schinz <Michel.Schinz@e...> |
Subject: | Re: Performance of threaded interpreter on hyper-threaded CPU |
Xavier Leroy <Xavier.Leroy@inria.fr> writes:

> > When the ratio given in the last column is greater than 1, then
> > threaded code is faster than the switch-based solution. As you can
> > see, this is only true in my case for non-hyper-threaded
> > architectures.
>
> Which version(s) of gcc do you use for compiling the bytecode
> interpreter? Is it the same version on all machines?

No, unfortunately not. Here are the various versions used (I realise
this variety is annoying, but I have no control over what software
runs on these machines):

1.25 GHz PPC G4
  powerpc-apple-darwin8-gcc-4.0.1 (GCC) 4.0.1 (Apple Computer, Inc. build 5247)

1.70 GHz P4
  gcc (GCC) 3.2.2 20030222 (Red Hat Linux 3.2.2-5)

3.0 GHz hyper-threaded P4
  gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)

dual 3.0 GHz hyper-threaded Xeon
  gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)

I'm aware of the problem due to gcc's cross-jumping "optimisation"
(described, as you mention, by Ertl in [1]). For the record, I tried
disabling it with -fno-crossjumping, but as Ertl mentions, this didn't
change anything. However, judging by the versions of gcc I'm using,
cross-jumping should also be performed on the second machine, for
which threaded code provides a noticeable gain...

However, your remark motivated me to measure the performance of a
single ocamlrun executable running on the various Pentium 4 machines I
have at hand, and the results are interesting...
Using the executable produced by gcc 3.2.2, I obtain the following
timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   6.34 |     4.82 |  1.3154 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.62 |     3.46 | 0.75723 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.36 |     2.59 |  1.2973 |

while using the executable produced by gcc 3.4.4, I obtain the
following timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   6.26 |     6.70 | 0.93433 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.51 |     6.15 | 0.40813 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.32 |     3.58 | 0.92737 |

Finally, I noticed that gcc 4.0.0 was also available on the second
machine, so I gave it a try, and obtained the following timings:

| architecture                      | switch | threaded |   ratio |
|-----------------------------------+--------+----------+---------|
| 1.70 GHz Pentium 4                |   7.27 |     6.62 |  1.0982 |
| 3.0 GHz Pentium 4, hyper-threaded |   2.37 |     4.75 | 0.49895 |
| dual 3.0 GHz Xeon, hyper-threaded |   3.91 |     3.56 |  1.0983 |

So the threaded code version of the OCaml VM is always slower on the
hyper-threaded P4, albeit not always by the same amount.

Michel.

[1] http://www.complang.tuwien.ac.at/forth/threading/
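
For reference, the two dispatch techniques compared in this message can be sketched as follows. This is a minimal illustration for a toy stack machine, not the OCaml bytecode interpreter: the opcodes, `run_switch`, `run_threaded` and `STACK_MAX` are all hypothetical names. The threaded version relies on GCC's labels-as-values extension and, for brevity, looks each label up in a table on every instruction rather than pre-translating the code into label addresses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical opcodes for a toy stack machine (not OCaml's bytecode). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

#define STACK_MAX 64

/* Switch-based dispatch: every instruction returns to the same switch,
   so one shared indirect branch serves all opcodes. */
int run_switch(const int *code)
{
    int stack[STACK_MAX];
    size_t sp = 0;

    for (;;) {
        switch (*code++) {
        case OP_PUSH:
            assert(sp < STACK_MAX);
            stack[sp++] = *code++;      /* operand follows the opcode */
            break;
        case OP_ADD:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] += stack[sp];
            break;
        case OP_MUL:
            assert(sp >= 2);
            sp--;
            stack[sp - 1] *= stack[sp];
            break;
        case OP_HALT:
            assert(sp >= 1);
            return stack[sp - 1];
        default:
            assert(0 && "unknown opcode");
            return 0;
        }
    }
}

/* Threaded dispatch (GCC labels-as-values extension): each instruction
   body ends with its own indirect jump to the next instruction. */
int run_threaded(const int *code)
{
    static void *const labels[] = { &&do_push, &&do_add, &&do_mul, &&do_halt };
    int stack[STACK_MAX];
    size_t sp = 0;

#define NEXT() goto *labels[*code++]

    NEXT();

do_push:
    assert(sp < STACK_MAX);
    stack[sp++] = *code++;
    NEXT();
do_add:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] += stack[sp];
    NEXT();
do_mul:
    assert(sp >= 2);
    sp--;
    stack[sp - 1] *= stack[sp];
    NEXT();
do_halt:
    assert(sp >= 1);
    return stack[sp - 1];

#undef NEXT
}
```

Running both functions on the program `{OP_PUSH, 2, OP_PUSH, 3, OP_ADD, OP_PUSH, 4, OP_MUL, OP_HALT}` computes (2 + 3) * 4. The expected benefit of the threaded form comes from having one indirect jump per instruction body; if the compiler merges those jumps back into a single shared one, which is the cross-jumping concern raised above, that benefit may disappear.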