[Camllist] Some clarifications to the language shootout page
 Siegfried Gonzi
[
Home
]
[ Index:
by date

by threads
]
[ Message by date: previous  next ] [ Message in thread: previous  next ] [ Thread: previous  next ]
[ Message by date: previous  next ] [ Message in thread: previous  next ] [ Thread: previous  next ]
Date:   (:) 
From:  Siegfried Gonzi <siegfried.gonzi@k...> 
Subject:  [Camllist] Some clarifications to the language shootout page 
Introduction  The language shootout page has some nice examples where Bigloo and OCaml stacked up well when compared to C. But I think this is only valid for the micro benchmarks. Take the matrixmatrix multiplication (MM) for example. Though, rarely ever will you be in dire need for a matrixmatrix multiplication, because there exists better methods for solving problems in linear algebra. However, the matrixmatrix multiplication is useful to check some facts. Method  I downloaded the codes for the MM from the language shootout page and adapted the examples as follows: The overall code scaffolding has been left as it was with the original version on the language shootout page: a) The dimension is increased to 512x512 b) The run is just performed for one MM c) The arrays are double precision Results and Discussion  Though, the published MM codes at the language shootout page do not use any sophisticated constructs like loop unrolling or the like, they can nevertheless be used because they are consistent within their specific programming language namely C, C++, Scheme (Bigloo), OCaml, Python and Fortran 90/95. However, I have assumed that the posted codes there are posted by masters of their specific language (not every code has been written by the owner of the language shootout page). The code can be found at the end of this post. It is easy to copy and paste it for own tests: memory has been observed via "top": compile real/usr/sys memory  bigloo Obench bench.scm 22s/22s/0.1s 6% bigloo Obench bench.scm 1m30s/1m30s/0.3s 30% ocamlopt bench.ml 38s/38s/0.1s 3% g++ O6 bench.c 8s/8s/0.1s 3% g++ O6 bench.cpp 8.5s/8.5s/0.1s 5% python bench.py 4m16s/4m16s/0.1s 6% ifc O3 bench.f90 13s/13s/0.04s 3% [suffixes: scm...Scheme (Bigloo); ml...OCaml; c...gnu C; cpp...gnu C++; py...Python; f90...Fortran 90/95] [Machine: laptop Celeron 1000MHz, RAM 256MB, SuSE Linux 8.0, gcc, bigloo 2.5a, ocaml 3.04, Python 2.2, Intel Fortran 90/95] The second Bigloo version uses ordinary "+" and "" operators. The first version uses Bigloo's operators "+fl", "+fx",... It is incomprehensible why the second ordinary Bigloo version uses that much on memory! For a dimension of 1024x1024 the overall timing picture is the same, except that the normal Bigloo version consumes 80% of the memory as compared to the C version which consumes 20%. Especially important is the fact that the Python version stacks up very well; you may not forget Python is just interpreted and not compiled. The elegant C++ version is of interest too, because it relies on the vector container and it is possible to completely avoid the awful "*". Some note to the Fortran version: Fortran starts at 1 (C starts at 0) when indexing arrays; and the first index runs faster (in C the second index runs faster). Reading from left to right is more natural (index wise), so using C arrays is way more pleasant. Bigloo uses C's array indexing scheme. I hate Fortran for this array indexing nightmare and hope I didn't mix things up because I was a little bit surprised that C is faster than Fortran 90. The Fortran 90/95 compiler stems from Intel and is the one and only free Fortran 90/95 compiler available for Linux (go to the Intel site and download the 50MB if you need a Fortran 90/95 compiler). Nowadays times are over were Fortran is that much faster than C/C++ on ordinary home PCs. Computing in Science and Engineering had once a nice article (Theurich, G., et al., 2001: "Making the FortranToC Transition: How Painful is it Really?", January/February 2001, p.21) about translating a 20,000 lines of code program (in the field of chemistry) from Fortran 77 to C. Everybody would have excpected that the C version runs slower than the original Fortran version, but it turned out that the C version runs even slightly faster than the Fortran version. Before you scream: a) I am sure there exists an ocaml option for further optimizations; "ocamlopt help" did not unveil it. b) The Bigloo (Scheme) version could surely be improved but how? /I mean not the code scheme itslef, because this should remain consistent as compared to the other code versions./ Fore example transposing one array is not allowed, because this would not only reduce Bigloo's execution time but also the time of the C version would be reduced by a factor of 2. Surely, there exists the caveat: But mum told me, we should not use the same algorithms for every programming language. But mum didn't tell that often the reality does not let you do so. And in addition mum did not tell you that you are not a Forth programmer. c) I post it here for soliciting comments. I could have gone to bed to construe my own explanations and theories. But this would likely result in a desaster, because most of the time it isn't how I would like that it is. The original language shootout page results on the MM are misleading: a) Pure integers are hardly ever used in reality b) Problems often do not fit into the cache c) A comparsion of the FFT would be more interesting Conclusion  Nevertheless Bigloo remains still a very decent compiler in my opinion, and I do not regret that I have made the transition from Python (which is one of the most confusing programming languages out there) to Scheme speak Bigloo. People should be more wary that consulting the language shootout page can result in misleading conclusions; as you know: benchmarking is black art. S. Gonzi APPENDIX: =============================== Normal Bigloo version original v.: (C) M. Serrano usage: bigloo Obench bench.scm time ./a.out =============================== (module benchmatrix) (define (mkmatrix rows cols) (let ((mx (makevector rows 0.0)) (count 1.0)) (do ((i 0 (+ i 1))) ((= i rows)) (let ((row (makevector cols 0.0))) (do ((j 0 (+ j 1))) ((= j cols)) (vectorset! row j count) (set! count (+ count 1.0))) (vectorset! mx i row))) mx)) (define (numcols mx) (let ((row (vectorref mx 0))) (vectorlength row))) (define (numrows mx) (vectorlength mx)) (define (mmult rows cols m1 m2) (let ((m3 (makevector rows 0.0))) (do ((i 0 (+ 1 i))) ((= i rows)) (let ((m1i (vectorref m1 i)) (row (makevector cols 0.0))) (do ((j 0 (+ 1 j))) ((= j cols)) (let ((val 0.0)) (do ((k 0 (+ 1 k))) ((= k cols)) (set! val (+ val (* (vectorref m1i k) (vectorref (vectorref m2 k) j))))) (vectorset! row j val))) (vectorset! m3 i row))) m3)) (define (domain size) (let ((mm 0.0) (m1 (mkmatrix size size)) (m2 (mkmatrix size size))) (set! mm (mmult size size m1 m2)) (let ((r0 (vectorref mm 0)) (r2 (vectorref mm 2)) (r3 (vectorref mm 3)) (r4 (vectorref mm 4))) (print (vectorref r0 0) " " (vectorref r2 3) " " (vectorref r3 2) " " (vectorref r4 4))))) (domain 512) ==== =============================== Bigloo version based on +fl,... original v.: (C) M. Serrano usage: bigloo Obench bench.scm time ./a.out =============================== (module benchmatrix (option (set! *genericity* #f))) (define (mkmatrix rows cols) (let ((mx (makevector rows 0.0)) (count 1.0)) (do ((i 0 (+fx i 1))) ((=fx i rows)) (let ((row (makevector cols 0.0))) (do ((j 0 (+fx j 1))) ((=fx j cols)) (vectorset! row j count) (set! count (+fl count 1.0))) (vectorset! mx i row))) mx)) (define (numcols mx) (let ((row (vectorref mx 0))) (vectorlength row))) (define (numrows mx) (vectorlength mx)) (define (mmult rows cols m1 m2) (let ((m3 (makevector rows 0.0))) (do ((i 0 (+fx 1 i))) ((=fx i rows)) (let ((m1i (vectorref m1 i)) (row (makevector cols 0.0))) (do ((j 0 (+fx 1 j))) ((=fx j cols)) (let ((val 0.0)) (do ((k 0 (+fx 1 k))) ((=fx k cols)) (set! val (+fl val (*fl (vectorref m1i k) (vectorref (vectorref m2 k) j))))) (vectorset! row j val))) (vectorset! m3 i row))) m3)) (define (domain size) (let ((mm 0.0) (m1 (mkmatrix size size)) (m2 (mkmatrix size size))) (set! mm (mmult size size m1 m2)) (let ((r0 (vectorref mm 0)) (r2 (vectorref mm 2)) (r3 (vectorref mm 3)) (r4 (vectorref mm 4))) (print (vectorref r0 0) " " (vectorref r2 3) " " (vectorref r3 2) " " (vectorref r4 4))))) (domain 512) ==== ========================= Ocaml version original v.: (C) M. Mottl usage: ocamlopt bench.ml time ./a.out ========================= let mkmatrix rows cols = let count = ref 1 and last_col = cols  1 and m = Array.make_matrix rows cols 0.0 in for i = 0 to rows  1 do let mi = m.(i) in for j = 0 to last_col do mi.(j) < float_of_int(!count); incr count done; done; m let rec inner_loop k v m1i m2 j = if k < 0 then v else inner_loop (k  1) (v +. m1i.(k) *. m2.(k).(j)) m1i m2 j let mmult rows cols m1 m2 m3 = let last_col = cols  1 and last_row = rows  1 in for i = 0 to last_row do let m1i = m1.(i) and m3i = m3.(i) in for j = 0 to last_col do m3i.(j) < inner_loop last_row 0.0 m1i m2 j done; done let size = 512 let m1 = mkmatrix size size and m2 = mkmatrix size size let m3 = Array.make_matrix size size 0.0;; mmult size size m1 m2 m3; Printf.printf "%f %f %f %f\n" m3.(0).(0) m3.(2).(3) m3.(3).(2) m3.(4).(4) ==== ======================= C version usage: g++ O6 bench.c time ./a.out ======================= #include <iostream> #include <stdlib.h> using namespace std; double **mkmatrix(int rows, int cols) { int i, j, count = 1; double **m = (double **) malloc(rows * sizeof(double *)); for (i=0; i<rows; i++) { m[i] = (double *) malloc(cols * sizeof(double)); for (j=0; j<cols; j++) { m[i][j] = count++; } } return(m); } void zeromatrix(int rows, int cols, double **m) { int i, j; for (i=0; i<rows; i++) for (j=0; j<cols; j++) m[i][j] = 0; } void freematrix(int rows, double **m) { while (rows > 1) { free(m[rows]); } free(m); } double **mmult(int rows, int cols, double **m1, double **m2, double **m3) { int i, j, k; double val; for (i=0; i<rows; i++) { for (j=0; j<cols; j++) { val = double(0.0); for (k=0; k<cols; k++) { val += m1[i][k] * m2[k][j]; } m3[i][j] = val; } } return(m3); } int main() { int i, SIZE=512; double **m1 = mkmatrix(SIZE, SIZE); double **m2 = mkmatrix(SIZE, SIZE); double **mm = mkmatrix(SIZE, SIZE); mm = mmult(SIZE, SIZE, m1, m2, mm); cout << mm[0][0] << " " << mm[2][3] << " " << mm[3][2] << " " << mm[4][4] << endl; freematrix(SIZE, m1); freematrix(SIZE, m2); freematrix(SIZE, mm); return(0); } ==== ========================= C++ version (C) S. Gonzi usage: g++ O6 bench.cpp time ./a.out ========================= #include <iostream> #include <vector> using namespace std; vector<vector<double> > mkmatrix(int rows, int cols) { double counter=double(0.0); vector<vector<double> > m; for(int i=0; i<rows; i++) { vector<double> row_vec; for(int j=0; j<cols; j++) { counter = counter + 1.0; row_vec.push_back(counter); } m.push_back(row_vec); } return(m); } vector<vector<double> > mmult(int rows, int cols, vector<vector<double> > m1, vector<vector<double> > m2, vector<vector<double> > m3) { double val; for(int i=0; i<rows; i++) { for(int j=0; j<rows; j++) { val = double(0.0); for(int k=0; k<cols; k++) { val += m1[i][k] * m2[k][j]; } m3[i][j] = val; } } return(m3); } int main() { int SIZE=512; vector<vector<double> > m1 = mkmatrix(SIZE,SIZE); vector<vector<double> > m2 = mkmatrix(SIZE,SIZE); vector<vector<double> > mm = mkmatrix(SIZE,SIZE); mm = mmult(SIZE,SIZE,m1,m2,mm); cout << mm[0][0] << " " << mm[2][3] << " " << mm[3][2] << " " << mm[4][4] << endl; return(0); } ==== ============================= Python version usage: time python bench.py ============================= def mkmatrix(rows, cols): count = 1.0 mx = [ None ] * rows for i in xrange(rows): mx[i] = [0] * cols for j in xrange(cols): mx[i][j] = count count += 1.0 return mx def mmult(rows, cols, m1, m2): m3 = [ None ] * rows for i in xrange( rows ): m3[i] = [0] * cols for j in xrange( cols ): val = 0.0 for k in xrange( cols ): val += m1[i][k] * m2[k][j] m3[i][j] = val return m3 size = 512 def main(): m1 = mkmatrix(size, size) m2 = mkmatrix(size, size) mm = mmult(size, size, m1, m2) print mm[0][0], mm[2][3], mm[3][2], mm[4][4] main() ==== ============================= Intel Fortran 90/95 version (C) Siegfried Gonzi usage: ifc O3 bench.f90 timer ./a.out ============================= program main integer:: size=512 double precision, POINTER,dimension(:,:):: m1, m2, mm call mkmatrix(size,size,m1) call mkmatrix(size,size,m2) call mkmatrix(size,size,mm) call mmult(size,size,m1,m2,mm) print *, mm(1,1) deallocate(m1) deallocate(m2) deallocate(mm) contains subroutine mkmatrix(rows, cols,m) integer, intent(in):: rows, cols double precision, POINTER, dimension(:,:):: m double precision:: counter integer:: i,j allocate(m(rows,cols)) counter = 1.0 do i=1,cols do j=1,rows counter = counter + 1.0 m(j,i) = counter end do end do end subroutine mkmatrix subroutine mmult(rows, cols, m1, m2, m3) integer, intent(in):: rows, cols double precision, dimension(rows,cols), intent(in):: m1, m2 double precision, dimension(rows,cols), intent(in out):: m3 double precision:: val integer:: i,j,k do i=1,rows do j=1,rows val = 0.0 do k=1,cols val = val + m1(k,i)*m2(j,k) end do m3(j,i) = val end do end do end subroutine mmult end program main ====  To unsubscribe, mail camllistrequest@inria.fr Archives: http://caml.inria.fr Bug reports: http://caml.inria.fr/bin/camlbugs FAQ: http://caml.inria.fr/FAQ/ Beginner's list: http://groups.yahoo.com/group/ocaml_beginners