MetaOcaml and high-performance [was: AST versus Ocaml]
From: Jon Harrop <jon@f...>
Subject: Re: [Caml-list] MetaOcaml and high-performance [was: AST versus Ocaml]
On Monday 09 November 2009 04:23:28 oleg@okmij.org wrote:
> Because offshoring produces a portable C or Fortran code file, you can
> use the code on a 32- or 64-bit platform. The reason native MetaOCaml
> without offshoring does not work on amd64 is because at that time
> OCaml didn't emit PIC code for amd64. So, dynamic linking was
> impossible. That problem has long been fixed in later versions of
> OCaml...
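
[For readers unfamiliar with the staging being discussed, here is a minimal
sketch of MetaOCaml's core constructs. It requires a MetaOCaml compiler (e.g.
BER MetaOCaml) and is purely illustrative, not code from the thread: brackets
.< >. delay a computation as a code value, and .~ splices one code value into
another.]

    (* Specialise x^n at generation time: for a known n, this emits
       straight-line multiplication code with the recursion unrolled. *)
    let rec spower n x =
      if n = 0 then .<1>.
      else if n mod 2 = 0 then .< let y = .~(spower (n / 2) x) in y * y >.
      else .< .~x * .~(spower (n - 1) x) >.

    (* Generate code for x^5, then compile and run it at runtime --
       this last step is where non-PIC amd64 code generation used to fail. *)
    let power5 = .< fun x -> .~(spower 5 .<x>.) >.
    let f = Runcode.run power5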

Has the problem been fixed in MetaOCaml?

> Fortunately, some people have considered MetaOCaml to be a viable
> option for performance users and have reported good results. For
> example,
>
> 	Tuning MetaOCaml Programs for High Performance
> 	Diploma Thesis of Tobias Langhammer.
> 	http://www.infosun.fmi.uni-passau.de/cl/arbeiten/Langhammer.pdf
>
> Here is a good quotation from the Introduction:
>
> ``This thesis proposes MetaOCaml for enriching the domain of
> high-performance computing by multi-staged programming. MetaOCaml extends
> the OCaml language.
> ...
>     Benchmarks for all presented implementations confirm that the
> execution time can be reduced significantly by high-level
> optimizations. Some MetaOCaml programs even run as fast as respective
> C implementations. Furthermore, in situations where optimizations in
> pure MetaOCaml are limited, computation hotspots can be explicitly or
> implicitly exported to C. This combination of high-level and low-level
> techniques allows optimizations which cannot be obtained in pure C
> without enormous effort.''

That thesis contains three benchmarks:

1. Dense float matrix-matrix multiply.

2. Blur of an int image matrix as convolution with a 3x3 stencil matrix.

3. Polynomial multiplication with distributed parallelism.

I don't know about the polynomial multiplication benchmark (suffice it to say 
that it does not leverage shared-memory parallelism, which is what performance 
users value in today's multicore era), but the code for the first two 
benchmarks is probably 10-100x slower than any decent implementation. For 
example, his fastest 2048x2048 matrix multiply takes 167s whereas Matlab takes 
only 3.6s here.
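
[For context on that gap: the unblocked triple loop below is the kind of 
naive baseline such staged benchmarks start from; it is illustrative, not the 
thesis code. Cache-blocked, vectorised BLAS kernels of the sort Matlab calls 
beat it by one to two orders of magnitude at n = 2048.]

```ocaml
(* Naive dense float matrix multiply: c = a * b, where a is n x k and
   b is k x m. O(n*k*m) flops with a cache-hostile access pattern on b. *)
let matmul a b =
  let n = Array.length a in
  let k = Array.length b in
  let m = Array.length b.(0) in
  let c = Array.make_matrix n m 0.0 in
  for i = 0 to n - 1 do
    for j = 0 to m - 1 do
      for l = 0 to k - 1 do
        c.(i).(j) <- c.(i).(j) +. a.(i).(l) *. b.(l).(j)
      done
    done
  done;
  c
```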

In essence, the performance gain (if any) from offshoring to C or Fortran is 
dwarfed by the lack of shared-memory parallelism.

-- 
Dr Jon Harrop, Flying Frog Consultancy Ltd.
http://www.ffconsultancy.com/?e