Version française
Home     About     Download     Resources     Contact us    
Browse thread
OCamlJIT2 vs. OCamlJIT
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Jon Harrop <jonathandeanharrop@g...>
Subject: RE: ocamlopt LLVM support (Was: [Caml-list] OCamlJIT2 vs. OCamlJIT)
Benedikt wrote:
> Also looking at the GHC work you mentioned, they also start at the Cmm
> level (slightly different C--, but comparable to ocamlopt), with mostly
> the same amount of type information available. And as you said, they
> did quite well in some cases.

Yes. They did well in benchmarks like hailstone:

collatzLen :: Int -> Word32 -> Int
collatzLen c 1 = c
collatzLen c n = collatzLen (c+1) $ if n `mod` 2 == 0 then n `div` 2 else
3*n+1
pmax x n = x `max` (collatzLen 1 n, n)
main = print $ foldl pmax (1,1) [2..1000000]

Because they unbox every int and tuple to recover C-like performance, just
as HLVM does thanks to value types and OCaml cannot and does not. So I would
not presume that OCaml+LLVM would be as effective as GHC+LLVM.

> > And faster tuples, ints, chars, complex numbers, low-dimensional
> > vectors/matrices, hash tables and so on. More types (e.g. int16 and
> > float32). Even arbitrary-precision arithmetic can benefit by keeping
> small
> > numbers unboxed when possible. Bigarrays disappear. The FFI gets
> simpler,
> > easier to use and more powerful (e.g. you can generate AoS layouts
> for
> > OpenGL). The benefits are enormous in the general case but that is
> beside
> > the point here. Moreover, this would never be accepted either because
> it
> > would degrade the performance of applications like Coq. If it is done
> at
> > all, such work must be kept separate from the OCaml compilers.
> 
> ocamlopt already supports float32 and int16, tho they are not exploited
> to the upper layers

But there is no data representation for them in the heap? Could OCaml's GC
handle a boxed float32 or a float32 array without support for that in the
lower layers?

> (don't know why, but I also don't know why one would need them).

The float32 type is used primarily to halve memory requirements and improve
locality but also to facilitate SSE. Vector arithmetic with LLVM can be well
over 4x faster than OCaml thanks to SSE, of course.

> Keeping Int32/Int64 values unboxed would be possible
> similar to what is done with doubles; maybe there is simply no need to
> do (honestly, I've never needed Int32/Int64 so far, int/Natint is
> usually what you want and these are optimized).

Regular int types are useful for applications like pseudo-random number
generators, image processing and signal analysis.

> >> 3. The current garbage collector is mostly straight-forward because
> of
> >> the data representation. No need to carry around/load type
> information,
> >
> > Note that conveying run-time type information also facilitates useful
> > features like generic printing.
> 
> Generic printing is currently implemented using the run-time type
> information (the three bits mentioned below).

I'm not sure what you mean by this. OCaml cannot express generic printing,
by which I mean the ability to print a value of an arbitrary type at
run-time and have it pretty printed appropriately according to its type.

For example, the F# code:

  printf "%A" [1;2;3]

prints "[1;2;3]".

Similarly in HLVM:

  # print(create(3, 1));;
  [|1; 1; 1|]

OCaml's top level can pretty print only because it knows the type of the
expression. Ordinary compiled user code has no such information because the
OCaml only retains partial type information in its data representation so
you cannot do this in OCaml.

> >> you just need the following bits of information: Is it a block in
> the
> >> heap? Does it contain pointers? If so how many? All this information
> is
> >> immediately available with the current data representation (also
> >> improves cache locality of the GC loops).
> >
> > That information is also immediately available with HLVM's design.
> Each
> > value contains a pointer to its run-time type. Each run-time type
> contains a
> > pointer to a "visit" function for the GC that returns the references
> in any
> > value of that type.
> 
> This is indeed possible, yes, and also implemented by most Java VMs;
> each objects ships a type-info pointer. OCaml avoids the type-info
> pointer using clever header field encoding. A matter of taste probably.

Beyond taste, it facilitates useful features like generic printing and
reflection.

> >> So, it is really worth to spend years on a new data representation
> >> (including fixing all C bindings, etc.)? Just to get better floating
> >> point performance? Integer performance will be almost the same,
> >> avoiding the shift/tag bit just replaces an "addq r1, r2; subq $1,
> r2"
> >> with "addq r1, r2"; even doing this thousands of times will not
> cause a
> >> noticeable difference.
> >
> > On the contrary, there are many example of huge performance gains
> other than
> > floating point. The int-based random number generator from the
> SciMark2
> > benchmark is 6x faster with LLVM than with OCaml. Generic hash tables
> are
> > 17x faster in F# than Java because of value types. And so on.
> 
> I doubt that this is really related to the use of "value types", I also
> doubt that the integer performance improvements are related to the
> missing "subq" (must have been an unusual CPU then).

The "subq" is irrelevant, yes. OCaml's poor performance is due to
unnecessary boxing. Value types solve that problem in the general case, i.e.
in the heap as well as on the stack.

> The Java performance is probably related to the GC in most JVMs,

Java's performance is also dire because of boxing. Specifically, the
key-value pairs and often the keys and values themselves. Firstly, that is
due to type erasure instead of reified generics. Secondly, even if you wrote
a type-specialized version the key-value pair still cannot be unboxed
without value types. Thirdly, these limitations of the VM lead to the use of
suboptimal concrete data structures. Specifically, open hashing with chains
instead of closed hashing.

> which is not as
> well optimized for high speed allocation and compaction as the GCs
> found in most runtimes of functional programming languages. But this is
> just guessing, we'd need some facts to be sure what's the cause. I
> suggest you present some proof-of-concept code in C or LLVM, simply
> using malloc() for memory allocation, comparing a straight-forward
> implementation to a "value type" implementation (with everything else
> the same). If there's still a 17x improvement, I promise to rewrite the
> whole OCaml system during the next three months to support value types.

Again, this is most easily studied using HLVM. A similar program using value
types in HLVM that fills an array of unboxed pairs of 32-bit ints looks like
this and takes 0.39s to run:

let rec fill((a, i0, i2) : (int * int) array * int * int) : unit =
  let di = i2 - i0 in
  if di=0 then () else
    if di=1 then
      a.[i0] <- (i0, i2)
    else
      ( let i1 = i0 + di/2 in
        ( fill(a, i0, i1);
          fill(a, i1, i2) ) );;

let n = 10000000 in fill(create(n, (0, 0)), 0, n);;

The following version using a data representation similar to that of OCaml
and Java by boxing the key-value pair that is allocated on the heap using
"malloc" and, consequently, this version takes 2.37s to run:

type keyValue = KeyValue of int * int;;

let rec fill((a, i0, i2) : keyValue array * int * int) : unit =
  let di = i2 - i0 in
  if di=0 then () else
    if di=1 then
      a.[i0] <- (KeyValue(i0, i2))
    else
      ( let i1 = i0 + di/2 in
        ( fill(a, i0, i1);
          fill(a, i1, i2) ) );;

let n = 10000000 in fill(create(n, KeyValue(0, 0)), 0, n);;

So that is already 6× slower and the problem gets worse when you box the
keys and values themselves within the boxed pair.

You might try to blame the poor performance on malloc but OCaml is over 12×
slower with the following similar operation (that I had to do because of its
16Mb limit):

Array.init 1000 (fun i -> Array.init 10000 (fun j ->
  let k = i*1000+j in Int32.of_int k, Int32.of_int k));;

Generational garbage collection further exacerbates the problem by
unnecessarily copying every allocated value from the young to old
generation. Remembered sets also further exacerbate the problem by having
the write barrier record a pointer at every write. Hence the huge
performance discrepancies between OCaml/Java and HLVM/.NET in this context.

Cheers,
Jon.