Version française
Home     About     Download     Resources     Contact us    
Browse thread
large hash tables
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Richard Jones <rich@a...>
Subject: Re: [Caml-list] large hash tables
On Tue, Feb 19, 2008 at 03:01:46PM -0800, John Caml wrote:
> I need to load roughly 100 million items into a memory based hash table,
> where each key is an integer and each value is an integer and a float. Under
> C++ stdlib's map library, this requires 800 MB of memory.

You can do as well as C++, but you need to understand how values are
stored in the OCaml runtime.

In C++ you're using an int and a (single-precision) float, and
presumably you're storing them packed into an array.  Something like
this?

  struct item { int i; float f; };
  struct item array[100000000];

The storage requirements for this are 100 * 10^6 * 8 bytes = 800 MB,
as you say.

In OCaml things are a bit different.  Firstly you are using
double-precision floats ("float" in OCaml is the same as double in C),
so those are 8 bytes.  Secondly you're using OCaml ints, and since you
must be on a 64 bit machine, those are 64 bits wide (although in OCaml
you can only use 63 of those bits).  So the minimum for storing an int
and a float in OCaml is 16 bytes.

It gets worse though because you're using a (int * float) tuple.
These are stored particularly inefficiently in OCaml.  There is an 8
byte header, 8 bytes for the int, then the float is a boxed pointer
[stored in a separate allocation] which means an 8 byte pointer,
another 8 byte header plus 8 bytes for the double-precision float.  So
if I've got the math right, the cost of storing each tuple would be:

  8 +              8 +   8 +               8 +              8
  header of tuple  int   pointer to float  header of float  float
  = 40 bytes.

Storing those in a simple array means another pointer per tuple
(tuples aren't packed into the array, but pointed to), so that's a
total of around:

  100 * 10^6 * (8 + 40) = 4.8 GB.

Luckily there are much better ways to store this data in OCaml.  If
you want to keep it simple, use two arrays, a 'float array' to store
double-precision floats and an 'int array' to store 63 bit ints.
These are both unboxed in OCaml, so this works out as:

  float array:   100 * 10^6 * 8
  int array:   + 100 * 10^6 * 8 = 1.6 GB

You can do better (the same as C++) by using Bigarray.  The basic
saving of Bigarray is that you can store single-precision float and 32
bit integers directly in them.  So either go for two Bigarrays, or use
one Bigarray (of int32_elt type and length 200 million elements) and a
module wrapping the details of inserting and getting the pairs (also
the functions Int32.bits_of_float and Int32.float_of_bits will help
you here).

Rich.

-- 
Richard Jones
Red Hat