Browse thread
large hash tables
[
Home
]
[ Index:
by date
|
by threads
]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
| Date: | -- (:) |
| From: | Richard Jones <rich@a...> |
| Subject: | Re: [Caml-list] large hash tables |
On Tue, Feb 19, 2008 at 03:01:46PM -0800, John Caml wrote:
> I need to load roughly 100 million items into a memory based hash table,
> where each key is an integer and each value is an integer and a float. Under
> C++ stdlib's map library, this requires 800 MB of memory.
You can do as well as C++, but you need to understand how values are
stored in the OCaml runtime.
In C++ you're using an int and a (single-precision) float, and
presumably you're storing them packed into an array. Something like
this?
struct item { int i; float f; };
struct item array[100000000];
The storage requirements for this are 100 * 10^6 * 8 bytes = 800 MB,
as you say.
In OCaml things are a bit different. Firstly you are using
double-precision floats ("float" in OCaml is the same as double in C),
so those are 8 bytes. Secondly you're using OCaml ints, and since you
must be on a 64 bit machine, those are 64 bits wide (although in OCaml
you can only use 63 of those bits). So the minimum for storing an int
and a float in OCaml is 16 bytes.
It gets worse though because you're using a (int * float) tuple.
These are stored particularly inefficiently in OCaml. There is an 8
byte header, 8 bytes for the int, then the float is a boxed pointer
[stored in a separate allocation] which means an 8 byte pointer,
another 8 byte header plus 8 bytes for the double-precision float. So
if I've got the math right, the cost of storing each tuple would be:
8 + 8 + 8 + 8 + 8
header of tuple int pointer to float header of float float
= 40 bytes.
Storing those in a simple array means another pointer per tuple
(tuples aren't packed into the array, but pointed to), so that's a
total of around:
100 * 10^6 * (8 + 40) = 4.8 GB.
Luckily there are much better ways to store this data in OCaml. If
you want to keep it simple, use two arrays, a 'float array' to store
double-precision floats and an 'int array' to store 63 bit ints.
These are both unboxed in OCaml, so this works out as:
float array: 100 * 10^6 * 8
int array: + 100 * 10^6 * 8 = 1.6 GB
You can do better (the same as C++) by using Bigarray. The basic
saving of Bigarray is that you can store single-precision float and 32
bit integers directly in them. So either go for two Bigarrays, or use
one Bigarray (of int32_elt type and length 200 million elements) and a
module wrapping the details of inserting and getting the pairs (also
the functions Int32.bits_of_float and Int32.float_of_bits will help
you here).
Rich.
--
Richard Jones
Red Hat