Version française
Home     About     Download     Resources     Contact us    
Browse thread
Re: large hash tables
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Damien Doligez <damien.doligez@i...>
Subject: Re: [Caml-list] Re: large hash tables

On 2008-02-20, at 06:18, John Caml wrote:

> However, the memory usage is still pretty bad...it takes nearly an
> order of magnitude more memory than the equivalent C++ program. While
> the C++ program required 800 MB, my ocaml program requires roughly 6
> GB. Am I doing something very inefficiently? My revised code appears
> below.

In order to optimize for space, you need to understand how OCaml
represents things in memory.

>        match murList with
>            | m::u::r::[] ->
>                let rFloat = float_of_string r
>                and mInt = int_of_string m
>                and uInt = int_of_string u in
>
>                let newElement = (uInt, rFloat)
>                and oldList = movieMajor.(mInt) in
>                let newList = List.rev_append [newElement] oldList in
>                Array.set movieMajor mInt newList;


Here, you have an array of lists of pairs.

The space used by the array is 1 + length + size of contents.

The space used by a list is 3 * length + size of the contents.

The space used by a pair is 3 + size of contents.

The space used by an int is 0 (it's already counted in the container).

The space used by a float is a bit tricky: it's normally 3, but
float arrays and records of floats are special: it's 1 in this case.

All these numbers are in 4-byte words, assuming a 32-bit architecture.
On a 64-bit machine, the unit is 8-byte words and floats are size 2
(normally) or 0 (in arrays and records of floats).

Here your memory usage will be : 1 + 17770 + 300M + 300M + 200M = 800M,
or about 6.4 GB (assuming you have a 64-bit machine).

You can do much better by using a tuple of arrays instead of an array
of lists of tuples.  If you use three arrays (for m, u, and r), read
all the data in the arrays, and then do a radix sort (and fill a small
indexing array), you'll be down to 2.4 GB of memory.  If that's still
too much, you could use bigarrays with 32-bit ints and floats, and get
down to 1.2 GB.

I don't see how you can get 800M in your C program unless your ints
are 16-bit, in which case you can do it with bigarrays too.

The only problem with all this, is that you'll have to write the
sorting code yourself.

In conclusion OCaml is not really well-suited to tight memory
situations, but usually you can manage.

-- Damien