Version française
Home     About     Download     Resources     Contact us    
Browse thread
large hash tables
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Gerd Stolpmann <info@g...>
Subject: Re: [Caml-list] large hash tables

Am Dienstag, den 19.02.2008, 15:01 -0800 schrieb John Caml:
> I need to load roughly 100 million items into a memory based hash
> table, where each key is an integer and each value is an integer and a
> float. Under C++ stdlib's map library, this requires 800 MB of memory.
> Under my naive porting of my C++ program to Ocaml, only 12 million
> rows get loaded before I get the program crashes with the message
> "Fatal error: exception Stack_overflow". At the time of the crash,
> nearly 2 GB of memory are in use. (My machine -- AMD64 architecture --
> has 8 GB of memory, so insufficient memory is not the issue.)
> 
> Two questions:
> 
> 1. How can I overcome the Stack_overflow error? (I'm already compiling
> with ocamlopt rather than ocamlc, which seems to have increased the
> stack size from 500 MB to 2 GB.) I'd like to be able to use the full 8
> GB on my machine.
> 2. How can I implment a more efficient hash table? My Ocaml program
> seems to require 10x more memory than under C++.
> 
> My code appears below. Thank you very much.

The stack overflow is very strange. I would expect it only for one data
case: that there is a single key that occurs very often (several 10000
times). In this case, there can be a stack overflow when the hash table
is resized. It is possible to work around by handling this case
explicitly.

Gerd


> 
> --John
> 
> 
> --------------
> exception SplitError
> 
> let read_whole_chan chan =
>     let movieMajor = Hashtbl.create 777777 in
> 
>     let rec loadLines count =
>         let line = input_line chan in
>         let murList = Pcre.split line in
>         match murList with
>             | m::u::r::[] ->
>                 let rFloat = float_of_string r in
>                 Hashtbl.add movieMajor m (u, rFloat);
>                 if (count mod 10000) == 0 then Printf.printf "count: %
> d, m: %s, u: %s, r: %f \n" count m u rFloat;
>                 loadLines (count + 1)
>             | _ -> raise SplitError
>   in
>         
>     try
>         loadLines 0
>     with
>         End_of_file -> ()
>     ;;
> 
> let read_whole_file filename =
>     let chan = open_in filename in
>     read_whole_chan chan
>     ;;
> 
> let filename = Sys.argv.(1);;
> 
> let str = read_whole_file filename;;
> 
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------