Version française
Home     About     Download     Resources     Contact us    
Browse thread
large hash tables
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Francois Rouaix <francois.rouaix@g...>
Subject: Re: [Caml-list] large hash tables
In the resizing code there is a non-tailrec function (insert_bucket). This
is most likely causing the stack overflow, as I can't see any other non tail
recursive function at first glance. Looks like it's not tail rec in order to
maintain an invariant on the order of elements. If that invariant is not
useful to you, you might want to write a slightly different version of the
Hashtbl module, where insert_bucket would be tail rec.
Also, during resizing, memory usage will be twice the memory required for
the table (roughly), since the bucket array remains available until the
resize is completed, so all the bucket contents exist in two versions (old
and new). You might want to stick to a large initial size and do not attempt
resizing.
--f


2008/2/19 John Caml <camljohn42@gmail.com>:

> I need to load roughly 100 million items into a memory based hash table,
> where each key is an integer and each value is an integer and a float. Under
> C++ stdlib's map library, this requires 800 MB of memory. Under my naive
> porting of my C++ program to Ocaml, only 12 million rows get loaded before I
> get the program crashes with the message "Fatal error: exception
> Stack_overflow". At the time of the crash, nearly 2 GB of memory are in use.
> (My machine -- AMD64 architecture -- has 8 GB of memory, so insufficient
> memory is not the issue.)
>
> Two questions:
>
> 1. How can I overcome the Stack_overflow error? (I'm already compiling
> with ocamlopt rather than ocamlc, which seems to have increased the stack
> size from 500 MB to 2 GB.) I'd like to be able to use the full 8 GB on my
> machine.
> 2. How can I implment a more efficient hash table? My Ocaml program seems
> to require 10x more memory than under C++.
>
> My code appears below. Thank you very much.
>
> --John
>
>
> --------------
> exception SplitError
>
> let read_whole_chan chan =
>     let movieMajor = Hashtbl.create 777777 in
>
>     let rec loadLines count =
>         let line = input_line chan in
>         let murList = Pcre.split line in
>         match murList with
>             | m::u::r::[] ->
>                 let rFloat = float_of_string r in
>                 Hashtbl.add movieMajor m (u, rFloat);
>                 if (count mod 10000) == 0 then Printf.printf "count: %d,
> m: %s, u: %s, r: %f \n" count m u rFloat;
>                 loadLines (count + 1)
>             | _ -> raise SplitError
>   in
>
>     try
>         loadLines 0
>     with
>         End_of_file -> ()
>     ;;
>
> let read_whole_file filename =
>     let chan = open_in filename in
>     read_whole_chan chan
>     ;;
>
> let filename = Sys.argv.(1);;
>
> let str = read_whole_file filename;;
>
>
> _______________________________________________
> Caml-list mailing list. Subscription management:
> http://yquem.inria.fr/cgi-bin/mailman/listinfo/caml-list
> Archives: http://caml.inria.fr
> Beginner's list: http://groups.yahoo.com/group/ocaml_beginners
> Bug reports: http://caml.inria.fr/bin/caml-bugs
>
>