zcat vs CamlZip
Date: 2006-08-29 (19:54)
From: Gerd Stolpmann <info@g...>
Subject: Re: [Caml-list] Re: zcat vs CamlZip
On Tuesday, 29.08.2006, at 15:15 -0400, Sam Steingold wrote:
> Bardur Arantsson wrote:
> > Sam Steingold wrote:
> >> I read through a huge *.gz file.
> >> I have two versions of the code:
> > [--snip--]
> >>
> >> let buf = Buffer.create 1024
> >> let gz_input_line gz_in char_counter line_counter =
> >>   Buffer.clear buf;
> >>   let finish () = incr line_counter; Buffer.contents buf in
> >>   let rec loop () =
> >>     let ch = Gzip.input_char gz_in in
> > 
> > This is your most likely culprit. Any kind of "do this for every 
> > character" is usually insanely expensive when you can do it in bulk.
> > (This is especially true when needing to do system calls, or if the 
> > called function cannot be inlined.)
> > 
> yes, I thought about it, but I assumed that the ocaml gzip module 
> inlines  Gzip.input_char (obviously the gzip module needs an internal 
> cache so Gzip.input_char does not _always_ translate to a system call, 
> most of the time it just pops a char from the internal buffer).

This may be a GODI issue, because gzip.cmx is not installed, and inlining
needs the .cmx file. However, I am not sure whether input_char can be
inlined at all. You can find that out with the dumpapprox tool:

dumpapprox path/to/foo.cmx

Look for the "Approximation" section. If the function (or rather its entry
point) is listed with the "(inline)" flag, it can be inlined; otherwise it
cannot.

> at any rate, do you really expect that using Gzip.input and then 
> searching the result for a newline, slicing and dicing to get the 
> individual input lines, &c &c would be faster?

The question is whether you finally get a loop that can be completely
executed in the CPU's cache, and how many variables need to be read and
written in a loop cycle. Whether functions are inlined or not is usually
not that important. My experience is that the Gzip.input method is
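
To make the bulk approach concrete, here is a minimal sketch of a line reader that refills a block buffer in one call and then scans it for newlines in a tight loop. The names reader, make_reader, next_line and read_of_string are made up for illustration. The read function is assumed to have the shape of Gzip.input applied to an open channel (store up to len bytes at pos, return the count, 0 at end of input); in a camlzip version where Gzip.input takes a string buffer, the types would need adjusting. It is not a measured or tuned implementation.

```ocaml
(* Chunked line reader: refill a block buffer in bulk, then scan it for
   newlines. [read buf pos len] must return the number of bytes stored,
   or 0 at end of input (assumed to match the shape of [Gzip.input gz_in]). *)
type reader = {
  read : Bytes.t -> int -> int -> int;
  chunk : Bytes.t;          (* block buffer, refilled in bulk *)
  mutable pos : int;        (* next unread byte in [chunk] *)
  mutable len : int;        (* number of valid bytes in [chunk] *)
  line : Buffer.t;          (* accumulates the current line *)
}

let make_reader ?(size = 65536) read =
  { read; chunk = Bytes.create size; pos = 0; len = 0;
    line = Buffer.create 1024 }

(* Return the next line without its trailing '\n'. Raise End_of_file when
   the input is exhausted and no partial line remains. *)
let next_line r =
  Buffer.clear r.line;
  let rec loop () =
    if r.pos >= r.len then begin
      (* Buffer exhausted: one bulk read instead of one call per char. *)
      r.len <- r.read r.chunk 0 (Bytes.length r.chunk);
      r.pos <- 0
    end;
    if r.len = 0 then begin
      (* End of input: hand back a final unterminated line, if any. *)
      if Buffer.length r.line = 0 then raise End_of_file
      else Buffer.contents r.line
    end else begin
      (* Scan the unread part of the chunk for a newline. *)
      let i = ref r.pos in
      while !i < r.len && Bytes.get r.chunk !i <> '\n' do incr i done;
      Buffer.add_subbytes r.line r.chunk r.pos (!i - r.pos);
      if !i < r.len then begin
        r.pos <- !i + 1;            (* skip the newline itself *)
        Buffer.contents r.line
      end else begin
        r.pos <- r.len;             (* line continues in the next chunk *)
        loop ()
      end
    end
  in
  loop ()

(* In-memory stand-in for a gzip channel, for trying the reader out. *)
let read_of_string s =
  let off = ref 0 in
  fun buf pos len ->
    let n = min len (String.length s - !off) in
    Bytes.blit_string s !off buf pos n;
    off := !off + n;
    n
```

With camlzip installed, something like make_reader (Gzip.input gz_in) should plug in the real decompressor, under the assumption stated above. The inner while loop touches only the chunk, an index and the bound, which is the kind of small loop discussed above.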

Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714