zcat vs CamlZip
| From: | Gerd Stolpmann <info@g...> |
| Subject: | Re: [Caml-list] Re: zcat vs CamlZip |
On Tuesday, 29.08.2006, at 15:15 -0400, Sam Steingold wrote:
> Bardur Arantsson wrote:
> > Sam Steingold wrote:
> >> I read through a huge *.gz file.
> >> I have two versions of the code:
> > [--snip--]
> >>
> >> let buf = Buffer.create 1024
> >> let gz_input_line gz_in char_counter line_counter =
> >>   Buffer.clear buf;
> >>   let finish () = incr line_counter; Buffer.contents buf in
> >>   let rec loop () =
> >>     let ch = Gzip.input_char gz_in in
> >
> > This is your most likely culprit. Any kind of "do this for every
> > character" is usually insanely expensive when you can do it in bulk.
> > (This is especially true when needing to do system calls, or if the
> > called function cannot be inlined.)
> >
>
> yes, I thought about it, but I assumed that the ocaml gzip module
> inlines Gzip.input_char (obviously the gzip module needs an internal
> cache so Gzip.input_char does not _always_ translate to a system call,
> most of the time it just pops a char from the internal buffer).

This may be a GODI issue, because gzip.cmx is not installed, and
inlining needs the .cmx file. However, I am not sure whether input_char
can be inlined at all. You can find that out with the dumpapprox tool:

  dumpapprox path/to/foo.cmx

Look for the "Approximation" section. If the function (or rather, the
entry point) is listed with the "(inline)" flag, it can be inlined;
otherwise it cannot.

> at any rate, do you really expect that using Gzip.input and then
> searching the result for a newline, slicing and dicing to get the
> individual input lines, &c &c would be faster?

The question is whether you end up with a loop that can be executed
entirely within the CPU's cache, and how many variables need to be read
and written per loop iteration. Whether functions are inlined or not is
usually not that important.

My experience is that the Gzip.input method is faster.

Gerd
-- 
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
Phone: +49-6151-153855                  Fax: +49-6151-997714
------------------------------------------------------------
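
For illustration, here is a minimal sketch of the bulk approach discussed above: read a large chunk, scan it for newlines, and carry any partial line over to the next chunk. It is not code from the thread. The names fold_lines, reader_of_string and count_lines are hypothetical, and the reader is abstracted as a function read buf pos len so that the sketch runs with the standard library alone. The assumption is that Gzip.input gz_in from CamlZip could be passed as ~read, since it fills a byte buffer and returns 0 at end of input.

```ocaml
(* Fold [f] over the lines delivered by a chunked reader.
   [read buf pos len] must store at most [len] bytes into [buf] at [pos]
   and return how many it stored, with 0 meaning end of input.
   Assumption: [Gzip.input gz_in] from CamlZip has this shape. *)
let fold_lines ?(chunk_size = 65536) ~read f init =
  let chunk = Bytes.create chunk_size in
  (* Holds the unfinished line that straddles a chunk boundary. *)
  let pending = Buffer.create 1024 in
  (* Split chunk.[start .. stop-1] at newlines. *)
  let rec split acc start stop =
    match Bytes.index_from_opt chunk start '\n' with
    | Some i when i < stop ->
        Buffer.add_subbytes pending chunk start (i - start);
        let line = Buffer.contents pending in
        Buffer.clear pending;
        split (f acc line) (i + 1) stop
    | _ ->
        (* No newline in the valid part: keep the tail for the next chunk. *)
        Buffer.add_subbytes pending chunk start (stop - start);
        acc
  in
  let rec loop acc =
    let n = read chunk 0 chunk_size in
    if n = 0 then
      (* End of input: emit a final line that lacks a trailing newline. *)
      if Buffer.length pending > 0 then f acc (Buffer.contents pending)
      else acc
    else loop (split acc 0 n)
  in
  loop init

(* Hypothetical stand-in for a gzip channel, backed by a string. *)
let reader_of_string s =
  let pos = ref 0 in
  fun buf ofs len ->
    let n = min len (String.length s - !pos) in
    Bytes.blit_string s !pos buf ofs n;
    pos := !pos + n;
    n

(* Count lines, as the line_counter in the quoted code does. *)
let count_lines ~read = fold_lines ~read (fun n _line -> n + 1) 0
```

With CamlZip installed, the call would presumably look like count_lines ~read:(Gzip.input gz_in). The inner loop touches only the chunk, the pending buffer and two indices, which is the property the reply above cares about; whether it actually beats the per-character version would still need to be measured on the real file.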