zcat vs CamlZip
Date: 2006-08-29 (19:49)
From: Bárður Árantsson <spam@s...>
Subject: Re: zcat vs CamlZip

Sam Steingold wrote:
> Bardur Arantsson wrote:
>> Sam Steingold wrote:
>>> I read through a huge *.gz file.
>>> I have two versions of the code:
>> [--snip--]
>>>
>>> let buf = Buffer.create 1024
>>> let gz_input_line gz_in char_counter line_counter =
>>>   Buffer.clear buf;
>>>   let finish () = incr line_counter; Buffer.contents buf in
>>>   let rec loop () =
>>>     let ch = Gzip.input_char gz_in in
>>
>> This is your most likely culprit. Any kind of "do this for every
>> character" is usually insanely expensive when you can do it in bulk.
>> (This is especially true when needing to do system calls, or if the
>> called function cannot be inlined.)
>>
>
> yes, I thought about it, but I assumed that the ocaml gzip module
> inlines Gzip.input_char (obviously the gzip module needs an internal
> cache so Gzip.input_char does not _always_ translate to a system call,
> most of the time it just pops a char from the internal buffer).

You can also easily try this in C by contrasting fgetc() with fgets().
The difference is _huge_ even if they both do comparable numbers of
syscalls -- assuming that the buffering is identical (I haven't checked,
but I think it is a reasonable assumption).

In the C case, inlining is not really guaranteed, but I don't think it
is in OCaml either, though I honestly don't know. You'd have to check
the assembler output to see whether the call gets inlined.

Inlining aside, memory prefetching probably also makes a difference in
favor of reading in bulk and then processing in bulk.

> at any rate, do you really expect that using Gzip.input and then
> searching the result for a newline, slicing and dicing to get the
> individual input lines, &c &c would be faster?

I would guess so, yes. (There may of course be other reasons for a large
portion of the difference, as others have pointed out.)

-- 
Bardur Arantsson
<bardurREMOVE@THISscientician.net>

- 'Blackmail' is such an ugly word. I prefer 'extortion'. The X makes
it sound cool.
Bender, 'Futurama'
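
For anyone who wants to try the fgetc() versus fgets() comparison mentioned in the message, here is a minimal sketch. The function names count_lines_fgetc and count_lines_fgets and the 4096-byte buffer are made up for illustration, and the code assumes text input without embedded NUL bytes. It only shows the two reading styles producing the same line count; it does not measure anything by itself.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: count newlines one character at a time.
 * Every byte costs one call to fgetc(). */
long count_lines_fgetc(FILE *fp)
{
    long lines = 0;
    int ch;

    while ((ch = fgetc(fp)) != EOF) {
        if (ch == '\n')
            lines++;
    }
    return lines;
}

/* Hypothetical helper: count newlines by fetching up to a buffer's
 * worth of one line per fgets() call. Assumes no NUL bytes in the
 * input, since strlen() stops at the first one. */
long count_lines_fgets(FILE *fp)
{
    char buf[4096]; /* arbitrary size chosen for this sketch */
    long lines = 0;

    while (fgets(buf, sizeof buf, fp) != NULL) {
        size_t len = strlen(buf);

        /* A line longer than the buffer arrives in several chunks;
         * only the final chunk ends in '\n'. */
        if (len > 0 && buf[len - 1] == '\n')
            lines++;
    }
    return lines;
}
```

Calling each function on the same large file (with a rewind() in between) and timing the two calls, for example with clock() or the shell's time command, is one way to check how large the difference is on a given system.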