zcat vs CamlZip
| From: | Bárður Árantsson <spam@s...> |
| Subject: | Re: zcat vs CamlZip |
Sam Steingold wrote:
> Bardur Arantsson wrote:
>> Sam Steingold wrote:
>>> I read through a huge *.gz file.
>>> I have two versions of the code:
>> [--snip--]
>>>
>>> let buf = Buffer.create 1024
>>> let gz_input_line gz_in char_counter line_counter =
>>> Buffer.clear buf;
>>> let finish () = incr line_counter; Buffer.contents buf in
>>> let rec loop () =
>>> let ch = Gzip.input_char gz_in in
>>
>> This is your most likely culprit. Any kind of "do this for every
>> character" is usually insanely expensive when you can do it in bulk.
>> (This is especially true when needing to do system calls, or if the
>> called function cannot be inlined.)
>>
>
> yes, I thought about it, but I assumed that the ocaml gzip module
> inlines Gzip.input_char (obviously the gzip module needs an internal
> cache so Gzip.input_char does not _always_ translate to a system call,
> most of the time it just pops a char from the internal buffer).
You can also easily try this in C with fgetc() contrasted with fgets().
The difference is _huge_ even if they both do comparable numbers of
syscalls -- assuming that the buffering is identical (I haven't checked,
but I think it is a reasonable assumption). In the C case, the inlining
is not really guaranteed, but I don't think it is in OCaml either --
though I honestly don't know. You'd have to check the assembler output
to see if the call gets inlined.
Inlining aside, memory prefetching probably also makes a difference in
favor of reading in bulk and then processing "in bulk".
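
To make the per-character vs. bulk comparison concrete, here is a minimal
sketch using only the standard library's plain channels (input_char vs.
input into a buffer). The function names and the 64 KiB chunk size are my
own choices for illustration. It does not touch the Gzip module at all, so
it only shows the call-overhead part of the story, not decompression cost.

```ocaml
(* Count newlines one character at a time: one function call per byte. *)
let count_newlines_per_char ic =
  let n = ref 0 in
  (try
     while true do
       if input_char ic = '\n' then incr n
     done
   with End_of_file -> ());
  !n

(* Count newlines in bulk: one call to [input] per chunk, then a tight
   loop over the bytes that were actually read. *)
let count_newlines_bulk ic =
  let buf = Bytes.create 65536 in
  let n = ref 0 in
  let rec loop () =
    let got = input ic buf 0 (Bytes.length buf) in
    if got > 0 then begin
      for i = 0 to got - 1 do
        if Bytes.get buf i = '\n' then incr n
      done;
      loop ()
    end
  in
  loop ();
  !n

(* Run [f] on a freshly opened channel and print the CPU time it used. *)
let time_it label path f =
  let ic = open_in_bin path in
  let t0 = Sys.time () in
  let result = f ic in
  let dt = Sys.time () -. t0 in
  close_in ic;
  Printf.printf "%s: %d newlines in %.3fs\n" label result dt;
  result
```

Calling time_it on a large file with each of the two counters should give
the same count; how far apart the timings are will depend on the compiler
and the machine, so measure rather than trust my guess.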
> at any rate, do you really expect that using Gzip.input and then
> searching the result for a newline, slicing and dicing to get the
> individual input lines, &c &c would be faster?
I would guess so, yes.
(There may of course be other reasons for a large portion of the
difference as others have pointed out.)
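
For what it's worth, the "slicing and dicing" need not be much code. Below
is a rough sketch, not tested against CamlZip itself: iter_lines is a
hypothetical helper that takes any function of the shape
read buf pos len (returning the number of bytes read, 0 at end of input)
and calls f on each line. The standard library's input applied to a
channel has that shape; I am assuming Gzip.input does too, so check the
signature in the version you have (older releases may take a string
rather than bytes).

```ocaml
(* [read buf pos len] must return the number of bytes read, or 0 at end
   of input. [f] is called once per line, without the trailing newline. *)
let iter_lines ?(chunk_size = 65536) read f =
  let chunk = Bytes.create chunk_size in
  (* Holds a partial line that continues into the next chunk. *)
  let pending = Buffer.create 1024 in
  (* Process the valid bytes chunk[start .. stop-1]. *)
  let rec scan start stop =
    match Bytes.index_from_opt chunk start '\n' with
    | Some i when i < stop ->
        Buffer.add_subbytes pending chunk start (i - start);
        f (Buffer.contents pending);
        Buffer.clear pending;
        scan (i + 1) stop
    | _ ->
        (* No newline in the valid region: keep the tail for later. *)
        Buffer.add_subbytes pending chunk start (stop - start)
  in
  let rec loop () =
    let got = read chunk 0 chunk_size in
    if got > 0 then begin
      scan 0 got;
      loop ()
    end
  in
  loop ();
  (* Emit a final line that had no trailing newline. *)
  if Buffer.length pending > 0 then f (Buffer.contents pending)
```

With a plain channel this would be used as
iter_lines (input ic) (fun line -> ...), and the line counter from the
original code can simply be incremented inside the callback.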
--
Bardur Arantsson
<bardurREMOVE@THISscientician.net>
- 'Blackmail' is such an ugly word. I prefer 'extortion'. The X
makes it sound cool.
Bender, 'Futurama'