mboxlib reloaded ;-)
Date: -- (:)
From: Oliver Bandel <oliver@f...>
Subject: Re: [Caml-list] mboxlib reloaded ;-)
On Sat, Apr 28, 2007 at 10:54:06AM +1000, skaller wrote:
> On Sat, 2007-04-28 at 01:12 +0200, Oliver Bandel wrote:
> 
> > So, I then checked my mboxlib and saw that it is quite slow,
> > compared to what I expected (expected! I did not try it
> > on my development machine because I have no mutt installed there)
> > and even if native code is much faster, it's nevertheless slow...
> > ...so I thought I have to redesign my scanner stage.
> > (I use the Str module and ocamllex mixed together; maybe
> >  using a plain self-written OCaml scanner might be better here).
> 
> Ocamllex generates a very fast scanner: it is using
> a very high-tech tagged deterministic finite state automaton
> with a driver written in C (so no boxing etc. when processing
> text buffers). I doubt you can hand code anything as
> fast as Ocamllex in C, let alone in OCaml.

I know that ocamllex is fast.

But I call the ocamllex-generated lexer many, many times from my
own functions, and this could perhaps be done
more elegantly / with fewer calls to ocamllex.
Or maybe I should not lex directly from the channel,
and instead read a bigger chunk of data
into memory and then lex on that.
Or maybe I should first scan the whole header and
then the body of each mail, and only afterwards
split the header into separate lines,
when it is already in RAM.
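
A minimal sketch of the "read it into memory first" idea, assuming a
recent OCaml (Fun.protect needs 4.08 or later). The names read_file and
split_message are hypothetical and not part of mboxlib; the splitter
treats the first blank line as the header/body boundary and does not
handle folded header lines.

```ocaml
(* Read a whole file into one string with a single input call.
   Hypothetical helper, not mboxlib code. *)
let read_file (path : string) : string =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in_noerr ic)
    (fun () -> really_input_string ic (in_channel_length ic))

(* Split one message, already in memory, into header lines and body.
   The boundary is the first empty line ("\n\n"). *)
let split_message (msg : string) : string list * string =
  let len = String.length msg in
  let rec find i =
    if i + 1 >= len then None
    else if msg.[i] = '\n' && msg.[i + 1] = '\n' then Some i
    else find (i + 1)
  in
  match find 0 with
  | None -> (String.split_on_char '\n' msg, "")
  | Some i ->
    let header = String.sub msg 0 i in
    let body = String.sub msg (i + 2) (len - i - 2) in
    (String.split_on_char '\n' header, body)
```

A generated lexer could then be run on the in-memory data with
Lexing.from_string (read_file path) instead of Lexing.from_channel.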


> 
> You should check the size (number of states) of the generated
> lexer.

How?

> It will run faster with small number of states where
> the matrix fits easily in the cache.

I think that there are not so many states, but so many calls.

And maybe creating a list of header entries is faster than
building strings with the Buffer module, because I always call
Buffer.add_string and so on and so on, instead of putting
the line onto a list.
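
The two ways of accumulating lines could be sketched like this. Both
functions are hypothetical illustrations, not mboxlib code, and next_line
stands in for whatever yields the next line; which variant is faster
would have to be measured.

```ocaml
(* Variant A: append every line to a Buffer as it arrives. *)
let join_with_buffer (next_line : unit -> string option) : string =
  let buf = Buffer.create 1024 in
  let rec loop () =
    match next_line () with
    | None -> Buffer.contents buf
    | Some l ->
      Buffer.add_string buf l;
      Buffer.add_char buf '\n';
      loop ()
  in
  loop ()

(* Variant B: cons each line onto a list and reverse once at the end,
   so the header entries stay separate strings. *)
let collect_lines (next_line : unit -> string option) : string list =
  let rec loop acc =
    match next_line () with
    | None -> List.rev acc
    | Some l -> loop (l :: acc)
  in
  loop []
```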

For the roughly 100 MB mbox there are 2.5 * 10^6 calls
to Buffer.add_string for the header and 1.6 * 10^6 calls
to Buffer.add_string for the body, 2.6 * 10^6 calls to the
function Lexing.engine, ...

It seems I had better not read line by line.


And there may be other reasons why it is slow.
I let the lexer read line by line and count the line numbers.
That is because I throw an exception when I detect a
broken mbox file (when an mbox file ends in the middle
of a header).
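
The line counting and the broken-file check can be sketched as follows,
working on a list of lines instead of a lexer. The exception Broken_mbox
and the function read_header are hypothetical names, and the assumption
is that a header ends at the first empty line.

```ocaml
(* Carries the number of the last line read before the input ended. *)
exception Broken_mbox of int

(* Consume header lines until the empty separator line.
   Returns the header lines, the remaining lines, and the line
   number of the first remaining line. *)
let read_header (start_line : int) (lines : string list)
  : string list * string list * int =
  let rec loop n acc = function
    | [] -> raise (Broken_mbox (n - 1))  (* input ended inside a header *)
    | "" :: rest -> (List.rev acc, rest, n + 1)
    | l :: rest -> loop (n + 1) (l :: acc) rest
  in
  loop start_line [] lines
```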

So maybe I do too much, and too often.
I think there are too many calls, not too many
states in the lexer.

(But you could argue that the two are closely related.)


Ciao,
   Oliver