mboxlib reloaded ;-)
| From: | Oliver Bandel <oliver@f...> |
| Subject: | Re: [Caml-list] mboxlib reloaded ;-) |
On Sat, Apr 28, 2007 at 10:54:06AM +1000, skaller wrote:
> On Sat, 2007-04-28 at 01:12 +0200, Oliver Bandel wrote:
>
> > So, I then checked my mboxlib and saw that it is quite slow,
> > compared to what I expected (expect! I did not try it
> > on my development machine because I have no mutt installed there),
> > and even if native code is much faster, it's nevertheless slow...
> > ...so I thought I have to redesign my scanner stage.
> > (I use the Str module and ocamllex mixed together; maybe
> > using a plain self-written OCaml scanner might be better here.)
>
> Ocamllex generates very fast scanners: it is using
> a very high-tech tagged deterministic finite state automaton
> with a driver written in C (so no boxing etc. processing
> text buffers). I doubt you can hand code anything as
> fast as Ocamllex in C, let alone in OCaml.

I know that ocamllex is fast. But I call ocamllex many, many times
from my own functions, and this could maybe be done more elegantly,
with fewer calls to ocamllex. Or maybe I should not lex directly from
the channel, but rather read a bigger chunk of data into memory and
then lex on that. Or maybe I should first scan the whole header and
then the body of each mail, and only afterwards scan the header again
into separate lines, when it is already in RAM.

>
> You should check the size (number of states) of the generated
> lexer.

How?

> It will run faster with small number of states where
> the matrix fits easily in the cache.

I think that there are not so many states, but so many calls.
And maybe creating a list of header entries is faster than
creating strings with the Buffer module, because I always call
Buffer.add_string and so on and so on, instead of putting the
line onto a list.

For the mbox of about 100 MB there are 2.5 * 10^6 calls to
Buffer.add_string for the header and 1.6 * 10^6 calls to
Buffer.add_string for the body, 2.6 * 10^6 calls to the function
Lexing.engine, ...

I had better not read line by line, it seems.
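To make the "read a bigger chunk, then split in memory" idea concrete, here is a minimal sketch. It is not the mboxlib code: `read_all` and `split_message` are hypothetical names, the splitting is done with plain string functions instead of ocamllex, and it assumes LF line endings and OCaml 4.08 or later (for `Fun.protect`). Header lines are consed onto a list rather than appended with Buffer.add_string.

```ocaml
(* Read a whole file into one string, instead of pulling it
   line by line from the channel. *)
let read_all (path : string) : string =
  let ic = open_in_bin path in
  Fun.protect
    ~finally:(fun () -> close_in ic)
    (fun () -> really_input_string ic (in_channel_length ic))

(* Split one message that is already in memory into its header lines
   and its body. The header ends at the first empty line.
   Assumption: lines end in '\n' only (no CRLF handling). *)
let split_message (msg : string) : string list * string =
  let len = String.length msg in
  let rec loop pos acc =
    if pos >= len then (List.rev acc, "")
    else
      let eol =
        match String.index_from_opt msg pos '\n' with
        | Some i -> i
        | None -> len
      in
      if eol = pos then
        (* empty line: everything after it is the body *)
        (List.rev acc, String.sub msg (pos + 1) (len - pos - 1))
      else
        (* one header line: put it onto the list *)
        loop (eol + 1) (String.sub msg pos (eol - pos) :: acc)
  in
  loop 0 []
```

With an ocamllex-generated lexer, the same in-memory string could be handed over via Lexing.from_string instead of Lexing.from_channel; whether that alone reduces the number of calls would have to be measured.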
And there may be other reasons why it is slow.
I let the lexer read line by line and count the line numbers.
That is because I throw an exception when I detect a broken
mbox file (when an mbox file ends in the middle of a header).
So maybe I do too much, and too often.

I think there are too many calls, not too many states in the lexer.
(But you could argue that both things are closely related.)

Ciao,
   Oliver
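One way the line counting could be taken out of the hot path, as a sketch only: search for the end of the header in an in-memory string and compute the line number only when the file turns out to be broken. `Broken_mbox`, `line_of_offset` and `header_end` are hypothetical names, not part of mboxlib, and LF line endings are assumed.

```ocaml
(* Raised with the line number at which the data ended unexpectedly. *)
exception Broken_mbox of int

(* 1-based line number of byte offset [upto] in [s].
   Only called on the error path, so the normal path pays nothing. *)
let line_of_offset (s : string) (upto : int) : int =
  let n = ref 1 in
  for i = 0 to (min upto (String.length s)) - 1 do
    if s.[i] = '\n' then incr n
  done;
  !n

(* Offset just past the blank line that ends the header starting at
   [start]. Raises Broken_mbox if the data ends inside the header. *)
let header_end (s : string) (start : int) : int =
  let len = String.length s in
  let rec find i =
    if i + 1 >= len then raise (Broken_mbox (line_of_offset s len))
    else if s.[i] = '\n' && s.[i + 1] = '\n' then i + 2
    else find (i + 1)
  in
  find start
```

This trades a second pass over the data in the (rare) error case for not counting anything in the normal case.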