Version française
Home     About     Download     Resources     Contact us    
Browse thread
mboxlib reloaded ;-)
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Oliver Bandel <oliver@f...>
Subject: Re: [Caml-list] mboxlib reloaded ;-)
On Sat, Apr 28, 2007 at 11:49:51PM +1000, skaller wrote:
> On Sat, 2007-04-28 at 13:44 +0200, Oliver Bandel wrote:
> > On Sat, Apr 28, 2007 at 12:54:53PM +0200, Gabriel Kerneis wrote:
> > > Le Sat, 28 Apr 2007 12:47:47 +0200, Oliver Bandel
> > > <oliver@first.in-berlin.de> a écrit :
> > > > > You should check the size (number of states) of the generated
> > > > > lexer.
> > > > 
> > > > How?
> > > 
> > > It's printed out by ocamllex when you run it on you .mll file.
> > > Regards,
> > 
> > Ah, ok. :)
> > 
> > 
> > 18 states, 261 transitions, table size 1152 bytes.
> > 
> > Does not loooks very huge ;-)
> 
> Lol, no it is tiny. You are probably right, too many calls,
> and too much copying data around. AFAIK Ocaml channels also
> add an extra buffer layer (is that right?) so there's even
> more copying.
> 
> Still, although Ocaml may generate more code than C,
> if your code is reasonably tight it should be cached
> and be fast: function calls are actually quite cheap.
> 
> Here's an idea: you said:
> 
> "For the about 100MB mbox there are 2.5 * 10^6 calls to
> to Buffer.add_string for the header and 1.6 * 10^6 calls
> to Buffer.add_string for the body, 2.6*10^6 calls to the
> function lexing.engine, ..."
> 
> How about NOT storing the body text. Instead, just store
> the integer file offset of the first byte and the length?
[...]

This is what I have foreseen for different kinds of
application. In the parts I have implemented already,
I use completely reading of the data, because
this would be the most general way of handling.
I will be able to do fulltext-research and possibly
very complex analysis of the mailtext, and this is the reason
why I read all things into memory.

To use file-oiffsets would make sense, when the main
functionality would be that of a mailreader, where
you might look into the one or the other mail.
Then reading from disk is not a problem.

But when you want to be able to do a lot of different things,
then always again and again reading from disk might
be much slower than once reading data into memory
and then working on it.

But maybe I provide such a functionality also,
and only read to memory, when I need it.

The main intend of the lib (ys, a library, not a certain
application) was to have a tool at hand, that makes
complex things achievable easy.

Example:


=========================================================
open Mbox

let filename = "./MB1"

let _ =
     print_endline "OK, let's start!";
     let sm = read filename in
     let mymatch = match_header_regexp ".*richard" NoCase
       in
         let result = filter_rename mymatch sm "./MB1.ready"
         in
           write_force result;

     print_endline "OK, ready! :)"

=========================================================

Opens the mbox-file "MB1" and writes
all mails with the "richard"-string in the header
to the mbox-file "MB1.ready".

It could also be used for spam-detection,
word-counts, textanalysis, automatic rearranging of
mbox-files, throwing away multiple mails,
or saving the same mail in multiple folders,
if they belong to more than one theme.
Or doing complex data analysis on the mails.
E.g. for statistical reasons of text-analysis,
maybe similarity-calculations of texts, ...



ocamlc:  31.4 seconds
ocamopt:  7.7 seconds

Doing an exit after reading-stage:
ocamlc:   7.0 seconds
ocamlopt: 2.5 seconds


So, the pattern-matching (using Str-module)
takes much more time than the reading-/scanning.

It's about 100 MB mbox with mostly short mails.
(I could provide more statistical infos, using this lib :))

Doing a 
  $ time cat MB1 > MB2
needs 0.2 seconds

 $ time grep -i richard MB1 > MB2
needs 0.148 seconds.

OK, such flat-file applications are always faster,
as they do not read the data to memory and they do no
checks on the validity of the files (no exception on
broken mbox-files).

But maybe it could be done better.

Parsing-stage:
  Str-module not seems to be the fastest, and ocamllex
  creats ml-files that first have to be compiled....

  Is there no certain library in OCaml, which can be used?
  Or do it have to be developed?

  I think on such a kind of thing:
    http://swtch.com/~rsc/regexp/regexp1.html


  A thing like a runtime-ocamllex, that creates datastructures
  instead of ocaml-code, would make sense, IMHO.

  IS No such thing available already?

Ciao,
   Oliver