OCaml mentioned in BraveGNUWorld
Date: -- (:)
From: Oliver Bandel <oliver@f...>
Subject: Re: [Caml-list] OCaml mentioned in BraveGNUWorld
On Wed, Mar 16, 2005 at 10:25:44PM +0100, Sven Luther wrote:
> On Tue, Mar 15, 2005 at 09:20:54AM +0100, Oliver Bandel wrote:
> > 
> > Hello,
> > 
> > a small tool I wrote (mbox-cleaner) was mentioned in
> > the current BraveGNUWorld.
> > OCaml was mentioned there too and it was written,
> > that "for a scripting language it's fast". ;-)
> > 
> > So maybe OCaml will be more known in the future, if people
> > would do more marketing on it. ;-)
> > 
> > (But other things written in that article were not all correct :(
> >  but ... well...... )
> > 
> > 
> > http://www.linux-magazin.de/Artikel/ausgabe/2005/04/gnuwelt/gnu.html
> > (in german only)
> > 
> > (I don't know why there have been no English translations of the BraveGNUWorld for many months ...)
> 
> Hi Oliver, ...
> 
> I read the article with interest, and i saw it mentioned spam-filtering too.

Yes. I mentioned spam filtering to Georg Greve, but as mentioned above,
some things were taken out of context, and therefore what was written in
the BraveGNUWorld may be confusing (and it seems it IS confusing).

I mentioned spam filtering to Georg Greve in two ways:
1.: With respect to the tool mbox-cleaner: it is useful for
    filtering out multiple mails with the same contents (duplicates),
    for example when you have a folder that serves as the spam folder.
    If you use mbox-cleaner to clean it up, it may shrink the spam folder
    to maybe 0.1 of its previous size or less, depending on how
    much spam with the same mail contents you received.
    Before I started filtering my mail account on allowed/disallowed mail
    addresses, I had tons of spam that went to random mail addresses.
    Over weeks/months I had many hundreds or many thousands (sic!) of mails
    with the same contents sent to many random addresses on my computer.
    So for a while I saved the spam in a special folder.
    I had written stuff in Perl that splits up mails into different files.
    Then I used my tool "multiple" to throw away files with the same body.
    Then I used cat to rebuild an mbox folder that contains only
    mails with a unique body.
   
    Now I can use mbox-cleaner to do that job.

    But as I now filter out unused mail addresses, this is no longer a typical
    usage of that tool for me, because the pressure from spam has
    shrunk to about 1% of what it was before. 10 or 20 spam mails a day is
    much less than 800 or 2000 spam mails a day...

    But when getting mails as PM as well as via a mailing list
    (e.g. Caml-list) and saving them in the same folder (tagging
    mails with "Caml-list" in the subject to sort them into the folder),
    I could use that tool to throw out the unnecessary stuff (same body of mail).

2.: My provider uses more than one tool to filter spam, and some can be configured
    by the user. One of these tools is websieve. But I hate the web GUI for that tool.
    I have to use regexps (and it seems that stuff doesn't work), and it's ugly
    to use regexps for every task when they do not provide things like
    character classes... that is, if I have to use regexps for matching instead of
    saying "filter out any characters with ASCII code > 127 or below
    32, but allow German umlauts".
    So something more high-level would be fine.
    So I thought about implementing a tool that can filter on different levels
    of abstraction.
    Not using regexps for everything... not using character codes for everything...
    not using character classes for everything... (and so on).
    Instead, using THAT description of the mail's contents
    that is best suited to match the stuff that should be filtered out.

    For example:
       If there is a lot of spam that contains a certain character
    or certain character codes, then it makes sense to filter out all such
    stuff based on charcodes. (E.g.: if I disabled UTF-8 and I get stupid
    characters on the screen when displaying the subject of the mail, I know
    it's <> ISO-8859-1, and if I don't want to accept mails with a different
    charset, I can filter them out.) So I thought about filtering on
    charclasses.
    => filtering on charcode / charcode sets
       --> who cares about a regexp description with octal notation,
           if I can say: "block all stuff that is <> ISO-8859-1" or
           "allow ISO-8859-1 as well as Japanese" or something like that.

       If there is a certain string (keywords/phrases) that is contained
    in the spam mails, then I ignore the charset and look at the contents
    of those mails (e.g. contents of the subject).
    => filtering on strings/phrases
       --> Who cares about a certain charcode, if it is possible to
           allow/disallow certain phrases (maybe directly encoded
           in a certain language charset)...


   And many other ways to describe text data in a higher-level way
   and in a style that depends on the spam data as well as the
   taste of the one who writes the filter rules...
   
disallow.languages.all # should not allow any char to pass the filter => always SPAM (like let f x  = true)
allow.language.english
allow.language.us
allow.language.german
allow.language.japanese
allow.charset.iso8859-1
disallow.charcode.above.127
disallow.charcode.below.32
disallow.charcode.equals.char.'\t'
allow.charcode.equals.char.'ä'
allow.charcode.equals.char.'ö'
allow.charcode.equals.char.'ü'
allow.charcode.equals.char.'Ä'
allow.charcode.equals.char.'Ö'
allow.charcode.equals.char.'Ü'
allow.charcode.equals.char.'ß'
allow.charcode.equals.char.'\n'
allow.charcode.equals.char.'\r'
disallow.stringmatch."viagra"
disallow.stringmatch."v.i.a.g.r.a"
disallow.subjectline.length.above.120
disallow.subjectline.length.below.3
disallow.subjectline.stringmatch.no_case."  !  !  look here "
disallow.subjectline.re_stringmatch.no_case.".*v.+i.+a.+g.+r.+a"
disallow.body.ask.spamassassin
disallow.body.ask.my_little_spam_filter_that_matches_the_rest # ;-)
disallow.body.ask.your_little_spam_filter_that_matches_the_rest_of_the_rest # ;-)


(or something like that)


And a newer rule overrides older rules (as in PostScript, where
new paint covers old paint).


Such an approach seems to be more practical for writing
ad-hoc spam filter rules than being constrained to
only classical regexp notation, using octal numbers to
encode chars that can't be matched by the regexps
in any other way.
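
The "newer rule overrides older rules" idea can be sketched in a few lines
of OCaml. This is only an illustration under assumptions: the types verdict
and rule and the functions char_verdict and is_spam are hypothetical names,
not part of any existing tool, and only character-code rules are modelled.

```ocaml
(* Hypothetical rule representation, for illustration only. *)
type verdict = Allow | Disallow

type rule = {
  verdict : verdict;
  applies : char -> bool;  (* does this rule say anything about the char? *)
}

(* Rules in the order they were written; later entries override earlier ones. *)
let rules = [
  { verdict = Disallow; applies = (fun _ -> true) };
  { verdict = Allow;
    applies = (fun c -> Char.code c >= 32 && Char.code c <= 127) };
  { verdict = Disallow; applies = (fun c -> c = '\t') };
  { verdict = Allow;    applies = (fun c -> c = '\n' || c = '\r') };
]

(* The last applicable rule decides, like new paint over old paint.
   Assumption: a character that no rule mentions is allowed. *)
let char_verdict rules c =
  List.fold_left
    (fun acc r -> if r.applies c then r.verdict else acc)
    Allow rules

(* Assumption: a text counts as spam as soon as one character is disallowed. *)
let is_spam rules text =
  let found = ref false in
  String.iter
    (fun c -> if char_verdict rules c = Disallow then found := true)
    text;
  !found
```

Because the rules are folded from first to last, adding a rule at the end of
the list is enough to change the outcome for the characters it covers.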




> 
> How does it compare/interact/whatever to Xavier's spamoracle tool, and do you

I don't know this tool.
I will google for it (or, if you have a link there...)


> think a generic mail library containing mail filtering code would be a good
> thing ? 

Well, I really think such a maillib would be a nice thing.

I'm currently writing a Mbox-module that supports filtering
mechanisms.
During the weekend I had a little bit of time and worked on it.


Filtering mail is easy with this lib.
It is specific to mbox files only because of the lexer (I used
the mbox lexer from my mbox-cleaner tool).
With a different lexer it should be easy to use it
to read/write different mailbox formats too.
OCaml is so powerful that this should be easy to do.


What is already working:

============================================================
  type mbox_t
  type mail_t
  type match_t = Case | NoCase (* case-sensitive/case insensitive pattern matching *)


(* ================================= *)
(* basic functions for mbox-handling *)
(* ================================= *)

  val read :           string -> mbox_t
  val show :           mbox_t -> unit
  val write_force :    mbox_t -> unit  
  val write :          mbox_t -> unit 

  val change_name :    mbox_t -> string -> mbox_t 
  (* maybe "renamed_copy" would be a better name...?! *)

  val append :         mbox_t -> mbox_t -> mbox_t
  val append_rename :  mbox_t -> mbox_t -> string -> mbox_t




(* these two can maybe be removed now... they helped me during development. *)
(* but maybe they help other developers (library users) too?! *)

val dummy_matcher :     mail_t -> bool (* always returns true *)
val dummy_nonmatcher :  mail_t -> bool (* always returns false *)



(* ============================ *)
(* FILTERing/MATCHing functions *)
(* ============================ *)

  val filter :         (mail_t -> bool) -> mbox_t -> mbox_t
  val filter_rename :  (mail_t -> bool) -> mbox_t -> string -> mbox_t


  val create_header_match :   (string -> bool) -> (mail_t -> bool)
  val create_body_match :     (string -> bool) -> (mail_t -> bool)

  val match_mail_contains :    char -> (mail_t -> bool)
  val match_header_contains :  char -> (mail_t -> bool)
  val match_body_contains :    char -> (mail_t -> bool)

  val match_mail_regexp :      string -> match_t -> (mail_t -> bool)
  val match_body_regexp :      string -> match_t -> (mail_t -> bool)
  val match_header_regexp :    string -> match_t -> (mail_t -> bool)


  val partition_rename :    (mail_t -> bool) -> mbox_t -> string -> string -> mbox_t * mbox_t 

(* let partition_rename matcher mbox name_matched name_unmatched = (...) *)

============================================================
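
To make the signatures above more concrete, here is a minimal sketch of how
a few of them could behave. The record definitions of mail_t and mbox_t are
toy stand-ins (the real types are abstract and not shown here), the function
bodies are assumptions, and contains_substring is a hypothetical helper.

```ocaml
(* Toy stand-ins for the abstract types; assumptions for illustration. *)
type mail_t = { header : string; body : string }
type mbox_t = { name : string; mails : mail_t list }

(* Lift a predicate on strings to a predicate on mails. *)
let create_header_match (p : string -> bool) : mail_t -> bool =
  fun m -> p m.header
let create_body_match (p : string -> bool) : mail_t -> bool =
  fun m -> p m.body

(* Keep only the matching mails; the mbox keeps its name. *)
let filter matcher mbox =
  { mbox with mails = List.filter matcher mbox.mails }

let filter_rename matcher mbox new_name =
  { name = new_name; mails = List.filter matcher mbox.mails }

(* Split into (matched, unmatched), each with its own name. *)
let partition_rename matcher mbox name_matched name_unmatched =
  let yes, no = List.partition matcher mbox.mails in
  ({ name = name_matched; mails = yes },
   { name = name_unmatched; mails = no })

(* Hypothetical helper: naive substring test. *)
let contains_substring ~sub s =
  let n = String.length sub and len = String.length s in
  let rec go i = i + n <= len && (String.sub s i n = sub || go (i + 1)) in
  go 0

(* Usage on a small in-memory mbox. *)
let inbox = { name = "inbox"; mails = [
  { header = "Subject: [Caml-list] hello"; body = "OCaml is fast" };
  { header = "Subject: offer"; body = "buy now" } ] }

let about_ocaml = create_body_match (contains_substring ~sub:"OCaml")

let matched, unmatched =
  partition_rename about_ocaml inbox "OCAML-matched" "OCAML-NON-MATCHED"
```

The point of the design is that every matcher is just a mail_t -> bool
function, so matchers built from chars, strings or regexps can all be passed
to the same filter and partition functions.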





What I think about implementing (interface may change a lot, because I have
to think about it in more detail):

============================================================


  val multi_partition :    (string * (mail_t -> bool)) list -> mbox_t -> mbox_t list


  val get_mails :     (mail_t -> bool) -> mbox_t -> (string * string) list

(* don't know if I provide these two *)
  val rewind :         mbox_t -> unit              (* => rewinds mbox-pointer to first mail in mbox *)
  val next_mail :      unit -> mail_t              (* => gets next mail from mbox *)


  val create_mail_match :     (string -> bool) -> mail_t -> bool

(* ################ *)
(* RegExp-Matchings *)
(* ################ *)
  val match_mail_contains_crange :    (char * char) -> (mail_t -> bool)
  val match_header_contains_crange :  (char * char) -> (mail_t -> bool)
  val match_body_contains_crange :    (char * char) -> (mail_t -> bool)

  val match_mail_contains_irange :    (int * int) -> (mail_t -> bool)
  val match_header_contains_irange :  (int * int) -> (mail_t -> bool)
  val match_body_contains_irange :    (int * int) -> (mail_t -> bool)

  val header_contains :  string -> mail_t -> bool
  val body_contains :    string -> mail_t -> bool

  val grep :        string -> mail_t -> string list
  val grep_nocase : string -> mail_t -> string list

OR better:
  val grep :        string -> match_t -> mail_t -> string list (* Case/NoCase *)


  val func :       (string -> 'a) -> (mail_t -> 'a)

(* iter/map/filter are preserving the order of the mails *)

  val iter :           (mail_t -> unit) -> mbox_t -> unit 
  val map :            (mail_t -> 'a) -> mbox_t -> 'a list

  val sort :           (string -> string -> int) -> mbox_t -> mbox_t
  
============================================================
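
Since the planned interface is not fixed yet, the following is only one
possible reading of multi_partition and the range matchers. Assumptions:
the toy record types below stand in for the abstract ones, rule names are
distinct, each mail goes to the first rule that matches it, and mails that
match no rule are dropped.

```ocaml
(* Toy stand-ins for the abstract types; assumptions for illustration. *)
type mail_t = { header : string; body : string }
type mbox_t = { name : string; mails : mail_t list }

(* True if some character of s lies in the inclusive range lo..hi. *)
let string_contains_crange (lo, hi) s =
  let found = ref false in
  String.iter (fun c -> if c >= lo && c <= hi then found := true) s;
  !found

let match_body_contains_crange range : mail_t -> bool =
  fun m -> string_contains_crange range m.body

(* Same as above, but with integer character codes.
   Char.chr raises Invalid_argument for codes outside 0..255. *)
let match_body_contains_irange (lo, hi) =
  match_body_contains_crange (Char.chr lo, Char.chr hi)

(* One result mbox per (name, matcher) pair, in rule order. *)
let multi_partition rules mbox =
  let first_match m = List.find_opt (fun (_, matcher) -> matcher m) rules in
  List.map
    (fun (name, _) ->
       let mails =
         List.filter
           (fun m ->
              match first_match m with
              | Some (n, _) -> n = name
              | None -> false)
           mbox.mails
       in
       { name; mails })
    rules
```

A "first match wins" policy is used here only to keep the sketch short; a
"last match wins" policy, as in the rule list further up, would be just as
easy to write.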


Examples of usage:

let filename = "sent-mail.experimental" in (* set mbox-filename *)
let input_mbox = Mbox.read filename in     (* reading mbox-file *)
let rematcher = Mbox.match_body_regexp ".*ocaml" Mbox.NoCase  (* create matcher-function *)
in
               (* next line: splitting the mbox into matching/nonmatching mails *)
  let result = Mbox.partition_rename rematcher input_mbox "OCAML-matched" "OCAML-NON-MATCHED" in
  print_endline "MATCHING OCAML:";
  Mbox.show (fst result);    (* prints resulting (matching) mbox to stdout *)
  print_endline "**NOT** MATCHING OCAML:";
  Mbox.show (snd result);    (* prints resulting (not matching) mbox to stdout *)
(* now write the new resulting mbox-files *)
  Mbox.write (fst result);   (* writes matching mbox (filename: "OCAML-matched") *)
  Mbox.write (snd result)    (* writes non-matching mbox (filename: "OCAML-NON-MATCHED") *)
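
The duplicate clean-up described in point 1 further up (keeping only mails
with a unique body) fits the same style. A minimal sketch, assuming a mail
is simply a record with a header and a body; the type mail and the function
unique_by_body are hypothetical and this is not the actual mbox-cleaner code.

```ocaml
(* Toy mail representation; an assumption for illustration. *)
type mail = { header : string; body : string }

(* Keep the first mail for every distinct body, preserving the order. *)
let unique_by_body mails =
  let seen = Hashtbl.create 64 in
  let keep acc m =
    if Hashtbl.mem seen m.body then acc
    else begin
      Hashtbl.add seen m.body ();
      m :: acc
    end
  in
  List.rev (List.fold_left keep [] mails)

let example = [
  { header = "To: a@example.org"; body = "buy now" };
  { header = "To: b@example.org"; body = "buy now" };
  { header = "To: c@example.org"; body = "hello" };
]

let cleaned = unique_by_body example
```

Only the bodies are compared, so the same spam text sent to many different
addresses collapses into one mail.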


When using Mbox.multi_partition (not implemented right now),
the tool gets extraordinary strength (IM(H?)O).

Then you can implement complex mail/mbox filtering mechanisms
with only a few lines of code. The library provides powerful
functions, and adapting some CLI stuff to the lib seems
to be the main task in implementing a lot of different small
mail/mbox-filtering tools (or a few, or one big one).


As you see above, I want to have matching mechanisms that
work on chars as well as ints as well as strings/regexps
to provide huge flexibility.

And that approach matches the things I want to
have in a Spam-detector-/blocker-tool. So writing my
Mbox-library can go into the direction of Spam-detection,
even if it's not the main intention right now.



Ciao,
   Oliver

P.S.: The charset operations may need to have UTF-8 functions inside
      OCaml. But IMHO that also could be done easily, because it seems
      to me (in very small example code) that OCaml-C binding really is
      easy/simple! :)
      So adding a unicode lib seems not to be a terrifying job.