Version française
Home     About     Download     Resources     Contact us    
Browse thread
[Caml-list] Ocaml-Weblib?
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Gerd Stolpmann <info@g...>
Subject: Re: [Caml-list] Ocaml-Weblib?

Am 2002.09.03 01:24 schrieb(en) Oliver Bandel:
> On Tue, 3 Sep 2002, Gerd Stolpmann wrote:
> 
> [...]
> > No, I have not yet found the time to do it. There are some aspects
> > that cannot be explained in mli files well enough, and a tutorial
> > would be a great thing. E.g. why ocamlnet has an object-oriented
> > layer on top of channels (netchannels). You find them everywhere,
> > but no introduction.
> > 
> > There is an "examples" directory containing some very simple, and
> > some advanced examples, especially for CGI programming.
> 
> Yes, I have seen it.
> The simple CGI-example is very nice. :)
> Looks like easy programming. :)
> The advanced examples are a littlebid too large for
> Ocaml-and-ocamlnet-beginners. ;-)
> 
> I first asked, because I want to write two different
> tools:
> 
> first one: a wget-like tool, which can parse the html-pages
> (if possible, this f..... javascript-stuff too) and
> can download not only recursively, but can also
> select the pages for download e.g. by pattern-matching
> on href-tags (url or text of the link) or by selection
> of filesizes or so.

HTML parsing can be done with Nethtml. Simple example:

Nethtml.parse 
  (new Netchannels.input_string "<HTML><HEAD>...</HEAD><BODY>...</BODY></HTML>")

Returns something like

[ Element("html",[], [ Element("head",[], [ ... ]);
                       Element("body",[], [ ... ]) ) ]

just try it in the toploop.

If there are attributes (e.g. <BODY BGCOLOR="#EEFF54">) you get them
instead of [], e.g. Element("body",["bgcolor", "#EEFF54"], [ ... ]).
Note that all names are returned in lower-case.

If you want to read from a file instead of a string, just use

  new Netchannels.input_channel ch 

instead of input_string (where ch is an open in_channel).

There is no parser for javascript.

To download the HTML pages you can use netclient (distributed
separately). Simple example:

Http_client.Convenience.http_get "http://caml.inria.fr"

returns the contents of this location. If you need the HTTP headers
(sometimes announcing file sizes), you can use

let m = Http_client.Convenience.http_get_message "http://caml.inria.fr" in
let contents = m # get_resp_body () in
let size = m # assoc_resp_header "content-length" in ... (* may raise Not_found *)

If you use http_head_message, only the headers are requested from the server,
so you have the chance to determine the file size before downloading. In my
own experiments I found out that there are HTTP servers that handle HEAD like
GET causing protocol errors. So be prepared that you can get a strange exception
when you call http_head_message.
> 
> And the second tool I wanted to write was a similar tool
> for nntp-protocol: Download by attributes (size, date,
> MsgID, Subject, author, thread-length, ...).
> (I once wrote such stuff (not completed) in Perl
>  and after the program grew more and more, it
>  becomes more and more a mess...).

As far as I know there is no ready-to-use NNTP client. There are important
components for an NNTP client, though. For example, there are parsers for
messages in email format, and there is the working implementation for the
POP protocol that has some similarities.

To parse an email message, you can call Netmime.read_mime_message, e.g.

Netmime.read_mime_message
  (new Netchannels.input_string "subject: xxx\nsize: 50\n...\n\nbody")

This returns a structure like

  (header, `Body(body))

where "header" and "body" are objects:

  header # field "subject"    returns "xxx"
  header # field "size"       returns "50"
  body # value                returns "body"

Note that Netmime.read_mime_message decodes multipart messages by default,
and you can also get something like

  (header, `Parts [ (part1_header, `Body part1_body); ... (partN_header, `Body partN_body)]

or even deeper nested structures. You can control this by passing the argument
multipart_style.

There is also the function Netmime.read_mime_header returning only the
header, but it is a bit more complicated to use. To parse a string:

let header = Netmime.read_mime_header 
               (new Netstream.input_stream
                  (new Netchannels.input_string "..."))

There is another object involved (input_stream) that has no effect if you
read only from a string, but that allows you to read the header from 
non-seekable files (e.g. pipelines or sockets). But this is definitely
a feature for experts.

> 
> So I need access to sockets, some low-level stuff
> (Unix.read) and such, or a good library, which helps
> here. 

See the sources in netpop.ml for an example how to write a "telnet-style"
client. Note that netpop.ml does not use sockets, it expects that the
user of this module passes channels that are already connected sockets.
See the example mbox_list.ml for the socket stuff (very simple).

>I need a library, which can parse me the
> html-pages and maybe nntp-headers, and I want only
> to implement the logic of the tool, and let the
> network stuff programming be the work, that the
> lib can do.
> 
> And I hope the ocamlnet/netstring can help here.
> But if it will be more effort to understand the library
> than writing the networking-code by myself, then
> I will write the sockets-stuff by myself.

I hope this short introduction gives you the right impression of
the library.

Gerd
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany 
gerd@gerd-stolpmann.de          http://www.gerd-stolpmann.de
------------------------------------------------------------
-------------------
To unsubscribe, mail caml-list-request@inria.fr Archives: http://caml.inria.fr
Bug reports: http://caml.inria.fr/bin/caml-bugs FAQ: http://caml.inria.fr/FAQ/
Beginner's list: http://groups.yahoo.com/group/ocaml_beginners