Parsing with two scanners(ources) as input (?)
From: Oliver Bandel <oliver@f...>
Subject: Parsing with two scanners(ources) as input (?)

I want to parse HTML, and by that I mean I want to parse the
structure of the tags as well as the contents of the data elements.

At the moment I'm hacking together a special-purpose parser for this case,
but it is getting ugly, because I have to hand-code the parser's
state machine.

It would be easier and more elegant if I could combine the
HTML tag parsing with the text parsing of the data elements.

For HTML-parsing I use Nethtml.
For Text-Scanning I use Pcre.

I want to be able to select certain tags and text that will occur at  
certain positions.

For detecting the found tags I look for Nethtml's
   Element (name, args, subnodes)
and for detecting the data-strings I look into Nethtml's
   Data string
with Pcre.
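
To make that concrete, here is a minimal sketch of the kind of selection I
mean. It is not my actual code. The document type is a local copy of the
shape of Nethtml's type, so the example compiles without ocamlnet, and
`contains` is a hypothetical plain-substring stand-in for the Pcre match.
The names `select`, `text_ok` and the sample document are made up for
illustration.

```ocaml
(* Local copy of the shape of Nethtml.document, so that the sketch
   compiles without ocamlnet installed. *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Hypothetical stand-in for a Pcre test: true if [sub] occurs in [s]. *)
let contains ~sub s =
  let n = String.length sub and m = String.length s in
  let rec go i = i + n <= m && (String.sub s i n = sub || go (i + 1)) in
  go 0

(* Walk the tree and remember the names of the enclosing tags.
   [text_ok] sees the path (outermost tag first) and the data string;
   every accepted string is returned together with its path. *)
let select ~text_ok docs =
  let rec go path docs =
    List.concat_map
      (function
        | Element (name, _args, subnodes) -> go (name :: path) subnodes
        | Data s ->
            let p = List.rev path in
            if text_ok p s then [ (p, s) ] else [])
      docs
  in
  go [] docs

(* Made-up sample document. *)
let doc =
  [ Element ("table", [], [
      Element ("tr", [], [
        Element ("td", [], [ Data "Price: 42 EUR" ]);
        Element ("td", [], [ Data "n/a" ]) ]) ]);
    Element ("p", [], [ Data "Price: unrelated" ]) ]

(* Keep only data strings inside a td that mention "Price:". *)
let hits =
  select
    ~text_ok:(fun path s -> List.mem "td" path && contains ~sub:"Price:" s)
    doc
```

With this, the tag condition and the text condition sit in one predicate,
but the tree walk is still hand-written, which is the part I would like to
get rid of.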

I would like to find certain data that occurs after certain
sequences in the tree and then look for certain strings inside that
data.

Any idea on how to create the parser?

I thought about somehow wrapping the parsed tree and feeding it to ocamlyacc.
Maybe menhir is better suited for that task?
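
A rough sketch of the wrapping I have in mind, under some assumptions: the
token type below is hypothetical (with ocamlyacc or menhir it would come
from the grammar's %token declarations instead), the document type is again
a local copy of the shape of Nethtml's type, and `tokens_of`,
`lexer_of_list` and `texts_after_tr_td` are names I made up. As far as I
know, an ocamlyacc entry point expects a function of type
Lexing.lexbuf -> token, so a function of that shape that ignores the
lexbuf should be enough to feed it a precomputed token list.

```ocaml
(* Local copy of the shape of Nethtml.document. *)
type document =
  | Element of (string * (string * string) list * document list)
  | Data of string

(* Hypothetical token type for the flattened tree. *)
type token = OPEN of string | CLOSE of string | TEXT of string | EOF

(* Flatten the tree into a token list in document order. *)
let rec tokens_of docs =
  List.concat_map
    (function
      | Element (name, _args, subnodes) ->
          (OPEN name :: tokens_of subnodes) @ [ CLOSE name ]
      | Data s -> [ TEXT s ])
    docs

(* Lexer-shaped function: ignores the lexbuf and hands out the
   precomputed tokens one by one, then EOF forever. *)
let lexer_of_list toks =
  let rest = ref toks in
  fun (_ : Lexing.lexbuf) ->
    match !rest with
    | [] -> EOF
    | t :: tl ->
        rest := tl;
        t

(* Without any parser generator: data strings that directly follow the
   tag sequence <tr><td> in the flat stream. *)
let rec texts_after_tr_td = function
  | OPEN "tr" :: OPEN "td" :: TEXT s :: rest -> s :: texts_after_tr_td rest
  | _ :: rest -> texts_after_tr_td rest
  | [] -> []

(* Made-up sample row with two cells. *)
let row =
  [ Element ("tr", [], [
      Element ("td", [], [ Data "first" ]);
      Element ("td", [], [ Data "second" ]) ]) ]

let toks = tokens_of row
let firsts = texts_after_tr_td toks
```

On the flat stream a tag sequence followed by text becomes an ordinary
token pattern, so a grammar rule could describe it. The strings inside a
TEXT token would still have to be scanned separately, for example with
Pcre.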

At the moment I use the Element match only to call the recursive
parser on the next doclist.
All my actual parsing happens in the Data match, where I inspect the contents.
This is because the information I want to parse out of the document
is flat text inside that data string.
But some of that information could also be found via tag sequences.

So I'm looking for a way to combine both approaches.

How to do it?