Parsing with two scanners(ources) as input (?)
- Oliver Bandel
Date: 2010-04-13 (19:15)
From: Oliver Bandel <oliver@f...>
Subject: Parsing with two scanners(ources) as input (?)
Hello,

I want to parse HTML, and by that I mean I want to parse the structure of the tags as well as the contents of the data elements.

At the moment I'm hacking together a special parser for this case, but it's ugly, because I need to hand-code the parser's state machine. It would be easier and more elegant if I could combine the HTML tag parsing with the text parsing of the data elements.

For HTML parsing I use Nethtml. For text scanning I use Pcre.

I want to be able to select certain tags, and text that occurs at certain positions. To detect tags I match on Nethtml's Element (name, args, subnodes), and to detect the data strings I look into Nethtml's Data string with Pcre.

I would like to find certain data that occurs after certain sequences in the tree, and then look for certain strings inside those Data strings.

Any idea on how to create the parser? I thought about somehow wrapping the stuff and giving it to ocamlyacc. Maybe menhir is better for that task?

At the moment I use the Element match just to call the recursive parser on the next doclist. All my parsing uses the Data match and looks at the contents there. This is because the information I want to parse out of the document is flat text inside that data string. But some of that information could also be found via tag sequences.

So I'm looking for a way to combine both approaches. How to do it?

Oliver
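To make the question concrete, here is a minimal sketch of one way the two levels might be combined: flatten the tree into a single token stream that carries both tags and text, match tag sequences on that stream, and only then scan the text that was found. Everything here is an assumption for illustration. The `document` type is a local copy of the shape of Nethtml's Element/Data constructors so the sketch compiles without OCamlnet, the `token` type and all function names are hypothetical, and `split_field` is a plain-stdlib stand-in for the Pcre step. A token stream like this is also roughly what would have to be wrapped for ocamlyacc or menhir.

```ocaml
(* Local mirror of the shape of Nethtml's document type, so that this
   sketch compiles with the standard library only. *)
type document =
  | Element of string * (string * string) list * document list
  | Data of string

(* Hypothetical unified token type: tags and text in one stream. *)
type token =
  | Open of string * (string * string) list
  | Close of string
  | Text of string

(* Flatten a document list into tokens, in document order. *)
let rec tokens_of_docs docs = List.concat (List.map tokens_of_doc docs)

and tokens_of_doc = function
  | Data s -> [ Text s ]
  | Element (name, args, subs) ->
    (Open (name, args) :: tokens_of_docs subs) @ [ Close name ]

(* If the stream starts with opening tags named as in [path], directly
   followed by a text token, return that text. *)
let rec text_after_path path toks =
  match path, toks with
  | [], Text s :: _ -> Some s
  | p :: ps, Open (name, _) :: rest when p = name -> text_after_path ps rest
  | _ -> None

(* Slide over the stream and collect every text that follows [path]. *)
let rec find_all path toks =
  match toks with
  | [] -> []
  | _ :: rest ->
    (match text_after_path path toks with
     | Some s -> s :: find_all path rest
     | None -> find_all path rest)

(* Stand-in for the Pcre step: split "key: value" at the first colon. *)
let split_field s =
  match String.index_opt s ':' with
  | None -> None
  | Some i ->
    let key = String.sub s 0 i in
    let value = String.sub s (i + 1) (String.length s - i - 1) in
    Some (String.trim key, String.trim value)

(* Made-up example document. *)
let example =
  [ Element ("table", [], [
      Element ("tr", [], [
        Element ("td", [], [ Data "Name: foo" ]);
        Element ("td", [], [ Element ("b", [], [ Data "Date: 2010-04-13" ]) ]);
      ]);
    ]) ]

let toks = tokens_of_docs example

(* Text directly inside <td><b>, then split into key and value. *)
let bold_fields = List.filter_map split_field (find_all [ "td"; "b" ] toks)

(* Text directly inside <td>, without any further nesting. *)
let plain_cells = find_all [ "td" ] toks
```

With this layout the tag-sequence matching and the string matching stay separate functions but work on the same input, so a rule like "the text after td, b" no longer needs a hand-written state machine per case. Whether a generated parser is worth it on top of this probably depends on how many such rules there are.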