New software: Markup - validating XML parser for O'Caml

From: Gerd Stolpmann (Gerd.Stolpmann@darmstadt.netsurf.de)
Date: Mon Aug 23 1999 - 19:19:35 MET DST


From: Gerd Stolpmann <Gerd.Stolpmann@darmstadt.netsurf.de>
To: caml-list@inria.fr
Subject: New software: Markup - validating XML parser for O'Caml
Date: Mon, 23 Aug 1999 19:19:35 +0200
Message-Id: <99082319583305.00491@schneemann>

Hi,

I've written something new that might be interestant for you: A validating XML
parser. Below you find the announcement.

I had some trouble with O'Caml's lexer and parser generators, and I think this
is a point of discussion.

First, generation of lexers and parsers implies a dependency of the lexer on
the parser, i.e. the parser is the logically preceding module. This does
absolutely not match the requirements of XML which includes a macro language.
If you reference a macro (XML calls them "entities"), the contents of the macro
must be inserted into the token stream coming from the lexer. This means that
the parser modifies the behaviour of the lexer, which means that the lexer
is the logically preceding module. I've tried to avoid this dependency, and the
minimal solution seems to be that at least the "token" type definition should
be separately generated in such cases. Illustration:

Current dependency (preceding modules first; "-->" means generation):

        parser.mly --> parser.mli, parser.ml
                         define the "token" type, and the parser functions

        lexer.mll --> lexer.ml
                         uses the "token" type (but not the parser functions)

Suggested dependency:

        parser.mly --> parser_token.mli, parser_token.ml
                         define only the "token" type

        lexer.mll --> lexer.ml
                         uses the "token" type

        parser.mly --> parser.mli parser.ml
                         define only the parser functions

Here, the code in parser.ml can now call functions of the lexer in order to
modify its behaviour.

ocamlyacc should have two new options: -only-token-def, -only-parser-funs. I'm
currently separating the two parts by some "sed" scripts, but this makes
assuptions about the implementation of ocamlyacc.

The second problem was that I wanted to make the whole parser polymorphic. As
an XML parser should be customizable this is an obvious requirement. My XML
parser uses side effects to store definitions; this seems not to be avoidable
unless you accept a totally ineffective parser. As already presented on this
mailing list (August, 13th), the problem can be solved if the module generated
by ocamlyacc is turned into a class with a polymorphic parameter (the other way
would be to use Obj.magic). I think this could be integrated into ocamlyacc,
too.

And now the ANNOUNCEMENT:

Markup is an experimental validating XML parser for O'Caml. It is designed
such that all of the requirements that the w3c demands of an XML parser can
be fulfilled, but this is not yet the case.

Beginning with the positive properties of this package, already the current
(initial) release should be able to parse all valid XML documents and DTDs as
long as they use only ISO-8859-1 characters. This has been tested with lots of
test documents, including all of James Clark's positive test records.

Once the document is parsed, it can be accessed using a class interface. This
interface is still under development and subject to future changes. The
interface allows arbitrary access including transformations. It has a so-called
"extension", i.e. every element of the document has a main object and the
extension. Although there is a default, the extension is thought as the
changeable part of the element class, i.e. you can provide your own extension
and add further properties to the elements.

Note that the class interface does not comply to the DOM standard. I think that
DOM is not applicable to O'Caml. Of course, my interface has a similar task.

Another design feature is that you can configure the parser such that it uses a
different class for every element type. As classes are not first-order objects
in O'Caml, this is done by an exemplar/instance scheme. A hashtable stores for
every element type the exemplar which is simply an object of the class to be
used. If the parser reads an element, it looks up the exemplar by the type of
the element, and forms a clone of it. -- This way, it is possible to group the
processing code for the elements as classes which seems to be very natural.

This distribution contains several examples:

- validate: simply parses a document and prints all error messages
   
- readme: Defines a DTD for simple "README"-like documents, and offers
   conversion to HTML and text files.
   
- xmlforms: This is already a sophisticated application that uses XML as style
   sheet language and data storage format. It shows how a Tk user interface can
   be configured by an XML style, and how data records can be stored using XML.
 
As usual, you can download Markup from:

        http://people.darmstadt.netsurf.de/Gerd.Stolpmann/ocaml/

Updates will be announced in the O'Caml link database,

        http://www.npc.de/ocaml/linkdb/

Gerd

--
----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 100             
64293 Darmstadt     EMail:   Gerd.Stolpmann@darmstadt.netsurf.de (privat)
Germany                     
----------------------------------------------------------------------------



This archive was generated by hypermail 2b29 : Sun Jan 02 2000 - 11:58:24 MET