Version française
Home     About     Download     Resources     Contact us    
Browse thread
New software: Markup - validating XML parser for O'Caml
[ Home ] [ Index: by date | by threads ]
[ Search: ]

[ Message by date: previous | next ] [ Message in thread: previous | next ] [ Thread: previous | next ]
Date: -- (:)
From: Gerd Stolpmann <Gerd.Stolpmann@d...>
Subject: New software: Markup - validating XML parser for O'Caml
Hi,

I've written something new that might be interestant for you: A validating XML
parser. Below you find the announcement.

I had some trouble with O'Caml's lexer and parser generators, and I think this
is a point of discussion. 

First, generation of lexers and parsers implies a dependency of the lexer on
the parser, i.e. the parser is the logically preceding module. This does
absolutely not match the requirements of XML which includes a macro language.
If you reference a macro (XML calls them "entities"), the contents of the macro
must be inserted into the token stream coming from the lexer. This means that
the parser modifies the behaviour of the lexer, which means that the lexer
is the logically preceding module. I've tried to avoid this dependency, and the
minimal solution seems to be that at least the "token" type definition should
be separately generated in such cases. Illustration:

Current dependency (preceding modules first; "-->" means generation):

	parser.mly  -->  parser.mli, parser.ml
                         define the "token" type, and the parser functions

        lexer.mll    --> lexer.ml
                         uses the "token" type (but not the parser functions)

Suggested dependency:

	parser.mly   --> parser_token.mli, parser_token.ml
                         define only the "token" type

        lexer.mll    --> lexer.ml
                         uses the "token" type

        parser.mly   --> parser.mli parser.ml
                         define only the parser functions

Here, the code in parser.ml can now call functions of the lexer in order to
modify its behaviour.

ocamlyacc should have two new options: -only-token-def, -only-parser-funs. I'm
currently separating the two parts by some "sed" scripts, but this makes
assuptions about the implementation of ocamlyacc.

The second problem was that I wanted to make the whole parser polymorphic. As
an XML parser should be customizable this is an obvious requirement. My XML
parser uses side effects to store definitions; this seems not to be avoidable
unless you accept a totally ineffective parser. As already presented on this
mailing list (August, 13th), the problem can be solved if the module generated
by ocamlyacc is turned into a class with a polymorphic parameter (the other way
would be to use Obj.magic). I think this could be integrated into ocamlyacc,
too.

And now the ANNOUNCEMENT:

Markup is an experimental validating XML parser for O'Caml. It is designed 
such that all of the requirements that the w3c demands of an XML parser can 
be fulfilled, but this is not yet the case. 

Beginning with the positive properties of this package, already the current 
(initial) release should be able to parse all valid XML documents and DTDs as 
long as they use only ISO-8859-1 characters. This has been tested with lots of 
test documents, including all of James Clark's positive test records. 

Once the document is parsed, it can be accessed using a class interface. This 
interface is still under development and subject to future changes. The 
interface allows arbitrary access including transformations. It has a so-called 
"extension", i.e. every element of the document has a main object and the 
extension. Although there is a default, the extension is thought as the 
changeable part of the element class, i.e. you can provide your own extension 
and add further properties to the elements. 

Note that the class interface does not comply to the DOM standard. I think that 
DOM is not applicable to O'Caml. Of course, my interface has a similar task. 

Another design feature is that you can configure the parser such that it uses a 
different class for every element type. As classes are not first-order objects 
in O'Caml, this is done by an exemplar/instance scheme. A hashtable stores for 
every element type the exemplar which is simply an object of the class to be 
used. If the parser reads an element, it looks up the exemplar by the type of 
the element, and forms a clone of it. -- This way, it is possible to group the 
processing code for the elements as classes which seems to be very natural. 

This distribution contains several examples:

-  validate: simply parses a document and prints all error messages 
   
-  readme: Defines a DTD for simple "README"-like documents, and offers 
   conversion to HTML and text files. 
   
-  xmlforms: This is already a sophisticated application that uses XML as style 
   sheet language and data storage format. It shows how a Tk user interface can 
   be configured by an XML style, and how data records can be stored using XML. 
 
As usual, you can download Markup from:

	http://people.darmstadt.netsurf.de/Gerd.Stolpmann/ocaml/

Updates will be announced in the O'Caml link database, 

	http://www.npc.de/ocaml/linkdb/


Gerd

--
----------------------------------------------------------------------------
Gerd Stolpmann      Telefon: +49 6151 997705 (privat)
Viktoriastr. 100             
64293 Darmstadt     EMail:   Gerd.Stolpmann@darmstadt.netsurf.de (privat)
Germany                     
----------------------------------------------------------------------------