XML library for validating MathML
Date: 2008-09-18 (11:51)
From: Gerd Stolpmann <info@g...>
Subject: Re: [Caml-list] XML library for validating MathML

On Thursday, 18.09.2008, at 10:12 +0100, Till Varoquaux wrote:
> PXP is tough to work with and feels a bit crazy but it is good with
> standards (it can sort out any DTD I have ever thrown at it).
> xml-light is, well, very broken (it doesn't even support charcode
> switching). There are several XML parsers in OCaml and I've had a
> stint with a few of them; the only two I would consider using are
> expat and PXP, with a marked preference for the latter. PXP can be very
> confusing and feels over-engineered at times but it does the job. And
> remember, parsing XML is a hard job, much harder than we often give it
> credit for.
> Hats off to Gerd for providing us with a proper parser.

Thanks. Initially, I thought XML was an easy format, because it looks
easy. But the specs are really challenging and full of bad compromises,
and I would expect a widely adopted standard to undergo some
evaluation of its practicability before it is published. For instance,
there are very strict rules about where whitespace is required in XML and
where it must not occur. E.g. <tag x="a"y="b"> is illegal
because of the missing space between the attributes. The whitespace
rules make it practically impossible to use a yacc-generated parser (my
first attempt was ocamlyacc-based, and it sort of worked after
implementing lots of parsing tricks, but it was impossible to fix all
errors, even though the XML grammar is quite short). There are
further complications in the XML standard, and all in all it is very
difficult to implement even at the most basic level. So there are
many parsers out there that do not implement the full standard, but only
a subset, because this is easier and parsing is faster.
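
To make the whitespace rule concrete, here is a minimal hand-written
sketch of the start-tag production, STag ::= '<' Name (S Attribute)* S? '>'.
It is not how PXP does it; the names (valid_start_tag etc.) are made up
for illustration, and it assumes ASCII-only names (without the special
rules for the first character of a name) and does no checking inside
attribute values.

```ocaml
(* Simplified check of  STag ::= '<' Name (S Attribute)* S? '>'.
   Assumptions: ASCII-only names, no first-character rule for names,
   no entity or '<' checks inside attribute values.
   All names here are hypothetical helpers, not a library API. *)
let is_space c = c = ' ' || c = '\t' || c = '\n' || c = '\r'

let is_name_char c =
  match c with
  | 'a'..'z' | 'A'..'Z' | '0'..'9' | '_' | '-' | '.' | ':' -> true
  | _ -> false

let valid_start_tag (s : string) : bool =
  let n = String.length s in
  (* Advance from [i] while [p] holds; return the first index where it fails. *)
  let rec skip p i = if i < n && p s.[i] then skip p (i + 1) else i in
  (* Parse  Name S? '=' S? quoted-value  at [i]; return the index after it. *)
  let attribute i =
    let j = skip is_name_char i in
    if j = i then None
    else
      let k = skip is_space j in
      if k >= n || s.[k] <> '=' then None
      else
        let q = skip is_space (k + 1) in
        if q >= n || (s.[q] <> '"' && s.[q] <> '\'') then None
        else
          match String.index_from_opt s (q + 1) s.[q] with
          | Some close -> Some (close + 1)
          | None -> None
  in
  (* After the tag name or an attribute: optional S then '>',
     or mandatory S followed by another attribute. *)
  let rec rest i =
    let j = skip is_space i in
    if j = n - 1 && s.[j] = '>' then true
    else if j = i then false  (* whitespace before the attribute is missing *)
    else
      match attribute j with
      | Some k -> rest k
      | None -> false
  in
  if n < 3 || s.[0] <> '<' then false
  else
    let i = skip is_name_char 1 in
    i > 1 && rest i
```

With this sketch, valid_start_tag accepts <tag x="a" y="b"> and rejects
<tag x="a"y="b">. The point is that the whitespace is part of the
grammar itself, so it cannot simply be thrown away by the lexer as a
yacc-style setup would normally do.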

There is much more to say about the shortcomings of XML and of the XML
standardization process. It is now an unnecessarily complicated
technology. I would advise everybody to use it only when there is no way
around it, e.g. for exchange of structured data between organizations.

I now have a few hours of sponsorship for PXP. I'll try to improve the
documentation, because there are some parts that need more explanation
(where people feel it is over-engineered, but as Vincent pointed out,
it's the standard that demands it).


> Till
> On Thu, Sep 18, 2008 at 9:38 AM, Vincent Hanquez wrote:
> > On Wed, Sep 17, 2008 at 11:58:05AM -0700, Dario Teixeira wrote:
> >> Given a string containing a mathematical expression in the MathML
> >> markup, I need to verify that the expression is indeed valid MathML.
> >> I am therefore looking for an XML library that can verify an expression
> >> against a given DTD.
> >>
> >> Now, I have tried Xml-light, and the code I used is listed below.
> >> Unfortunately, it fails when trying to parse MathML's DTD (it's the
> >> standard DTD from the W3C).  I have tried simpler DTDs, and it does work
> >> with them; am I therefore correct in assuming that Xml-light can only
> >> handle a particular version/subset of DTD features?
> >
> > I don't know about validation (I'd probably suggest looking at PXP though),
> > but xml-light is very bad for XML compliance. The library is (happily) parsing
> > XML files that it shouldn't, which tells a lot about its validation
> > abilities ...
> >
> > for example, the XML supported character range is not even checked:
> >
> > XML 1.0 specification -- 2.2 Characters
> >
> > Char       ::=          #x9 | #xA | #xD | [#x20-#xD7FF] |
> >                [#xE000-#xFFFD] | [#x10000-#x10FFFF]
> >
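
For reference, the Char production quoted above translates directly
into a range check on code points. This is only a sketch; is_xml_char
is a made-up name, and it assumes the input has already been decoded
into integer code points (the decoding step is not shown).

```ocaml
(* Sketch of the XML 1.0 "Char" production (section 2.2).
   [is_xml_char] is a hypothetical helper, not part of any library.
   [cp] is assumed to be an already-decoded Unicode code point. *)
let is_xml_char (cp : int) : bool =
  cp = 0x9 || cp = 0xA || cp = 0xD
  || (cp >= 0x20 && cp <= 0xD7FF)
  || (cp >= 0xE000 && cp <= 0xFFFD)
  || (cp >= 0x10000 && cp <= 0x10FFFF)
```

So, for example, most control characters below #x20, the surrogate
range #xD800-#xDFFF, and #xFFFE/#xFFFF all have to be rejected by a
conforming parser.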
> > other problems include (incomplete list):
> > - complete unicode un-awareness
> > - funny & wrong entities handling
> >
> > --
> > Vincent
> >
> > _______________________________________________
> > Caml-list mailing list. Subscription management:
> >
> > Archives:
> > Beginner's list:
> > Bug reports:
> >
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
Phone: +49-6151-153855                  Fax: +49-6151-997714