XML library for validating MathML
| From: | Gerd Stolpmann <info@g...> |
| Subject: | Re: [Caml-list] XML library for validating MathML |
On Thursday, 18.09.2008, at 10:58 -0700, Dario Teixeira wrote:
> Hi,
>
> Well, as it turns out, building a basic "Hello World" in PXP is relatively
> simple (I followed the manual which is very helpful in the beginning).
> However, though the DTD validation works fine with the simple examples I tried,
> it fails for a MathML document. Note that I am using the DTD as provided
> by the W3C, available from here: http://www.w3.org/Math/DTD/mathml2.tgz
>
> When processing the MathML DTD, PXP outputs a few warnings about entities
> declared twice, about names reserved for future extensions, and quite a
> lot of warnings about code points that cannot be represented. I can ignore
> those for now.
Code points: note that PXP defaults to ISO-8859-1 as its character set. Run
it in UTF-8 mode to get rid of these warnings.
> When it does fail, this is the error produced:
>
> In entity ent-isonum = PUBLIC "-//W3C//ENTITIES Numeric and Special Graphic for MathML 2.0//EN" "isonum.ent", at line 28, position 44:
> Called from entity [dtd] = SYSTEM "mathml2.dtd", line 1969, position 0:
> ERROR (Well-formedness constraint): The character '&' must be written as '&amp;'
>
>
> Looking at the "isonum.ent" file (packaged with the W3C zip), these are
> the contents of line 28, where the error occurs:
>
> <!ENTITY amp "&#x26;&#x26;" ><!--=ampersand -->
Well, inner entities are expanded again whenever an entity is expanded.
The correct way to define &amp; is
<!ENTITY amp "&#x26;#x26;">
i.e. no second &. At _definition_ time this gives "&#x26;" (the first
&#x26; is expanded), and at _use_ time you finally get &. With the wrong
definition you get && at definition time, and this is simply an illegal
character sequence.
PXP defines &amp; by default as "&#38;#38;", which is just the same in
decimal notation, and is also what the XML spec recommends.
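To make the two expansion steps concrete, here is a toy sketch in plain OCaml. It is not PXP code and does not reproduce how PXP (or any real XML parser) is implemented: the names expand_char_refs and define_then_use are made up for this illustration, only ASCII numeric character references are handled, and the "use time" step is simplified to a second round of character reference expansion in which a bare '&' counts as an error.

```ocaml
(* Toy model only: expand numeric character references (&#NN; or &#xHH;)
   once, left to right. A '&' that does not start such a reference is
   reported as an error. Only code points below 128 are supported. *)
let expand_char_refs (s : string) : (string, string) result =
  let n = String.length s in
  let buf = Buffer.create n in
  let rec go i =
    if i >= n then Ok (Buffer.contents buf)
    else if s.[i] <> '&' then begin
      Buffer.add_char buf s.[i];
      go (i + 1)
    end
    else
      match String.index_from_opt s i ';' with
      | Some j when i + 2 < j && s.[i + 1] = '#' ->
        let digits = String.sub s (i + 2) (j - i - 2) in
        (* "x26" becomes "0x26" so that int_of_string_opt reads it as hex *)
        let literal = if digits.[0] = 'x' then "0" ^ digits else digits in
        (match int_of_string_opt literal with
         | Some code when code >= 0 && code < 128 ->
           Buffer.add_char buf (Char.chr code);
           go (j + 1)
         | _ -> Error ("unsupported reference at position " ^ string_of_int i))
      | _ -> Error ("bare '&' at position " ^ string_of_int i)
  in
  go 0

(* Step 1: the literal in the <!ENTITY ...> declaration is expanded once
   and stored. Step 2: the stored text is expanded again when the entity
   is referenced in a document. *)
let define_then_use (literal : string) : (string, string) result =
  match expand_char_refs literal with
  | Error e -> Error ("at definition time: " ^ e)
  | Ok replacement ->
    (match expand_char_refs replacement with
     | Error e -> Error ("at use time: " ^ e)
     | Ok text -> Ok text)

let () =
  let show literal =
    match define_then_use literal with
    | Ok text -> Printf.printf "%-14s -> %s\n" literal text
    | Error e -> Printf.printf "%-14s -> ERROR %s\n" literal e
  in
  List.iter show ["&#x26;#x26;"; "&#38;#38;"; "&#x26;&#x26;"]
```

In this model the first two literals end up as a single "&", while the doubled form is stored as "&&" and is rejected as soon as that stored text is looked at again.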
That W3C documents contain errors is nothing new, although it is a bit
surprising that they cannot even stick to the basics of their own
formalism. I suppose they used a hacked SGML parser when developing
MathML, since SGML is more liberal about lexical details.
Gerd
>
>
> Though 0x26 is indeed the codepoint for the ampersand character, I don't
> get why it appears twice. Is this a case of double escaping? Could this
> be the reason PXP chokes?
>
> Any thoughts?
>
> Best regards,
> Dario Teixeira
>
> P.S. This is the programme I used for testing. Its code is pretty much
> lifted from the PXP manual:
>
>
> open Pxp_document
> open Pxp_yacc
>
> class warner =
>   object
>     method warn w = print_endline ("WARNING: " ^ w)
>   end
>
> let rec print_structure n =
>   let ntype = n#node_type in
>   match ntype with
>   | T_element name ->
>     print_endline ("Element of type " ^ name);
>     let children = n#sub_nodes in
>     List.iter print_structure children
>   | T_data ->
>     print_endline "Data"
>   | _ ->
>     assert false
>
> let () =
>   try
>     let config = {default_config with warner = new warner} in
>     let doc = parse_document_entity config (from_file "test.xml") default_spec in
>     print_structure (doc#root)
>   with
>     exc -> print_endline (Pxp_types.string_of_exn exc)
--
------------------------------------------------------------
Gerd Stolpmann * Viktoriastr. 45 * 64293 Darmstadt * Germany
gerd@gerd-stolpmann.de http://www.gerd-stolpmann.de
Phone: +49-6151-153855 Fax: +49-6151-997714
------------------------------------------------------------