[plexippus-xpath-devel] parsed input != serialized output?

David Lichteblau david at lichteblau.com
Sun Nov 28 19:46:11 UTC 2010


Quoting Andrei Stebakov (lispercat at gmail.com):
> Hello
> 
> I wonder if parse/serialize should arrive at the same string given to
> the parser?
> Let's say
> 
> (let ((sink (cxml:make-string-sink)))
>   (stp:serialize (chtml:parse "<p><div>some text</div></p>"
> (stp:make-builder)) sink)
>   (sax:end-document sink))
> 
> I would expect the result to be "<p><div>some text</div></p>", but
> instead it's "<p/><div>some text</div>" (with some <?xml ...>
> headers).

For XML (!), a parse/serialize round trip should preserve the content
model, but not necessarily the text character by character.  An XML
declaration doesn't affect the content model and can therefore change.
(You can suppress it explicitly with a keyword argument to the sink
constructor; in cxml that is :omit-xml-declaration-p.)
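
To illustrate the keyword argument mentioned above, here is a minimal
sketch.  It assumes Quicklisp is available for loading cxml, and the
sample document and the variable name are made up for the example.

```lisp
;; Assumption: Quicklisp is installed; any other way of loading cxml
;; works just as well.
(ql:quickload :cxml)

;; Parse a small XML document (made up for this example) and write it
;; straight back out through a string sink.  The keyword argument asks
;; the sink not to emit the <?xml ...?> declaration.
(defparameter *xml-without-declaration*
  (cxml:parse "<a><b>some text</b></a>"
              (cxml:make-string-sink :omit-xml-declaration-p t)))
```

The resulting string should contain the elements and text, but no XML
declaration in front of them.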

Also note that you're parsing HTML and writing XML.  Perhaps you would
prefer to write HTML again, i.e. make a sink using chtml:make-xyz-sink
instead of cxml:make-xyz-sink (here, chtml:make-string-sink)?
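
As a sketch of that variant, assuming Quicklisp is available for
loading closure-html and cxml-stp (the variable name is made up for the
example):

```lisp
;; Assumption: Quicklisp is installed; closure-html and cxml-stp can
;; also be loaded by other means.
(ql:quickload '(:closure-html :cxml-stp))

;; Parse the HTML into an STP document, then serialize it through
;; Closure HTML's own string sink, so that the output uses HTML syntax
;; rather than XML syntax.  For a document node, stp:serialize should
;; return whatever sax:end-document returns on the sink, which for a
;; string sink is the serialized string.
(defparameter *html-output*
  (stp:serialize (chtml:parse "<p><div>some text</div></p>"
                              (stp:make-builder))
                 (chtml:make-string-sink)))
```

This changes only the output syntax.  The parser's rearrangement of the
<p> element happens at parse time and is not undone by the choice of
sink.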

> Why would it rearrange the <p> tag in this manner? What other kinds of
> re-arrangement to expect?

This question is very specific to Closure HTML. For comparison, XML
parsers certainly wouldn't do this sort of re-arrangement.

However, Closure HTML tries to follow the HTML DTD.  div (a block
element) in p (itself a block element) isn't permitted in HTML (only
inline content is), so the parser does what browsers would also do, and
tries to "repair" the HTML to bring it closer to the DTD.  In your
example that presumably means the p, whose end tag is optional, gets
closed implicitly as soon as the div starts, which is why you see an
empty <p/> followed by the <div>.  (Closure HTML was written to do this
because it was actually part of a web browser, namely Closure.)

Whether users of a general-purpose parser expect this step is certainly
a different question.  Unfortunately I don't have a ready-to-use patch
to change this behaviour.

A special-purpose change for this particular test case is to tweak the
DTD as follows.  Changing the DTD works because much of the parser's
behaviour is not actually programmed in Lisp, but DTD-driven.  (Note
that the DTD is re-parsed only when the fasl loads, e.g. after a
restart of the Lisp.)


diff --git a/resources/dtd/DTD-HTML-4.0-Transitional b/resources/dtd/DTD-HTML-4.0-Transitional
index 82f0a74..f7f6c91 100644
--- a/resources/dtd/DTD-HTML-4.0-Transitional
+++ b/resources/dtd/DTD-HTML-4.0-Transitional
@@ -526,7 +526,7 @@
 
 <!--=================== Paragraphs =======================================-->
 
-<!ELEMENT P - O (%inline;)*            -- paragraph -->
+<!ELEMENT P - O (%block; | %inline;)*            -- paragraph -->
 <!ATTLIST P
   %attrs;                              -- %coreattrs, %i18n, %events --
   %align;                              -- align, text alignment --


d.



