[cxml-devel] skipping the DTD decl. when `validate' is `nil'

David Lichteblau david at lichteblau.com
Sat Sep 9 10:04:26 UTC 2006


Quoting Sean Champ (gimmal at gmail.com):
> I've noticed that the function CXML::P/DOCUMENT calls the function
> CXML::P/DOCTYPE-DECL whether or not CXML::P/DOCUMENT has received a non-nil
> VALIDATE argument. I'd thought to try and make a patch about that, myself, but
> I'm not sure how to address the matter.

This seems to be a bit of an FAQ lately.

> If the input stream is not being validated, the contents of the doctype decl
> should not matter, for anything -- need not be put, in any part, as an
> argument to CXML::XSTREAM-OPEN-EXTID. Yet, when the parser is validating, the
> text of the doctype decl. still must be 'skipped' by the parser.

The DTD is parsed so that entity references can be resolved.

You cannot skip the doctypedecl entirely: The internal subset must
always be processed.[1]
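
To illustrate the point with a parser other than cxml: the following
is a minimal sketch in Python, using the standard library's
xml.etree.ElementTree (a non-validating, expat-based parser).  It is
not cxml code, and the document strings and the helper parse_text are
made up for the example.  It only shows that an entity declared in the
internal subset is needed to resolve a reference in the content, even
though nothing is being validated.

```python
import xml.etree.ElementTree as ET

# Internal subset declares the entity; no external DTD is involved.
WITH_SUBSET = (
    '<!DOCTYPE greeting [<!ENTITY who "world">]>'
    '<greeting>Hello, &who;!</greeting>'
)
# Same document with the doctypedecl dropped, as if it had been skipped.
WITHOUT_SUBSET = '<greeting>Hello, &who;!</greeting>'


def parse_text(document):
    """Return the root element's text, or None if the parse fails."""
    try:
        return ET.fromstring(document).text
    except ET.ParseError:
        return None


resolved = parse_text(WITH_SUBSET)     # reference replaced by "world"
skipped = parse_text(WITHOUT_SUBSET)   # undeclared entity, parse fails
print(resolved)
print(skipped)
```

With the doctypedecl gone, the reference to "who" is undeclared and the
document can no longer be parsed at all, let alone validated.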

> Looking at CXML::P/DOCTYPE-DECL, I'm not sure how to make the parser skip the
> text of the decl, or what it could possibly return when skipping it. I could
> appreciate advice on the matter.

It is true that we could skip the external subset.

The XML spec allows non-validating parsers to report but not resolve
entity references.  ("Note that non-validating processors are not
obligated to to [sic] read and process entity declarations occurring in
parameter entities or in the external subset [...]"[2]  And later, "For
example, a non-validating processor may fail to [...] include the
replacement text of internal entities"[3]).

That would allow a change like this:
  * Add a new keyword argument to the parser, perhaps called
    RESOLVE-ENTITY-REFERENCES, defaulting to T.
  * NIL allowed only if VALIDATE is also NIL.
  * If NIL, skip parsing of the external subset and of external entities.
  * Invent a new SAX event, perhaps called SAX:GENERAL-ENTITY-REFERENCE
    to report such entity references instead of resolving them.
  * In the DOM builder, construct an EntityReference accordingly
    (assuming it is OK to create EntityReference nodes without children
    just because we do not -have- those children; see below).
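
As a rough sketch of what "report instead of resolve" would mean for
character data, here is a toy model in Python.  It is not cxml code and
not a real parser: the function content_events, the event names, and
the regular expression are all hypothetical, and it ignores predefined
entities such as &amp; and character references.  It only shows the
shape of the event stream in the two modes.

```python
import re

# Simplified pattern for a general entity reference (an assumption for
# this sketch; real XML names allow more characters).
ENTITY_REF = re.compile(r"&([A-Za-z_][\w.-]*);")


def content_events(text, entities, resolve_entity_references=True):
    """Turn character data into a list of SAX-like (event, payload) tuples."""
    events = []
    pos = 0
    for match in ENTITY_REF.finditer(text):
        if match.start() > pos:
            events.append(("characters", text[pos:match.start()]))
        name = match.group(1)
        if resolve_entity_references:
            # Needs the declaration; raises KeyError if it is missing.
            events.append(("characters", entities[name]))
        else:
            # Hypothetical new event: report the reference by name only.
            events.append(("general-entity-reference", name))
        pos = match.end()
    if pos < len(text):
        events.append(("characters", text[pos:]))
    return events


resolved = content_events("Hello, &who;!", {"who": "world"})
reported = content_events("Hello, &who;!", {}, resolve_entity_references=False)
print(resolved)
print(reported)
```

In element content the unresolved reference fits naturally between two
character events, which is why the proposal looks plausible at this
level.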

The big, bad problem with this, however:
  * Extending the SAX event START-ELEMENT so that attribute values can
    contain unresolved entity references is not so attractive.
  * Even if we worked around that, when an Attr node has an
    EntityReference with unknown content, what is supposed to happen
    when reading the attribute value?  According to the DOM spec, the
    attribute value is constructed by resolving entity references.  Are
    we supposed to signal an error then?
  * And that's only the user-visible side of this problem.  Internally,
    the parser assumes that attribute values are strings, too.

So, I am not at all convinced this would be worth it.  And, as explained
above, I do not see how to make it work with DOM or even SAX.

Most people asking about this so far were just too lazy to type "apt-get
install w3c-dtd-xhtml" anyway...

It would, however, be interesting to learn about other
standards-conforming XML parsers with support for SAX and/or DOM that do
anything like this.  Pointers appreciated.


David
----
[1] Well, up to the first reference to an external parameter entity.
[2] http://www.w3.org/TR/REC-xml/#wf-entdeclared
[3] http://www.w3.org/TR/REC-xml/#safe-behavior


