[cxml-devel] skipping the DTD decl. when `validate' is `nil'

Sean Champ gimmal at gmail.com
Sat Sep 9 14:16:54 UTC 2006


On 09-09-06, David Lichteblau wrote:
> Quoting Sean Champ (gimmal at gmail.com):
>
> > If the input stream is not being validated, the contents of the doctype decl
> > should not matter, for anything -- need not be put, in any part, as an
> > argument to CXML::XSTREAM-OPEN-EXTID. Yet, when the parser is validating, the
> > text of the doctype decl. still must be 'skipped' by the parser.
>
> The DTD is parsed so that entity references can be resolved.

You know, I should apologize for my initial proposal, which now appears to
have been somewhat naive. I had not recalled that entity declarations might
be found in the content of a DTD.

>
> You cannot skip the doctypedecl entirely: The internal subset must
> always be processed.[1]
>
> > Looking at CXML::P/DOCTYPE-DECL, I'm not sure how to make the parser skip the
> > text of the decl, or what it could possibly return when skipping it. I could
> > appreciate advice on the matter.
>
> It is true that we could skip the external subset.
>
> The XML spec allows non-validating parsers to report but not resolve
> entity references.  ("Note that non-validating processors are not
> obligated to to [sic] read and process entity declarations occurring in
> parameter entities or in the external subset [...]"[2]  And later, "For
> example, a non-validating processor may fail to [...] include the
> replacement text of internal entities"[3]).
>
> That would allow a change like this:
>   * Add a new keyword argument to the parser, perhaps called
>     RESOLVE-ENTITY-REFERENCES, defaulting to T.
>   * NIL allowed only if VALIDATE is also NIL.
>   * If NIL, skip parsing of the external subset and of external entities.
>   * Invent a new SAX event, perhaps called SAX:GENERAL-ENTITY-REFERENCE
>     to report such entity references instead of resolving them.
>   * In the DOM builder, construct an EntityReference accordingly
>     (assuming it is OK to create EntityReference nodes without children
>     just because we do not -have- those children; see below).
>
> The big, bad problem with this, however:
>   * Extending the SAX event START-ELEMENT so that attribute values can
>     contain unsolved entity references is not so attractive.

Looking at the source text of the method SAX:START-ELEMENT (DOM-BUILDER T T T T),
it looks like an attribute's value is carried in directly from each
SAX::ATTRIBUTE instance.

Looking at CXML::P/ATT-VALUE, then along the call tree from that function,
I would ask: could leaving entity references un-dereferenced be addressed in
CXML::READ-ATT-VALUE, and/or maybe in CXML::RECURSE-ON-ENTITY?
Would either of those be a good place to start?

I'll admit, I'm still largely unfamiliar with the CXML codebase, and with the
operations of CXML.


>   * Even if we worked around that, when an Attr node has an
>     EntityReference with unknown content, what is supposed to happen
>     when reading the attribute value?  According to the DOM spec, the
>     attribute value is constructed by resolving entity references.  Are
>     we supposed to signal an error then?

I had expected that the DOM spec would *not* require entity refs to be
dereferenced at the time DOM objects are constructed. Granted, I have not
been very familiar with the strictures of DOM.

I had expected that an entity ref would be decoded as some sort of object
representing an entity ref, and that the object would then be stored somehow,
for instance in a sequence, between two strings.

Yet, indeed, what I had expected does not match what they have
specified.


From http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID-221662474
regarding the interface 'Attr':

  value of type DOMString
    On retrieval, the value of the attribute is returned as a
    string. Character and general entity references are replaced with their
    values. See also the method getAttribute on the Element interface.

    On setting, this creates a Text node with the unparsed contents of the
    string, i.e. any characters that an XML processor would recognize as
    markup are instead treated as literal text. See also the method
    Element.setAttribute().

    Some specialized implementations, such as some [SVG 1.1] implementations,
    may do normalization automatically, even after mutation; in such case, the
    value on retrieval may differ from the value on setting.

May I say, that is a body of silly requirements. An entity reference anywhere
in an input 'infoset' should be decoded as an entity reference, and need be
dereferenced only when the thing must be rendered as something other than
its source text.

They require a DOM parser to lose meaningful information that is
represented in the input document: it loses the fact that "this was an entity
reference" by replacing the reference at parse time.

I cannot take that as reasonable. Furthermore, it is inconsistent
if they do not require entity refs to be dereferenced in the core XML spec,
but then require it in DOM.

I would note that this may be an issue worth putting to one of the W3C
XML-related lists, whichever list would be the most appropriate
for it. I cannot be certain it would not be futile for me to mention it,
however. I wonder whether I would only observe some people proceeding
to stick their fingers in their ears, and some of those making annoying
comments. I'm afraid that the common interest might be best served if I do
not join a W3C XML-related mailing list.


>   * And that's only the user-visible side of this problem, internally
>     the parser assumes that attribute values are strings, too.

If I were certain how to, I would be glad to try to address it, but
what of this matter of what the DOM spec specifies?

Per this proposal, the type of the 'value' property on the 'Attr' interface
would no longer be "DOMString". In fact, it would then cease to be 100%
compatible with DOM.

Given that DOM appears to include no VECTOR type, I suppose the value type for
the 'Attr.value' property, if kept in parallel with DOM, would
have to be revised to NodeList. That, then, could amount to a lot
of malarkey in the system: a *bunch* of single-element vectors of strings.

Beyond DOM, here, I would propose that the type of the value of an attribute's
'value' slot could be specified like so:
  (OR STRING (VECTOR (OR STRING DOM:ENTITY-REFERENCE)))

A value of that type could be coerced to a value of type DOMString, or to a
value of type NodeList, if and as required, perhaps using
a generalization of CL:COERCE, viz.

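 (defgeneric coerce* (instance type)
   ;; A type named by a symbol is first resolved to its class.
   (:method ((instance t) (type symbol))
     (coerce* instance (find-class type)))
   ;; Any other type specifier is parsed into the implementation's
   ;; internal type object (CMUCL and SBCL only).
   #+(or CMU SBCL)
   (:method ((instance t) (type t))
     (coerce* instance
              #+CMU (kernel::specifier-type type)
              #+SBCL (sb-kernel::specifier-type type)))
   ;; ... methods specialized on those type objects would follow
   )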
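
To make the proposed value type concrete, here is a minimal sketch of it in
portable Common Lisp, together with the flattening to a string that a
DOMString reader would need. Everything named here is an assumption for
illustration: ENTITY-REF is a stand-in struct, not CXML's DOM class, and
ATTRIBUTE-VALUE, MIXED-VALUE-VECTOR-P and ATTRIBUTE-VALUE-STRING are
hypothetical names. The RESOLVER argument is assumed to be a function from an
entity name to its replacement text.

```lisp
;; Hypothetical stand-in for a DOM EntityReference node; not CXML's class.
(defstruct entity-ref
  (name "" :type string))

(defun attribute-value-part-p (x)
  "True if X may appear as one element of a mixed attribute value."
  (or (stringp x) (entity-ref-p x)))

(defun mixed-value-vector-p (x)
  "True if X is a non-string vector of strings and entity refs."
  (and (vectorp x)
       (not (stringp x))
       (every #'attribute-value-part-p x)))

;; Approximates (OR STRING (VECTOR (OR STRING ENTITY-REF))), checking
;; the elements at run time via SATISFIES.
(deftype attribute-value ()
  '(or string (satisfies mixed-value-vector-p)))

(defun attribute-value-string (value resolver)
  "Flatten VALUE to a string, calling RESOLVER on each entity name."
  (etypecase value
    ;; STRING must come first, since every string is also a vector.
    (string value)
    (vector
     (with-output-to-string (out)
       (loop for part across value
             do (write-string
                 (etypecase part
                   (string part)
                   (entity-ref (funcall resolver (entity-ref-name part))))
                 out))))))
```

With that, the source form of a value is kept, and the dereferenced string is
computed only when somebody asks for it.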



> So, I am not at all convinced this would be worth it.  And, as explained
> above, I do not see how to make it work with DOM or even SAX.

Why I regard it as worth it:

 1) to make an *accurate* rendering of an input document, as might be done
    via a DOM node presented through a CLIM presentation method

 2) to be able to store a DOM node as an *accurate* representation of what was
    in the input document, without any entity refs dereferenced.


It appears that it cannot be made to fit with DOM as DOM has been
defined, up to Level 3. I wonder if they would welcome it if the inconsistency
about the Attr.value slot were brought to their attention.

I cannot find a definition of the SAX API, even at what is supposed to be the
official homepage for it, http://www.saxproject.org/ . If they have not
defined an event-signaling function that would be called on encountering an
entity ref, perhaps that has been an oversight in the design that they might
want to address, also.


As for how it might be implemented in CXML, besides modifying
CXML::RECURSE-ON-ENTITY (if I have found the right function for it,
there), perhaps the parameters of the matter could be represented as
slot values on an instance of a modified DOM-builder class.

Perhaps slots might be defined on RUNE-DOM::DOM-BUILDER like so:

 1) validation policy (slot validate-p ?)

 2) entity-dereferencing policy (slot dereference-entity-p ?)

 3) whitespace-normalization policy (slot normalize-whitespace-p ?)

 4) namespace-handling policy (???-namespace-???-p ??) *

 5) documentary-schema ? (slot documentary-schema -- could reference a DTD or
    an XSD, if not either a Relax NG XML or Relax NG non-XML thing)

 6) catalogue (?) (initform:  CXML:*CATALOG* )

 7) entity-resolver (entity-resolver ?) (nil or function?)

 8) internal-subset handling policy (allow-internal-subset-p ?)

 9) recoder ? I am not aware of what effects depend on the RECODE argument
    to CXML::P/DOCUMENT, so I cannot map that to a slot here.

 * At present, I'm not aware of what is handled about namespaces by
   a CXML::NAMESPACE-NORMALIZER that would not be handled directly by
   an instance of the class RUNE-DOM::DOM-BUILDER. So, I cannot really
   map that to a slot.

 With the class DOM-BUILDER revised like so, the keyword args could then be
 removed from CXML::P/DOCUMENT, which would instead extract the values
 of those args from the 'handler' object.
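
A minimal sketch of a few of those slots, assuming a standalone class rather
than the real RUNE-DOM::DOM-BUILDER. The class name POLICY-BUILDER, the
accessor names, and all of the default values are hypothetical; the catalog
slot defaults to NIL here only so the sketch loads without CXML, where the
proposal above would use CXML:*CATALOG*. The check in INITIALIZE-INSTANCE
follows the constraint David stated, that dereferencing may be turned off
only when validation is also off.

```lisp
;; Hypothetical class; slot names follow the list above, not CXML's source.
(defclass policy-builder ()
  ((validate-p
    :initarg :validate-p :initform nil :reader validate-p)
   (dereference-entity-p
    :initarg :dereference-entity-p :initform t :reader dereference-entity-p)
   (normalize-whitespace-p
    :initarg :normalize-whitespace-p :initform t
    :reader normalize-whitespace-p)
   ;; Assumption: NIL stands in for CXML:*CATALOG* in this sketch.
   (catalog
    :initarg :catalog :initform nil :reader catalog)
   ;; NIL, or a function used to resolve external entities.
   (entity-resolver
    :initarg :entity-resolver :initform nil :reader entity-resolver)
   (allow-internal-subset-p
    :initarg :allow-internal-subset-p :initform t
    :reader allow-internal-subset-p)))

(defmethod initialize-instance :after ((builder policy-builder) &key)
  ;; A validating parse needs the replacement text of every entity.
  (when (and (validate-p builder)
             (not (dereference-entity-p builder)))
    (error "Cannot validate without dereferencing entity references.")))

;; A non-validating builder that keeps entity refs as they were written:
;; (make-instance 'policy-builder :dereference-entity-p nil)
```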


If a modification of CXML::RECURSE-ON-ENTITY would be a part of it: I am not
sure how the entity-deref policy would best be passed through to that
function; it might be done with an additional argument, if not with a
dynamically bound (special) variable.

The ent-deref policy flag would have to be passed through such functions as
CXML::READ-ATT-VALUE (and anything else calling RECURSE-ON-ENTITY), starting,
firstly, from within CXML::P/ATT-VALUE. The value could be made available to
that function with a new special variable, I suppose, one that could be
bound within a form in CXML::P/DOCUMENT, whatever name is selected for the
variable.

Also, I'm not sure how the parsing would need to be revised for it
in CXML::RECURSE-ON-ENTITY and CXML::READ-ATT-VALUE; I hope I would be able to
figure it out.
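
A minimal sketch of passing the policy through a variable, assuming the
variable is a special variable, since a binding made in one function has to
be visible in the functions it calls. None of these names exist in CXML:
*DEREFERENCE-ENTITIES-P*, EXPAND-OR-REPORT and PARSE-WITH-POLICY are
hypothetical stand-ins for the flag, for the decision that would sit in
RECURSE-ON-ENTITY, and for the binding form in P/DOCUMENT.

```lisp
;; Hypothetical names throughout; these are stand-ins, not CXML functions.
(defvar *dereference-entities-p* t
  "When NIL, entity references are reported instead of expanded.")

(defun expand-or-report (name replacement-text)
  "Stand-in for the decision RECURSE-ON-ENTITY would have to make."
  (if *dereference-entities-p*
      replacement-text
      ;; Keep the reference itself so a handler can store it as-is.
      (list :entity-reference name)))

(defun parse-with-policy (dereference-p thunk)
  "Stand-in for P/DOCUMENT: bind the policy for the extent of the parse."
  (let ((*dereference-entities-p* dereference-p))
    (funcall thunk)))

;; The callee sees the binding without taking an extra argument:
;; (parse-with-policy nil (lambda () (expand-or-report "amp" "&")))
```

The extra-argument alternative would avoid the global, at the cost of
touching the signature of every function between P/ATT-VALUE and
RECURSE-ON-ENTITY.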


Certainly, some of the parser functions would have to be revised per the
proposed change, namely anything parsing an attribute value. If it would
be helpful, then once I have a definite start on it, I'd be willing to
fetch the W3C's tests, run through 'em, then revise the changed code until I
could be sure it works.

The slot-value type for SAX::ATTRIBUTE-VALUE would be, essentially, changed
from, say, rod-or-string (?) to (vector (or rod-or-string entity-ref))


> Most people asking about this so far were just too lazy to type "apt-get
> install w3c-dtd-xhtml" anyway...

I was trying to parse a document that uses the XBEL DTD, and to parse it
without validation. When I did so, I noticed that the HTTP URI in the DTD decl
was not handled. I thought that the decl. might be skipped entirely, then.

Granted, I do have an XBEL DTD locally, but
 - not everyone does
 - my Galeon bookmarks file would not be valid against the XBEL DTD, anyway
 - I had not realized that entity defs may occur in external
   DTD subsets
 - I thought that I could propose that the DTD decl might be skipped (without
   my having to modify /etc/xml/catalog first)

For what it's worth, it appears that, in a Debian environment, the shell
command `update-xmlcatalog' would be the right command to use for modifying
/etc/xml/catalog or a similar file. That shell command is used in the
w3c-dtd-xhtml package's postinst script. Now I know that much.



If it would be preferred, I could go over this message body again to
make a really formal design proposal on the matter. I've been using DocBook
a lot; it should be as simple as making a refentry page about the proposal
and submitting it via this mailing list.




Danke

--
Sean Champ



> ----
> [1] Well, up to the first reference to an external parameter entity.
> [2] http://www.w3.org/TR/REC-xml/#wf-entdeclared
> [3] http://www.w3.org/TR/REC-xml/#safe-behavior


