[cxml-devel] Parsing yields TEXT elements with #\Newline #\Tab

Marco Antoniotti marcoxa at cs.nyu.edu
Fri Feb 6 15:34:20 UTC 2009


On Feb 6, 2009, at 10:21, David Lichteblau wrote:

> Quoting Marco Antoniotti (marcoxa at cs.nyu.edu):
>> I get all my actual RUNE-DOM::ELEMENTs interleaved with "bogus" TEXT
>> elements containing just #\Newline and #\Tab (or more #\Tab).
>> This is obviously an artifact of parsing.  (See attached figure from a
>> 15-minute CXML browser I whipped up)
>>
>> What I do not know is (1) whether this makes sense or not, or (2)
>> whether it is dependent on my platform (LWM).
>
> Certainly -- XML preserves whitespace in character data, except for
> CRLF to LF normalization.
>
> There are no universally correct rules for whitespace normalization in
> XML, and in general, any change to whitespace could change the meaning
> of the document.
>
>
> One rule that is relatively common is to consider whitespace
> insignificant in "element content", i.e. in places where no
> non-whitespace text nodes are allowed by the DTD.
>
> This rule is implemented by CXML:MAKE-WHITESPACE-NORMALIZER
> (see http://common-lisp.net/project/cxml/sax.html#misc), which may be
> helpful in your case.
>
> However, note that the limitation to element content means that you
> actually need to write or find a DTD that matches your document.
> Without a DTD, this approach doesn't work.

Ok.  I think I understand this.  I'll try
CXML:MAKE-WHITESPACE-NORMALIZER (I need to understand how to use it
first).
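
To make the effect concrete, here is a minimal sketch of the behaviour described in the quoted text. It is not CXML: it uses Python's standard xml.dom.minidom as a stand-in DOM, and the sample document and the helper names (child_summary, strip_whitespace_only_text) are made up for illustration. It shows that a conforming parser reports the newline/tab runs between elements as text nodes, and then applies the crudest fix, dropping every whitespace-only text node without consulting a DTD. That is only safe when the document has no mixed content; it is not what CXML:MAKE-WHITESPACE-NORMALIZER does.

```python
from xml.dom import minidom

# Hypothetical sample: elements indented with newlines and tabs.
SAMPLE = '<model>\n\t<species name="A"/>\n\t<species name="B"/>\n</model>'

XML_WHITESPACE = " \t\r\n"  # the four whitespace characters XML defines


def child_summary(element):
    """List each child as ("text", data) or ("element", tag name)."""
    summary = []
    for node in element.childNodes:
        if node.nodeType == node.TEXT_NODE:
            summary.append(("text", node.data))
        elif node.nodeType == node.ELEMENT_NODE:
            summary.append(("element", node.tagName))
    return summary


def strip_whitespace_only_text(element):
    """Recursively remove text nodes that contain only XML whitespace."""
    for node in list(element.childNodes):  # copy: we mutate while iterating
        if node.nodeType == node.TEXT_NODE:
            if node.data.strip(XML_WHITESPACE) == "":
                element.removeChild(node)
        elif node.nodeType == node.ELEMENT_NODE:
            strip_whitespace_only_text(node)


doc = minidom.parseString(SAMPLE)
root = doc.documentElement

before = child_summary(root)      # elements interleaved with whitespace text
strip_whitespace_only_text(root)
after = child_summary(root)       # only the elements remain

print(before)
print(after)
```

The interleaved text nodes in the first listing are the same kind of node as the "bogus" TEXT elements above, which suggests the behaviour is not platform-specific.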

However, let me ask you this too.  The SBML XML files start like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Created by Gepasi 3.30 on March 17, 2003, 12:57 -->
<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">

Am I correct in assuming that CXML would be able to forgo the DTD if
it could access the www.sbml.org site and find a DTD or an XSD there?

Pardon the naïveté of my questions, but I really do not know enough  
about XML.
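
One general XML fact bears on this question: the xmlns value is a namespace name, an identifier that parsers compare as a string, and it is not by itself a reference to a DTD or schema. A DTD is associated with a document through a DOCTYPE declaration, which the header above does not contain. The sketch below illustrates this with Python's standard xml.dom.minidom rather than CXML, so it says nothing about what CXML itself would do. The closing tag is added only to make the fragment well-formed.

```python
from xml.dom import minidom

# The header quoted above, plus a closing tag so the fragment parses.
SBML_HEAD = (
    '<?xml version="1.0" encoding="UTF-8"?>\n'
    '<!-- Created by Gepasi 3.30 on March 17, 2003, 12:57 -->\n'
    '<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">\n'
    '</sbml>'
)

# Parsing happens entirely in memory; nothing is fetched from www.sbml.org.
doc = minidom.parseString(SBML_HEAD.encode("utf-8"))
root = doc.documentElement

namespace_name = root.namespaceURI      # just a string attached to the element
has_doctype = doc.doctype is not None   # no <!DOCTYPE ...> in the document

print(namespace_name)
print(has_doctype)
```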

Cheers
--
Marco

> Other approaches are to use HTML rules for whitespace normalization
> (which are trickier to get right, though, and cxml does not provide
> a ready-to-use function for this) or to discard all whitespace.  It
> really depends on the schema and application.
>
>
> (Note that we would like to have some support for this in cxml,
> because whitespace rules also matter for indentation, and at some
> point we would like to have more flexible/correct/useful indentation
> modes in our serializer.  Whitespace stripping could be considered as
> a form of indentation, in the sense that it is a "removal of all
> indentation".  But so far, I haven't found the time to implement
> anything in this direction.)
>
>
> d.

--
Marco Antoniotti

