[cxml-devel] Parsing yields TEXT elements with #\Newline #\Tab

Marco Antoniotti marcoxa at cs.nyu.edu
Wed Nov 25 13:09:14 UTC 2009


Hi

Months ago I posted this problem, and I still do not have a solution
for it.

Note that I do not have a DTD for the file I want to parse (and I do
not want to write one, nor could I).

The SBML XML files I want to read start like this:

<?xml version="1.0" encoding="UTF-8"?>
<!-- Created by Gepasi 3.30 on March 17, 2003, 12:57 -->
<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">

Am I correct in assuming that CXML would be able to forgo the DTD if
it could access the www.sbml.org site and find a DTD or an XSD there?

Note that my problem is to get rid of the spurious TEXT elements.

The namespace URI http://www.sbml.org/sbml/level1 points to an XSD file.
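
For reference, the DTD-free fallback mentioned in the quoted reply
below (discarding all whitespace) amounts to a small post-processing
pass over the DOM.  What follows is only a sketch of that idea in
Python, using the standard library's xml.dom.minidom rather than CXML.
The sample document and the name strip_whitespace_text are made up for
illustration, and the sketch assumes that whitespace-only text is never
significant in the files at hand.

```python
from xml.dom import minidom, Node

# Hypothetical sample in the spirit of the SBML header; the <model> and
# <notes> children are invented for this illustration.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
\t<model name="demo">
\t\t<notes>keep this text</notes>
\t</model>
</sbml>
"""


def strip_whitespace_text(node):
    """Recursively remove child text nodes that contain only whitespace."""
    # Iterate over a copy, because childNodes shrinks as nodes are removed.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE and child.data.strip() == "":
            node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_whitespace_text(child)
    return node


if __name__ == "__main__":
    # Parse from bytes so the encoding declaration is honoured as written.
    doc = minidom.parseString(SAMPLE.encode("utf-8"))
    root = doc.documentElement
    print("children before:", [c.nodeName for c in root.childNodes])
    strip_whitespace_text(root)
    print("children after: ", [c.nodeName for c in root.childNodes])
```

Text nodes with any non-whitespace content are left alone, so this is
only safe for documents where indentation is the sole source of
whitespace-only text.  The same walk should carry over to a CXML DOM,
but that is untested here.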

Cheers
--
Marco





On Feb 6, 2009, at 16:34, Marco Antoniotti wrote:

>
> On Feb 6, 2009, at 10:21, David Lichteblau wrote:
>
>> Quoting Marco Antoniotti (marcoxa at cs.nyu.edu):
>>> I get all my actual RUNE-DOM::ELEMENTs interleaved with "bogus" TEXT
>>> elements containing just #\Newline and #\Tab (or more #\Tab).
>>> This is obviously an artifact of parsing.  (See the attached figure
>>> from a 15-minute CXML browser I whipped up.)
>>>
>>> What I do not know is (1) whether this makes sense or not, or (2)
>>> whether it depends on my platform (LWM).
>>
>> Certainly -- XML preserves whitespace in character data, except for
>> CRLF to LF normalization.
>>
>> There are no universally correct rules for whitespace normalization
>> in XML, and in general, any change to whitespace could change the
>> meaning of the document.
>>
>>
>> One rule that is relatively common is to consider whitespace
>> insignificant in "element content", i.e. in places where no
>> non-whitespace text nodes are allowed by the DTD.
>>
>> This rule is implemented by CXML:MAKE-WHITESPACE-NORMALIZER
>> (see http://common-lisp.net/project/cxml/sax.html#misc), which may be
>> helpful in your case.
>>
>> However, note that the limitation to element content means that you
>> actually need to write or find a DTD that matches your document.
>> Without a DTD, this approach doesn't work.
>
> Ok.  I think I understand this.  I'll try
> CXML:MAKE-WHITESPACE-NORMALIZER (I need to understand how to use it
> first).
>
> However, let me ask you this too.  The SBML XML files start like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!-- Created by Gepasi 3.30 on March 17, 2003, 12:57 -->
> <sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
>
> Am I correct in assuming that CXML would be able to forgo the DTD if
> it could access the www.sbml.org site and find a DTD or an XSD there?
>
> Pardon the naïveté of my questions, but I really do not know enough  
> about XML.
>
> Cheers
> --
> Marco
>
>>
>> Other approaches are to use HTML rules for whitespace normalization
>> (which are trickier to get right, though, and cxml does not provide
>> a ready-to-use function for this) or to discard all whitespace.  It
>> really depends on the schema and application.
>>
>>
>> (Note that we would like to have some support for this in cxml,
>> because whitespace rules also matter for indentation, and at some
>> point we would like to have more flexible/correct/useful indentation
>> modes in our serializer.  Whitespace stripping could be considered a
>> form of indentation, in the sense that it is a "removal of all
>> indentation".  But so far, I haven't found the time to implement
>> anything in this direction.)
>>
>>
>> d.
>
> --
> Marco Antoniotti
>
>

--
Marco Antoniotti

