[cxml-devel] Parsing yields TEXT elements with #\Newline #\Tab

Marco Antoniotti marcoxa at cs.nyu.edu
Mon Nov 30 23:53:28 UTC 2009


Well... any ideas on this, or is the list dead?

Cheers

Marco




On Nov 25, 2009, at 14:09 , Marco Antoniotti wrote:

> Hi
>
> Months ago I posted this problem, and I still do not have a solution
> for it.
>
> Note that I do not have a DTD for the file I want to parse (and I
> neither want to nor am able to write one).
>
> The SBML XML files I want to read start like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <!-- Created by Gepasi 3.30 on March 17, 2003, 12:57 -->
> <sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
>
> Am I correct in assuming that CXML would be able to forgo the DTD if
> it could access the www.sbml.org site and find a DTD or an XSD there?
>
> Note that my problem is to get rid of the spurious TEXT elements.
>
> The URL http://www.sbml.org/sbml/level1 points to an XSD file.
>
> Cheers
> --
> Marco
>
>
>
>
>
> On Feb 6, 2009, at 16:34 , Marco Antoniotti wrote:
>
>>
>> On Feb 6, 2009, at 10:21 , David Lichteblau wrote:
>>
>>> Quoting Marco Antoniotti (marcoxa at cs.nyu.edu):
>>>> I get all my actual RUNE-DOM::ELEMENTs interleaved with "bogus" TEXT
>>>> elements containing just #\Newline and #\Tab (or more #\Tab).
>>>> This is obviously an artifact of parsing.  (See the attached figure
>>>> from a 15-minute CXML browser I whipped up.)
>>>>
>>>> What I do not know is (1) whether this makes sense or not, or (2)
>>>> whether it depends on my platform (LWM).
>>>
>>> Certainly -- XML preserves whitespace in character data, except for
>>> CRLF to LF normalization.
>>>
>>> There are no universally correct rules for whitespace normalization
>>> in XML, and in general, any change to whitespace could change the
>>> meaning of the document.
>>>
>>>
>>> One rule that is relatively common is to consider whitespace
>>> insignificant in "element content", i.e. in places where no
>>> non-whitespace text nodes are allowed by the DTD.
>>>
>>> This rule is implemented by CXML:MAKE-WHITESPACE-NORMALIZER
>>> (see http://common-lisp.net/project/cxml/sax.html#misc), which may
>>> be helpful in your case.
>>>
>>> However, note that the limitation to element content means that you
>>> actually need to write or find a DTD that matches your document.
>>> Without a DTD, this approach doesn't work.
>>
>> Ok.  I think I understand this.  I'll try
>> CXML:MAKE-WHITESPACE-NORMALIZER (I need to understand how to use it
>> first).
>>
>> However, let me ask you this too.  The SBML XML files start like
>> this:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <!-- Created by Gepasi 3.30 on March 17, 2003, 12:57 -->
>> <sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
>>
>> Am I correct in assuming that CXML would be able to forgo the DTD
>> if it could access the www.sbml.org site and find a DTD or an XSD
>> there?
>>
>> Pardon the naïveté of my questions, but I really do not know enough  
>> about XML.
>>
>> Cheers
>> --
>> Marco
>>
>>
>>
>>
>>
>>
>>
>>
>>>
>>>
>>>
>>> Other approaches are to use HTML rules for whitespace normalization
>>> (which are trickier to get right, though, and cxml does not provide
>>> a ready-to-use function for this) or to discard all whitespace.  It
>>> really depends on the schema and application.
>>>
>>>
>>> (Note that we would like to have some support for this in cxml,
>>> because whitespace rules also matter for indentation, and at some
>>> point we would like to have more flexible/correct/useful indentation
>>> modes in our serializer.  Whitespace stripping could be considered a
>>> form of indentation, in the sense that it is a "removal of all
>>> indentation".  But so far, I haven't found the time to implement
>>> anything in this direction.)
>>>
>>>
>>> d.
>>
>> --
>> Marco Antoniotti
>>
>>
>
> --
> Marco Antoniotti
>
>

--
Marco Antoniotti
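
The "discard all whitespace" option mentioned in the quoted reply needs
no DTD, so it may be the simplest way to get rid of the spurious TEXT
nodes.  What follows is only a rough sketch of that idea, written
against Python's standard xml.dom.minidom rather than CXML; it is not
CXML code and not anything proposed in the thread.  The SAMPLE document
and the helper names (is_whitespace_only, strip_whitespace_text,
count_text_nodes) are made up for illustration, and the elements below
<sbml> are not real SBML.

```python
from xml.dom import minidom

# Hypothetical SBML-like input; the elements below <sbml> are invented.
SAMPLE = (
    '<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">\n'
    '\t<model name="demo">\n'
    '\t\t<notes>keep this text</notes>\n'
    '\t</model>\n'
    '</sbml>'
)

XML_WHITESPACE = " \t\r\n"  # the four whitespace characters XML defines


def is_whitespace_only(node):
    """True for a text node holding nothing but XML whitespace."""
    return (node.nodeType == node.TEXT_NODE
            and node.data.strip(XML_WHITESPACE) == "")


def strip_whitespace_text(node):
    """Recursively remove whitespace-only text nodes below NODE."""
    # Iterate over a copy, because removeChild mutates node.childNodes.
    for child in list(node.childNodes):
        if is_whitespace_only(child):
            node.removeChild(child)
        else:
            strip_whitespace_text(child)
    return node


def count_text_nodes(node):
    """Count the text nodes in the subtree rooted at NODE."""
    own = 1 if node.nodeType == node.TEXT_NODE else 0
    return own + sum(count_text_nodes(c) for c in node.childNodes)


if __name__ == "__main__":
    root = minidom.parseString(SAMPLE).documentElement
    print("text nodes before:", count_text_nodes(root))
    strip_whitespace_text(root)
    print("text nodes after: ", count_text_nodes(root))
```

This removes whitespace-only nodes everywhere, including inside mixed
content where they may be significant, so it is only appropriate when
the schema and application do not care about such whitespace.  The same
tree walk should carry over to any DOM implementation that exposes
child nodes, node types, and child removal.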





