[cxml-devel] Parsing yields TEXT elements with #\Newline #\Tab

Tue Dec 1 05:39:47 UTC 2009

Marco,

Maybe the problem isn't the liveness of the list (although, yes, it isn't particularly active), but rather the phrasing of your question:

"Am I correct in assuming that CXML would be able to forgo the DTD if it could access the www.sbml.org site and find a DTD or a XSD there?"

What are you suggesting CXML should do by ignoring (if that's what you mean by forgo) a DTD that it may or may not find at some site?

Do you have a URL to an SBML file with which the folks at home can try to convince CXML to forgo the DTD (whatever that means)?

thanks,

Cyrus

On Nov 30, 2009, at 3:53 PM, Marco Antoniotti wrote:

> Well...  Any ideas on this, or is the list dead?
> 
> Cheers
> 
> Marco
> 
> 
> 
> 
> On Nov 25, 2009, at 14:09 , Marco Antoniotti wrote:
> 
>> Hi
>> 
>> Months ago I posted this problem and I still do not have a solution for it.
>> 
>> Note that I do not have a DTD for the file I want to parse (and I neither want to nor am able to write one).
>> 
>> The SBML XML files I want to read start like this:
>> 
>> <?xml version="1.0" encoding="UTF-8"?>
>> <!-- Created by Gepasi 3.30 on March 17, 2003, 12:57 -->
>> <sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
>> 
>> Am I correct in assuming that CXML would be able to forgo the DTD if it could access the www.sbml.org site and find a DTD or a XSD there?
>> 
>> Note that my problem is to get rid of the spurious TEXT elements.
>> 
>> The URL http://www.sbml.org/sbml/level1 points to an XSD file.
>> 
>> Cheers
>> --
>> Marco
>> 
>> 
>> 
>> 
>> 
>> On Feb 6, 2009, at 16:34 , Marco Antoniotti wrote:
>> 
>>> 
>>> On Feb 6, 2009, at 10:21 , David Lichteblau wrote:
>>> 
>>>> Quoting Marco Antoniotti (marcoxa at cs.nyu.edu):
>>>>> I get all my actual RUNE-DOM::ELEMENTs interleaved with "bogus" TEXT
>>>>> elements containing just #\Newline and #\Tab (or more #\Tab).
>>>>> This is obviously an artifact of parsing.  (See attached figure from a
>>>>> 15-minute CXML browser I whipped up)
>>>>> 
>>>>> What I do not know is (1) whether this makes sense or not, or (2)
>>>>> whether it is dependent on my platform (LWM).
>>>> 
>>>> Certainly -- XML preserves whitespace in character data, except for CRLF
>>>> to LF normalization.
>>>> 
>>>> There are no universally correct rules for whitespace normalization in
>>>> XML, and in general, any change to whitespace could change the meaning
>>>> of the document.
>>>> 
>>>> 
>>>> One rule that is relatively common is to consider whitespace
>>>> insignificant in "element content", i.e. in places where no
>>>> non-whitespace text nodes are allowed by the DTD.
>>>> 
>>>> This rule is implemented by CXML:MAKE-WHITESPACE-NORMALIZER
>>>> (see http://common-lisp.net/project/cxml/sax.html#misc), which may be
>>>> helpful in your case.
>>>> 
>>>> However, note that the limitation to element content means that you
>>>> actually need to write or find a DTD that matches your document.
>>>> Without a DTD, this approach doesn't work.
>>> 
>>> Ok.  I think I understand this.  I'll try the CXML:MAKE-WHITESPACE-NORMALIZER (I need to understand how to use it first).
>>> 
>>> However, let me ask you this too.  The SBML XML files start like this:
>>> 
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <!-- Created by Gepasi 3.30 on March 17, 2003, 12:57 -->
>>> <sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
>>> 
>>> Am I correct in assuming that CXML would be able to forgo the DTD if it could access the www.sbml.org site and find a DTD or a XSD there?
>>> 
>>> Pardon the naïveté of my questions, but I really do not know enough about XML.
>>> 
>>> Cheers
>>> --
>>> Marco
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>>> 
>>>> 
>>>> 
>>>> Other approaches are to use HTML rules for whitespace normalization
>>>> (which are trickier to get right, though, and cxml does not provide
>>>> a ready-to-use function for this) or to discard all whitespace.  It
>>>> really depends on the schema and application.
>>>> 
>>>> 
>>>> (Note that we would like to have some support for this in cxml, because
>>>> whitespace rules also matter for indentation, and at some point we would
>>>> like to have more flexible/correct/useful indentation modes in our
>>>> serializer.  Whitespace stripping could be considered as a form of
>>>> indentation, in the sense that it is a "removal of all indentation".
>>>> But so far, I haven't found the time to implement anything in this
>>>> direction.)
>>>> 
>>>> 
>>>> d.
>>> 
>>> --
>>> Marco Antoniotti
>>> 
>>> 
>> 
>> --
>> Marco Antoniotti
>> 
>> 
> 
> --
> Marco Antoniotti
> 
> 
>
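
For readers following the thread: the "discard all whitespace" option mentioned above needs no DTD, and can be sketched as a plain DOM walk. The sketch below is only an illustration, written in Python with the standard library's xml.dom.minidom rather than CXML. The function names (strip_whitespace_text, child_kinds) and the sample document body are made up for this example; only the <sbml> root line comes from the thread. Since CXML's DOM follows the same W3C DOM model, a similar traversal should be possible there, but that is an assumption and is not shown here. Note that this removes every whitespace-only text node, which, as the thread points out, may change the meaning of documents where such whitespace matters.

```python
from xml.dom import minidom, Node

# Hypothetical sample: only the root element line is taken from the thread.
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">\n'
    b'\t<model name="demo">\n'
    b'\t\t<notes>keep  this text</notes>\n'
    b'\t</model>\n'
    b'</sbml>\n'
)

XML_WHITESPACE = " \t\r\n"  # the four whitespace characters defined by XML 1.0


def strip_whitespace_text(node):
    """Recursively remove text nodes that consist only of XML whitespace."""
    # Iterate over a copy, because removeChild mutates node.childNodes.
    for child in list(node.childNodes):
        if child.nodeType == Node.TEXT_NODE:
            if child.data.strip(XML_WHITESPACE) == "":
                node.removeChild(child)
        elif child.nodeType == Node.ELEMENT_NODE:
            strip_whitespace_text(child)
    return node


def child_kinds(element):
    """Return the node names of an element's direct children, for inspection."""
    return [child.nodeName for child in element.childNodes]


if __name__ == "__main__":
    doc = minidom.parseString(SAMPLE)
    root = doc.documentElement
    print("before:", child_kinds(root))
    strip_whitespace_text(root)
    print("after: ", child_kinds(root))
```

Text that contains anything other than whitespace, such as the content of the made-up <notes> element, is left untouched, including its internal spacing.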