[cxml-devel] Parsing yields TEXT elements with #\Newline #\Tab
Cyrus Harmon
ch-lisp at bobobeach.com
Tue Dec 1 05:39:47 UTC 2009
Marco,
Maybe the problem isn't the liveness of the list (although, yes, it isn't particularly active), but rather the phrasing of your question:
"Am I correct in assuming that CXML would be able to forgo the DTD if it could access the www.sbml.org site and find a DTD or an XSD there?"
What are you suggesting CXML should do by ignoring (if that's what you mean by forgo) a DTD that it may or may not find at some site?
Do you have a URL to an SBML file with which the folks at home can try to convince CXML to forgo the DTD (whatever that means)?
thanks,
Cyrus
On Nov 30, 2009, at 3:53 PM, Marco Antoniotti wrote:
> Well... Any ideas on this, or is the list dead?
>
> Cheers
>
> Marco
>
>
>
>
> On Nov 25, 2009, at 14:09 , Marco Antoniotti wrote:
>
>> Hi
>>
>> Months ago I posted this problem and I still do not have a solution for it.
>>
>> Note that I do not have a DTD for the file I want to parse (and I neither want to nor can write one).
>>
>> The SBML XML files I want to read start like this:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <!-- Created by Gepasi 3.30 on March 17, 2003, 12:57 -->
>> <sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
>>
>> Am I correct in assuming that CXML would be able to forgo the DTD if it could access the www.sbml.org site and find a DTD or an XSD there?
>>
>> Note that my problem is to get rid of the spurious TEXT elements.
>>
>> The URL http://www.sbml.org/sbml/level1 points to an XSD file.
>>
>> Cheers
>> --
>> Marco
>>
>>
>>
>>
>>
>> On Feb 6, 2009, at 16:34 , Marco Antoniotti wrote:
>>
>>>
>>> On Feb 6, 2009, at 10:21 , David Lichteblau wrote:
>>>
>>>> Quoting Marco Antoniotti (marcoxa at cs.nyu.edu):
>>>>> I get all my actual RUNE-DOM::ELEMENTs interleaved with "bogus" TEXT
>>>>> elements containing just #\Newline and #\Tab (or more #\Tab).
>>>>> This is obviously an artifact of parsing. (See attached figure from a
>>>>> 15-minute CXML browser I whipped up)
>>>>>
>>>>> What I do not know is (1) whether this makes sense or not, or (2)
>>>>> whether it is dependent on my platform (LWM).
>>>>
>>>> Certainly -- XML preserves whitespace in character data, except for CRLF
>>>> to LF normalization.
>>>>
>>>> There are no universally correct rules for whitespace normalization in
>>>> XML, and in general, any change to whitespace could change the meaning
>>>> of the document.
>>>>
>>>>
>>>> One rule that is relatively common is to consider whitespace
>>>> insignificant in "element content", e.g. in places where no
>>>> non-whitespace text nodes are allowed by the DTD.
>>>>
>>>> This rule is implemented by CXML:MAKE-WHITESPACE-NORMALIZER
>>>> (see http://common-lisp.net/project/cxml/sax.html#misc), which may be
>>>> helpful in your case.
>>>>
>>>> However, note that the limitation to element content means that you
>>>> actually need to write or find a DTD that matches your document.
>>>> Without a DTD, this approach doesn't work.
>>>
>>> Ok. I think I understand this. I'll try the CXML:MAKE-WHITESPACE-NORMALIZER (I need to understand how to use it first).
>>>
>>> However, let me ask you this too. The SBML XML files start like this:
>>>
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <!-- Created by Gepasi 3.30 on March 17, 2003, 12:57 -->
>>> <sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
>>>
>>> Am I correct in assuming that CXML would be able to forgo the DTD if it could access the www.sbml.org site and find a DTD or an XSD there?
>>>
>>> Pardon the naïveté of my questions, but I really do not know enough about XML.
>>>
>>> Cheers
>>> --
>>> Marco
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>>
>>>>
>>>>
>>>> Other approaches are to use HTML rules for whitespace normalization
>>>> (which are trickier to get right, though, and cxml does not provide
>>>> a ready-to-use function for this) or to discard all whitespace. It
>>>> really depends on the schema and application.
>>>>
>>>>
>>>> (Note that we would like to have some support for this in cxml, because
>>>> whitespace rules also matter for indentation, and at some point we would
>>>> like to have more flexible/correct/useful indentation modes in our
>>>> serializer. Whitespace stripping could be considered as a form of
>>>> indentation, in the sense that it is a "removal of all indentation".
>>>> But so far, I haven't found the time to implement anything in this
>>>> direction.)
>>>>
>>>>
>>>> d.
>>>
>>> --
>>> Marco Antoniotti
>>>
>>>
>>
>> --
>> Marco Antoniotti
>>
>>
>
> --
> Marco Antoniotti
>
>
>
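For the "discard all whitespace" option David mentions, which needs no DTD, one possibility is to post-process the DOM after parsing. What follows is only a minimal sketch, not something CXML ships: WHITESPACE-ONLY-P and STRIP-WHITESPACE-TEXT-NODES are hypothetical helper names, "model.xml" is a placeholder path, and the code assumes a Lisp where runes are characters, so that DOM:DATA returns a string. It is only appropriate when the schema has no mixed content in which whitespace-only text is meaningful.

```lisp
;; Assumes CXML is already loaded, e.g. via (asdf:load-system :cxml).

(defun whitespace-only-p (data)
  "True if DATA contains nothing but space, tab, newline or return."
  (every (lambda (c)
           (member c '(#\Space #\Tab #\Newline #\Return)))
         data))

(defun strip-whitespace-text-nodes (node)
  "Destructively remove whitespace-only TEXT children below NODE.
Returns NODE."
  (let ((child (dom:first-child node)))
    (loop while child
          do (let ((next (dom:next-sibling child))) ; grab before removal
               (cond ((and (eq (dom:node-type child) :text)
                           (whitespace-only-p (dom:data child)))
                      (dom:remove-child node child))
                     ((eq (dom:node-type child) :element)
                      (strip-whitespace-text-nodes child)))
               (setf child next))))
  node)

;; Usage (placeholder file name):
;; (strip-whitespace-text-nodes
;;  (cxml:parse-file "model.xml" (cxml-dom:make-dom-builder)))
```

Because this removes every whitespace-only text node regardless of context, it is cruder than the DTD-driven CXML:MAKE-WHITESPACE-NORMALIZER, which strips whitespace only in element content.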