[cxml-devel] Parsing yields TEXT elements with #\Newline #\Tab

Marco Antoniotti marcoxa at cs.nyu.edu
Tue Dec 1 14:38:19 UTC 2009


Dear Cyrus


On Dec 1, 2009, at 06:39 , Cyrus Harmon wrote:

>
> Marco,
>
> Maybe the problem isn't the liveness of the list (although,
> yes, it isn't particularly active), but rather the phrasing of your
> question:
>
> "Am I correct in assuming that CXML would be able to forgo the DTD  
> if it could access the www.sbml.org site and find a DTD or a XSD  
> there?"
>
> What are you suggesting CXML should do by ignoring (if that's what
> you mean by forgo) a DTD that it may or may not find at some site?
>
> Do you have a URL to a SBML file with which the folks at home can  
> try to convince CXML to forgo the DTD (whatever that means)?
>
> thanks,

You are right.  I am not very well versed in XML, so my question
was not well phrased.

Let's recapitulate: AFAIU, there is a way to tell CXML to use a
particular DTD in order to make CXML:MAKE-WHITESPACE-NORMALIZER work as
expected, i.e., drop the extra TEXT elements corresponding to
newlines and indentation.  There is also a way to give CXML a
"resolver" that downloads entities from the net using DRAKMA (or
another HTTP client library).
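
For concreteness, the DTD-based route could look roughly like the sketch below.  It assumes a DTD matching the document is available locally; "sbml.dtd" and "model.xml" are hypothetical file names (I do not know of an actual SBML DTD), and the helper name PARSE-WITHOUT-INDENTATION is made up for illustration.  The calls themselves are the ones I understand the CXML documentation to describe.

```lisp
;; Sketch only: the file names in the usage comment are hypothetical.
;; Assumes CXML is already loaded, e.g. (asdf:operate 'asdf:load-op :cxml).
(defun parse-without-indentation (xml-file dtd-file)
  "Parse XML-FILE into a DOM, dropping whitespace-only text in element
content as determined by the DTD read from DTD-FILE."
  (let ((dtd (cxml:parse-dtd-file dtd-file)))
    (cxml:parse-file xml-file
                     ;; The normalizer is a SAX proxy sitting in front of
                     ;; the DOM builder; the DTD is passed explicitly
                     ;; because the document has no DOCTYPE of its own.
                     (cxml:make-whitespace-normalizer
                      (cxml-dom:make-dom-builder)
                      dtd))))

;; Hypothetical usage:
;; (parse-without-indentation #p"model.xml" #p"sbml.dtd")
```

Only elements that the DTD declares as having element content would be affected; whitespace in mixed or undeclared content should be left alone.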

Now, in the case of SBML I do not have a DTD.  I have a file that  
starts as

<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">

AFAIU, the xmlns="http://www.sbml.org/sbml/level1" declaration should make
it possible for CXML to fetch that document (which turns out to be an .xsd),
maybe using DRAKMA, and then parse the rest of the file,
automagically dropping the TEXT elements.  In this sense this would
make CXML "ignore" the DTD.

So my questions are

1 - Is this a correct assumption?
2 - Since this is not CXML's default behavior, is there a way to get
CXML to do the "obvious" thing?

I know that I could possibly remove the TEXT elements by hand, after  
having built the internal structure; but it does not feel right.
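
To make the "by hand" option concrete, here is a minimal sketch of what I have in mind.  The function names WHITESPACE-ONLY-P and STRIP-WHITESPACE-TEXT are made up for illustration; the sketch assumes CXML is loaded and that DOM:DATA returns an ordinary Lisp string (which I believe holds when CXML is built with runes as characters).

```lisp
;; Sketch only; assumes CXML is loaded and DOM:DATA returns a Lisp string.
(defun whitespace-only-p (string)
  "True if STRING consists solely of space, tab, newline or return."
  (every (lambda (c) (member c '(#\Space #\Tab #\Newline #\Return)))
         string))

(defun strip-whitespace-text (node)
  "Destructively remove whitespace-only TEXT nodes below NODE; return NODE."
  (let* ((children (dom:child-nodes node))
         ;; Snapshot the children first: DOM node lists are live, so
         ;; removing nodes while walking the list itself could skip some.
         (snapshot (loop for i below (dom:length children)
                         collect (dom:item children i))))
    (dolist (child snapshot node)
      (case (dom:node-type child)
        (:text (when (whitespace-only-p (dom:data child))
                 (dom:remove-child node child)))
        (:element (strip-whitespace-text child))))))

;; Hypothetical usage on a tiny in-memory document:
(let ((doc (cxml:parse (format nil "<a>~%~C<b>x</b>~%</a>" #\Tab)
                       (cxml-dom:make-dom-builder))))
  (strip-whitespace-text (dom:document-element doc))
  doc)
```

Note that this discards every whitespace-only text node, including any in mixed content where the whitespace might be significant, so it is only safe for formats where that never matters.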

If reading an XML Schema file results in a DTD, then CXML would not really
"ignore it".  I am just wondering whether this is possible and where exactly
to look in the code base.

Thanks
--
Marco


>
> Cyrus
>
> On Nov 30, 2009, at 3:53 PM, Marco Antoniotti wrote:
>
>> Well...  Any ideas on this, or is the list dead?
>>
>> Cheers
>>
>> Marco
>>
>>
>>
>>
>> On Nov 25, 2009, at 14:09 , Marco Antoniotti wrote:
>>
>>> Hi
>>>
>>> months ago I posted this problem and I still do not have a  
>>> solution for it.
>>>
>>> Note that I do not have a DTD for the file I want to parse (and I  
>>> don't want and cannot write it).
>>>
>>> The SBML XML files I want to read start like this:
>>>
>>> <?xml version="1.0" encoding="UTF-8"?>
>>> <!-- Created by Gepasi 3.30 on March 17, 2003, 12:57 -->
>>> <sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
>>>
>>> Am I correct in assuming that CXML would be able to forgo the DTD  
>>> if it could access the www.sbml.org site and find a DTD or a XSD  
>>> there?
>>>
>>> Note that my problem is to get rid of the spurious TEXT elements.
>>>
>>> The http://www.sbml.org/sbml/level1 points to a xsd file.
>>>
>>> Cheers
>>> --
>>> Marco
>>>
>>>
>>>
>>>
>>>
>>> On Feb 6, 2009, at 16:34 , Marco Antoniotti wrote:
>>>
>>>>
>>>> On Feb 6, 2009, at 10:21 , David Lichteblau wrote:
>>>>
>>>>> Quoting Marco Antoniotti (marcoxa at cs.nyu.edu):
>>>>>> I get all my actual RUNE-DOM::ELEMENTs interleaved with "bogus" TEXT
>>>>>> elements containing just #\Newline and #\Tab (or more #\Tab).
>>>>>> This is obviously an artifact of parsing.  (See attached figure from a
>>>>>> 15-minute CXML browser I whipped up.)
>>>>>>
>>>>>> What I do not know is (1) whether this makes sense or not, or (2)
>>>>>> whether it is dependent on my platform (LWM).
>>>>>
>>>>> Certainly -- XML preserves whitespace in character data, except for CRLF
>>>>> to LF normalization.
>>>>>
>>>>> There are no universally correct rules for whitespace normalization in
>>>>> XML, and in general, any change to whitespace could change the meaning
>>>>> of the document.
>>>>>
>>>>>
>>>>> One rule that is relatively common is to consider whitespace
>>>>> insignificant in "element content", e.g. in places where no
>>>>> non-whitespace text nodes are allowed by the DTD.
>>>>>
>>>>> This rule is implemented by CXML:MAKE-WHITESPACE-NORMALIZER
>>>>> (see http://common-lisp.net/project/cxml/sax.html#misc), which may be
>>>>> helpful in your case.
>>>>>
>>>>> However, note that the limitation to element content means that you
>>>>> actually need to write or find a DTD that matches your document.
>>>>> Without a DTD, this approach doesn't work.
>>>>
>>>> Ok.  I think I understand this.  I'll try
>>>> CXML:MAKE-WHITESPACE-NORMALIZER (I need to understand how to use it first).
>>>>
>>>> However, let me ask you this too.  The SBML XML files start like  
>>>> this:
>>>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <!-- Created by Gepasi 3.30 on March 17, 2003, 12:57 -->
>>>> <sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
>>>>
>>>> Am I correct in assuming that CXML would be able to forgo the DTD  
>>>> if it could access the www.sbml.org site and find a DTD or a XSD  
>>>> there?
>>>>
>>>> Pardon the naïveté of my questions, but I really do not know  
>>>> enough about XML.
>>>>
>>>> Cheers
>>>> --
>>>> Marco
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>
>>>>>
>>>>>
>>>>> Other approaches are to use HTML rules for whitespace normalization
>>>>> (which are trickier to get right though, and cxml does not provide
>>>>> a ready-to-use function for this) or to discard all whitespace.  It
>>>>> really depends on the schema and application.
>>>>>
>>>>>
>>>>> (Note that we would like to have some support for this in cxml, because
>>>>> whitespace rules also matter for indentation, and at some point we would
>>>>> like to have more flexible/correct/useful indentation modes in our
>>>>> serializer.  Whitespace stripping could be considered as a form of
>>>>> indentation, in the sense that it is a "removal of all indentation".
>>>>> But so far, I haven't found the time to implement anything in this
>>>>> direction.)
>>>>>
>>>>>
>>>>> d.
>>>>
>>>> --
>>>> Marco Antoniotti
>>>>
>>>>
>>>
>>> --
>>> Marco Antoniotti
>>>
>>>
>>
>> --
>> Marco Antoniotti
>>
>>
>>
>
>

--
Marco Antoniotti

