[cxml-devel] Parsing yields TEXT elements with #\Newline #\Tab

Marco Antoniotti marcoxa at cs.nyu.edu
Mon Dec 7 09:21:11 UTC 2009


On Dec 5, 2009, at 17:30 , Peter Stirling wrote:

> DTD (Document Type Definition) was the original validation system for
> HTML & XML. It allows you to specify where you can put significant
> text, and the kinds of elements and attributes that can be used
> together. This was fine for HTML, because you aren't doing anything
> more than showing text to a user, and it isn't all that difficult to
> implement.
>
> However, XML (unlike HTML) was intended for inter-application data
> transfer. Developers using XML this way soon ran into problems when
> they wanted to say that a particular bit of text in the document is a
> number, a date, etc.
>
> There were also issues with XML documents that needed to contain
> elements from a variety of schema sources. If you had two DTDs that
> you wanted to use together, you would have unpleasant issues if they
> contained an element in common.
>
> XML Schema was one of the validation alternatives created to bridge
> this gap, and it became popular enough to be integrated into a number
> of the XML libraries. XML Schema is, in effect, a superset of DTD, but
> it uses a very different syntax for its files, so a DTD is not a valid
> XML Schema, and an XML Schema is not a valid DTD.
>
> All of the major Java XML bindings will validate with XML Schema, so
> they will allow you to ignore irrelevant whitespace, if you decide to
> go that way.

Which raises the question of having XML Schema support incorporated
into CXML.  I get the feeling that this is not easy to set up, although
it seems to be TRT.
It is just sad that it is that way.

>
> The page at http://sbml.org/Software/libSBML claims to offer Lisp
> (among other languages) bindings to their library, which probably
> makes more sense than using CXML (or any other generic XML library) in
> this case.

I did the first CL parsing of SBML using CL-XML (CL-SBML on
common-lisp.net), and at that point I had to do some post-processing
massaging.  Next I tried to make a CXML port, and I stumbled upon the
same problem, which is what made me ask these questions.

There is another SBML CL binding which uses libSBML via FFI.  Maybe
that is the way to go, as libSBML does some extra validity checking.
Being where we are (a CL-related list), I'd say that that sucks a bit. :)

Cheers
--
Marco

> Finally, the question that made me join the list!
>
>> There are two technical approaches to normalize whitespace with
>> cxml's APIs:
>> - Do it on the fly, either in a SAX handler or a KLACKS source
>> - Do it after the fact in the object model or application
>
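
For the on-the-fly SAX variant, a minimal sketch might look like the
following.  It assumes cxml is loaded on a Lisp where cxml hands text
to handlers as ordinary strings.  WHITESPACE-DROPPER, WHITESPACE-ONLY-P
and PARSE-WITHOUT-BLANK-TEXT are made-up names for illustration, not
part of cxml.  Without a DTD the filter cannot know which whitespace is
significant, so it simply drops every whitespace-only text chunk, which
is only reasonable for data-oriented documents.

```lisp
;; Sketch only: assumes cxml is already loaded, e.g. (ql:quickload "cxml").

(defun whitespace-only-p (data)
  "True if DATA (assumed to be a string) holds only XML whitespace."
  (every (lambda (c) (member c '(#\Space #\Tab #\Newline #\Return)))
         data))

;; Hypothetical filter class.  CXML:SAX-PROXY forwards every SAX event
;; to its chained handler; only CHARACTERS is overridden here.
(defclass whitespace-dropper (cxml:sax-proxy)
  ())

(defmethod sax:characters ((handler whitespace-dropper) data)
  ;; Forward the event only when the chunk has non-whitespace content.
  ;; Caveat: a parser may deliver one text node in several chunks, so a
  ;; whitespace-only chunk inside mixed content would also be dropped.
  (unless (whitespace-only-p data)
    (call-next-method)))

(defun parse-without-blank-text (pathname)
  "Parse PATHNAME into a DOM, dropping whitespace-only text chunks."
  (cxml:parse-file pathname
                   (make-instance 'whitespace-dropper
                                  :chained-handler
                                  (cxml-dom:make-dom-builder))))
```

Something like (parse-without-blank-text #p"model.xml") should then
build a DOM without the #\Newline / #\Tab text nodes; the file name is
a placeholder.
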
> I'm still new to Lisp, but I can't see how to make a KLACKS source
> that will remove whitespace. The docs mention bridging KLACKS and SAX,
> but the functions they mention only appear to allow you to send
> KLACKS events to SAX handlers, and not the other way around?
>
> My problem is a large XML database (owned by another application) that
> I want to pull data out of (and not just once). Building the DOM tree
> is kinda slow, and while CXML does better than Firefox (which exhausts
> its address space before it manages to render the file), it's less
> than ideal, particularly when there are large numbers of whitespace
> nodes that I immediately want to skip past (which is complicating my
> KLACKS-based recursive descent parser).
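
One way to keep whitespace out of a KLACKS-based recursive descent
parser, without writing a filtering source, is a small helper that the
parser calls wherever it would otherwise call KLACKS:PEEK.  This is
only a sketch under assumptions: SKIP-BLANK-TEXT, BLANK-TEXT-P and
COUNT-ELEMENTS are hypothetical names, and text is assumed to arrive as
ordinary strings.

```lisp
;; Sketch only: assumes cxml is already loaded, e.g. (ql:quickload "cxml").

(defun blank-text-p (data)
  "True if DATA (assumed to be a string) holds only XML whitespace."
  (every (lambda (c) (member c '(#\Space #\Tab #\Newline #\Return)))
         data))

(defun skip-blank-text (source)
  "Consume whitespace-only :CHARACTERS events from SOURCE.
Returns the key of the next remaining event, or NIL at end of input."
  (loop
    (multiple-value-bind (key data) (klacks:peek source)
      (if (and (eq key :characters) (blank-text-p data))
          (klacks:consume source)
          (return key)))))

;; Usage example: walk a file and count start tags, never looking at
;; whitespace-only text events.
(defun count-elements (pathname)
  (klacks:with-open-source (source (cxml:make-source pathname))
    (loop for key = (skip-blank-text source)
          while key
          count (eq key :start-element)
          do (klacks:consume source))))
```

Because nothing is built in memory, this stays a streaming approach,
which matters for a file too large to hold comfortably as a DOM tree.
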
>
> On Dec 1, 2009, at 16:48 , David Lichteblau wrote:
>
>
>> On Dec 1, 2009, at 16:48 , David Lichteblau wrote:
>>
>>> Quoting Marco Antoniotti (marcoxa at cs.nyu.edu):
>>>> Ok, that is a lot of work on my part AFAIAC.  I think I understand
>>>> the mechanics of what you are saying, but you are not answering my
>>>> question.
>>>>
>>>> I gave you the first two lines of the SBML document.  SBML comes
>>>> with an XSchema definition.  I am assuming that having the xsd will
>>>> be equivalent to having the DTD (I think I am right on this) and
>>>> that it will therefore give the correct indication about what is
>>>> what and how it should be parsed.
>>>
>>> In general, you cannot translate XML Schema into a DTD.
>>
>> Ok.  It just strengthens my hunch about XML being quite messy.
>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
>>>> ...
>>>> </sbml>
>>>>
>>>> is what I have.
>>>>
>>>> Can CXML be coerced into accessing the
>>>> xmlns="http://www.sbml.org/sbml/level1" (with DRAKMA),
>>>> understanding it and using it, or not?  (Thus, hopefully, stripping
>>>> the TEXT elements automatically?)
>>>>
>>>> If yes, how?
>>>
>>> No, cxml does not do that.  It has no support for XML Schema at all.
>>>
>>
>> Ok.  But what about xerces or the Java libraries; would they do it?
>>
>> Cheers
>>
>> --
>> Marco
>
>
>

--
Marco Antoniotti





