[cxml-devel] Parsing yields TEXT elements with #\Newline #\Tab

Peter Stirling peter at pjstirling.plus.com
Sat Dec 5 16:30:45 UTC 2009


DTD (Document Type Definition) was the original validation system for
HTML and XML. It lets you specify where significant text may appear, and
which elements and attributes can be used together. This was fine for
HTML, because you aren't doing anything more than showing text to a
user, and it isn't all that difficult to implement.

However, XML (unlike HTML) was intended for inter-application data
transfer. Developers using XML this way soon ran into problems when they
wanted to say that a particular bit of text in the document is a number,
or a date, etc., which a DTD cannot express.

There were also issues with XML documents that needed to contain
elements from a variety of schema sources. If you had two DTDs that you
wanted to use together, you would have unpleasant issues if they both
defined an element with the same name.

XML Schema was one of the validation alternatives created to bridge
this gap, and it became popular enough to be integrated into a number of
XML libraries. XML Schema is, in effect, a superset of DTD in what it
can describe, but it uses a very different syntax for its files, so a
DTD is not a valid XML Schema, and an XML Schema is not a valid DTD.

All of the major Java XML bindings will validate with XML Schema, so
they will allow you to ignore insignificant whitespace, if you decide
to go that way.

Looking at http://sbml.org/Software/libSBML they claim to offer
bindings for Lisp (among other languages) to their library, which
probably makes more sense than using CXML (or any other generic XML
library) in this case.

Finally, the question that made me join the list!

> There are two technical approaches to normalize whitespace with  
> cxml's APIs:
>  - Do it on the fly, either in a SAX handler or a KLACKS source
>  - Do it after the fact in the object model or application
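
To make the first option concrete, here is a minimal sketch of the
"on the fly" idea as a SAX filter. It is written with Python's standard
xml.sax module rather than cxml, purely as an illustration; the names
DropWhitespace, Recorder and parse_without_whitespace are made up for
this example, and I am not claiming cxml's SAX API looks like this.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase


class DropWhitespace(XMLFilterBase):
    """SAX filter that swallows text events made only of whitespace.

    Caveat: a parser may deliver one run of text in several chunks, so a
    whitespace-only chunk inside mixed content would be dropped as well.
    """

    def characters(self, content):
        # Forward the event only if it has a non-whitespace character.
        if content.strip():
            super().characters(content)

    def ignorableWhitespace(self, chars):
        # Whitespace the parser already knows is ignorable: drop it.
        pass


class Recorder(xml.sax.ContentHandler):
    """Downstream handler that just records what it is sent."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))

    def characters(self, content):
        self.events.append(("text", content))


def parse_without_whitespace(xml_text):
    """Parse xml_text and return the events that survive the filter."""
    recorder = Recorder()
    reader = DropWhitespace(xml.sax.make_parser())
    reader.setContentHandler(recorder)
    reader.parse(io.BytesIO(xml_text.encode("utf-8")))
    return recorder.events
```

The filter sits between the parser and the real handler, so the handler
never sees the #\Newline #\Tab style text events at all.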

I'm still new to Lisp, but I can't see how to make a KLACKS source
that will remove whitespace. The docs mention bridging KLACKS and SAX,
but the functions they mention only appear to let you send KLACKS events
to SAX handlers, and not the other way around?

My problem is a large XML database (owned by another application) that I
want to pull data out of (and not just once). Building the DOM tree is
kinda slow, and while CXML does better than Firefox (which exhausts its
address space before it manages to render the file), it's less than
ideal, particularly when there are large numbers of whitespace nodes
that I immediately want to skip past (which is complicating my
KLACKS-based recursive descent parser).
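
What I am after is roughly the following, sketched here with Python's
standard xml.dom.pulldom pull parser instead of KLACKS, since I don't
know the cxml way to do it. The wrapper significant_events and the
helper summarize are hypothetical names for this example only; the
point is that the wrapper hides whitespace-only text events, so the
code consuming the stream never has to skip them itself.

```python
from xml.dom import pulldom


def significant_events(stream):
    """Yield (event, node) pairs, skipping whitespace-only text events."""
    for event, node in stream:
        if event == pulldom.CHARACTERS and not node.data.strip():
            continue  # whitespace between elements: not interesting
        yield event, node


def summarize(xml_text):
    """Return a flat list of the element and text events that remain."""
    out = []
    for event, node in significant_events(pulldom.parseString(xml_text)):
        if event == pulldom.START_ELEMENT:
            out.append(("start", node.tagName))
        elif event == pulldom.CHARACTERS:
            out.append(("text", node.data))
        elif event == pulldom.END_ELEMENT:
            out.append(("end", node.tagName))
    return out
```

Because the events are filtered as they are pulled, nothing is built in
memory beyond the current event, which is what matters for a file this
size.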

> On Dec 1, 2009, at 16:48 , David Lichteblau wrote:
> 
> > Quoting Marco Antoniotti (marcoxa at cs.nyu.edu):
> >> Ok, that is a lot of work on my part AFAIAC.  I think I understand
> >> the mechanics of what you are saying, but you are not answering my
> >> question.
> >>
> >> I gave you the first two lines of the SBML document.  SBML comes with
> >> a XSchema definition.  I am assuming that having the xsd will be
> >> equivalent to having the DTD (I think I am right on this) and
> >> therefore have the correct indication about what is what and how it
> >> should be parsed.
> >
> > In general, you cannot translate XML Schema into a DTD.
> 
> Ok.  It just strengthens my hunches about XML being quite messy.
> 
> >> <?xml version="1.0" encoding="UTF-8"?>
> >> <sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
> >> ...
> >> </sbml>
> >>
> >> is what I have.
> >>
> >> Can XML be coerced into accessing the
> >> xmlns="http://www.sbml.org/sbml/level1" (with DRAKMA), understanding
> >> it and using it or not?  (Thus - hopefully - stripping the TEXT
> >> elements automatically?)
> >>
> >> If yes, how?
> >
> > No, cxml does not do that.  It has no support for XML Schema at all.
> >
> 
> Ok.  But what about xerces or the Java libraries; would they do it?
> 
> Cheers
> 
> --
> Marco





