[cxml-devel] Parsing yields TEXT elements with #\Newline #\Tab
Marco Antoniotti
marcoxa at cs.nyu.edu
Tue Dec 1 15:36:19 UTC 2009
On Dec 1, 2009, at 15:55 , David Lichteblau wrote:
> Quoting Marco Antoniotti (marcoxa at cs.nyu.edu):
>> 2 - since this is not CXML default behavior, is there a way to get
>> CXML to do the "obvious" thing?
>
> There is no single obvious thing. You need to define which kind of
> whitespace stripping you want.
AFAIU the DTD specifies how to deal with whitespace. The examples in
the documentation seem to say that.
> a. Strip all text nodes, including those that have non-whitespace in
> them?
>
> b. Strip all text nodes that are made up of whitespace exclusively?
>
> c. Take text nodes that have non-whitespace and whitespace, and remove
> the whitespace from them while keeping the non-whitespace?
>
> d. Same as c, but "compress" such whitespace rather than removing it
> entirely?
>
> e. Choose between c and d depending on what the parent element is?
>
> f. Do b only depending on what the parent element is?
>
> Case study:
>
> - XSLT basically does b, with a couple of customization features.
>
> - HTML does e
>
> - the DTD-based thing is f
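To make the difference between options b, c and d concrete, here is a minimal sketch on plain strings. It is written in Python rather than with cxml, the function names are hypothetical, and it takes one possible reading of option c (delete all whitespace, not just the leading and trailing runs).

```python
import re

# XML 1.0 whitespace is space, tab, carriage return and line feed.
XML_WS = " \t\r\n"


def is_whitespace_only(text):
    """Option b test: true when the text node holds nothing but whitespace."""
    return text.strip(XML_WS) == ""


def remove_whitespace(text):
    """Option c (one reading): delete every whitespace character."""
    return re.sub(r"[ \t\r\n]+", "", text)


def compress_whitespace(text):
    """Option d: collapse each whitespace run to one space, trim the ends."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


sample = "\n\t  two   words\n"
print(is_whitespace_only(sample))          # the node has real content
print(repr(remove_whitespace(sample)))     # option c
print(repr(compress_whitespace(sample)))   # option d
```

Options e and f would wrap these helpers in a check on the parent element, which is exactly the part that depends on the application.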
>
>> I know that I could possibly remove the TEXT elements by hand, after
>> having built the internal structure; but it does not feel right.
>
> There are two technical approaches to normalize whitespace with
> cxml's APIs:
> - Do it on the fly, either in a SAX handler or a KLACKS source
> - Do it after the fact in the object model or application
>
> The DTD-based thing is implemented as a SAX handler (first approach),
> see cxml/xml/space-normalizer.lisp
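As an illustration of the "on the fly" approach, here is a minimal sketch using Python's standard xml.sax module, not cxml's SAX API. It drops whitespace-only text unconditionally (option b), whereas the DTD-based normalizer mentioned above decides per parent element; the class and function names are hypothetical.

```python
import xml.sax


class WhitespaceDroppingHandler(xml.sax.ContentHandler):
    """Records parse events, skipping text that is only whitespace."""

    def __init__(self):
        super().__init__()
        self.events = []    # (kind, value) tuples in document order
        self._buffer = []   # character chunks seen since the last tag

    def _flush(self):
        # The parser may deliver one text run in several chunks, so join
        # them before deciding whether the run is whitespace-only.
        text = "".join(self._buffer)
        self._buffer = []
        if text.strip(" \t\r\n"):
            self.events.append(("text", text))

    def startElement(self, name, attrs):
        self._flush()
        self.events.append(("start", name))

    def endElement(self, name):
        self._flush()
        self.events.append(("end", name))

    def characters(self, content):
        self._buffer.append(content)


def parse_events(xml_bytes):
    """Parse a bytes document and return the filtered event list."""
    handler = WhitespaceDroppingHandler()
    xml.sax.parseString(xml_bytes, handler)
    return handler.events


print(parse_events(b"<a>\n\t<b>hi</b>\n</a>"))
```

Because the filtering happens while events stream past, no whitespace-only node is ever built and nothing needs cleaning up afterwards.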
>
> XSLT-style normalization is available in Xuriella XSLT, implemented
> using STP; see the function STRIP-STYLESHEET in xuriella/space.lisp.
>
> Note that both implementation types I listed above are done entirely in
> user code. You don't need to change cxml to implement yet another
> variety of whitespace stripping.
>
> Just copy&paste the code and change it to suit your needs -- or rewrite
> it. STRIP-STYLESHEET is a total of 23 lines of code long, I think.
>
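For comparison, the "after the fact" approach can be sketched as a walk over an already-built tree that removes whitespace-only text nodes (option b). This uses Python's xml.dom.minidom as a stand-in for a DOM or STP tree; it is not Xuriella's STRIP-STYLESHEET, and the function names are hypothetical.

```python
from xml.dom import minidom


def strip_whitespace_nodes(node):
    """Remove whitespace-only text nodes below node, in place."""
    # Iterate over a copy, because removeChild mutates node.childNodes.
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if child.data.strip(" \t\r\n") == "":
                node.removeChild(child)
        else:
            strip_whitespace_nodes(child)
    return node


def parse_and_strip(xml_text):
    """Parse a string into a DOM document, then strip it."""
    document = minidom.parseString(xml_text)
    strip_whitespace_nodes(document)
    return document


doc = parse_and_strip('<sbml level="1">\n\t<model name="m"/>\n</sbml>')
print(doc.documentElement.toxml())
```

Text nodes that mix whitespace with other characters are left untouched here; handling them would mean picking one of options c to e.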
Ok, that is a lot of work on my part AFAIAC. I think I understand the
mechanics of what you are saying, but you are not answering my question.
I gave you the first two lines of the SBML document. SBML comes with an
XML Schema (XSD) definition. I am assuming that having the XSD will be
equivalent to having the DTD (I think I am right on this) and
therefore gives the correct indication about what is what and how it
should be parsed.
<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
...
</sbml>
is what I have.
Can CXML be coerced into accessing xmlns="http://www.sbml.org/sbml/level1"
(with DRAKMA), understanding it and using it, or not? (Thus, hopefully,
stripping the TEXT elements automatically?)
If yes, how?
IMHO, it would be quite a plus to be able to deal with a case like
this automatically (i.e., SBML) without much user intervention,
especially as a post-processing step.
Cheers
--
Marco Antoniotti