[cxml-devel] Parsing yields TEXT elements with #\Newline #\Tab

Marco Antoniotti marcoxa at cs.nyu.edu
Tue Dec 1 15:36:19 UTC 2009


On Dec 1, 2009, at 15:55, David Lichteblau wrote:

> Quoting Marco Antoniotti (marcoxa at cs.nyu.edu):
>> 2 - since this is not CXML default behavior, is there a way to get
>> CXML to do the "obvious" thing?
>
> There is no single obvious thing.  You need to define which kind of
> whitespace stripping you want.

AFAIU the DTD specifies how to deal with whitespace.  The examples in
the documentation seem to say that.

>  a. Strip all text nodes, including those that have non-whitespace in
>     them?
>
>  b. Strip all text nodes that are made up of whitespace exclusively?
>
>  c. Take text nodes that have non-whitespace and whitespace, and remove
>     the whitespace from them while keeping the non-whitespace?
>
>  d. Same as c, but "compress" such whitespace rather than removing it
>     entirely?
>
>  e. Choose between c and d depending on what the parent element is?
>
>  f. Do b only depending on what the parent element is?
>
> Case study:
>
>  - XSLT basically does b, with a couple of customization features.
>
>  - HTML does e
>
>  - the DTD-based thing is f
>
>> I know that I could possibly remove the TEXT elements by hand, after
>> having built the internal structure; but it does not feel right.
>
> There are two technical approaches to normalize whitespace with cxml's
> APIs:
>  - Do it on the fly, either in a SAX handler or a KLACKS source
>  - Do it after the fact in the object model or application
>
> The DTD-based thing is implemented as a SAX handler (first approach),
> see cxml/xml/space-normalizer.lisp
>
> XSLT-style normalization is available in Xuriella XSLT, implemented
> using STP; see the function STRIP-STYLESHEET in xuriella/space.lisp.
>
> Note that both implementation types I listed above are done entirely in
> user code.  You don't need to change cxml to implement yet another
> variety of whitespace stripping.
>
> Just copy&paste the code and change it to suit your needs -- or rewrite
> it.  STRIP-STYLESHEET is a total of 23 lines of code long, I think.
>

Ok, that is a lot of work on my part AFAIAC.  I think I understand the  
mechanics of what you are saying, but you are not answering my question.

I gave you the first two lines of the SBML document.  SBML comes with an
XSchema definition.  I am assuming that having the xsd will be
equivalent to having the DTD (I think I am right on this) and that it
therefore gives the correct indication about what is what and how it
should be parsed.

<?xml version="1.0" encoding="UTF-8"?>
<sbml xmlns="http://www.sbml.org/sbml/level1" level="1" version="1">
...
</sbml>

is what I have.

Can CXML be coerced into accessing the xmlns="http://www.sbml.org/sbml/level1"
(with DRAKMA), understanding it and using it or not?  (Thus, hopefully,
stripping the TEXT elements automatically?)

If yes, how?

IMHO, it would be quite a plus to be able to deal with a case like  
this automatically (i.e., SBML) without much user intervention,  
especially as a post-processing step.

Cheers

--
Marco Antoniotti

