[cxml-devel] Parsing yields TEXT elements with #\Newline #\Tab

David Lichteblau david at lichteblau.com
Tue Dec 1 14:55:57 UTC 2009


Quoting Marco Antoniotti (marcoxa at cs.nyu.edu):
> 2 - since this is not CXML default behavior, is there a way to get
> CXML to do the "obvious" thing?

There is no single obvious thing.  You need to define which kind of
whitespace stripping you want.

  a. Strip all text nodes, including those that have non-whitespace in
     them?

  b. Strip all text nodes that are made up of whitespace exclusively?

  c. Take text nodes that have non-whitespace and whitespace, and remove
     the whitespace from them while keeping the non-whitespace?

  d. Same as c, but "compress" such whitespace rather than removing it
     entirely?

  e. Choose between c and d depending on what the parent element is?

  f. Do b, but only for certain parent elements?
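
To make the difference between b, c and d concrete, here is a minimal
sketch in Python rather than Lisp. It is not cxml code. The function names
are made up for illustration, text nodes are modelled as plain strings, and
c is read literally as "delete every whitespace character".

```python
import re

_WS = " \t\r\n"  # the four XML whitespace characters


def is_whitespace_only(text):
    """True if text consists solely of XML whitespace (also true for '')."""
    return text.strip(_WS) == ""


def strip_whitespace_only(nodes):
    """Option b: drop text nodes made up of whitespace exclusively."""
    return [t for t in nodes if not is_whitespace_only(t)]


def remove_whitespace(text):
    """Option c (read literally): delete every whitespace character."""
    return re.sub(r"[ \t\r\n]+", "", text)


def compress_whitespace(text):
    """Option d: collapse each whitespace run to one space, trim the ends."""
    return re.sub(r"[ \t\r\n]+", " ", text).strip(" ")


# Hypothetical text nodes as a parser might report them.
nodes = ["\n\t", "Hello,\n\tworld", "\n"]
print(strip_whitespace_only(nodes))
print(remove_whitespace(nodes[1]))
print(compress_whitespace(nodes[1]))
```

Option b leaves the surviving node untouched, including its embedded
newline and tab; c and d change the text itself, which is why they are a
separate decision.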

Case studies:

  - XSLT basically does b, with a couple of customization features.

  - HTML does e

  - the DTD-based thing is f
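
The parent-dependent variant f could look roughly like the following
Python sketch, which uses the standard library's minidom instead of cxml.
The set of element names passed in is a hypothetical stand-in for what a
DTD would declare as having element-only content; the function name is
also made up.

```python
from xml.dom import minidom


def strip_by_parent(node, element_only):
    """Remove whitespace-only text children, but only below elements
    whose name is listed in element_only (assumed to come from a DTD)."""
    for child in list(node.childNodes):  # copy: we mutate while iterating
        if child.nodeType == child.ELEMENT_NODE:
            strip_by_parent(child, element_only)
        elif (child.nodeType == child.TEXT_NODE
              and node.nodeType == node.ELEMENT_NODE
              and node.tagName in element_only
              and child.data.strip(" \t\r\n") == ""):
            node.removeChild(child)


doc = minidom.parseString(
    "<list>\n\t<item> a </item>\n\t<pre>\n</pre>\n</list>")
strip_by_parent(doc.documentElement, {"list"})
print([c.nodeName for c in doc.documentElement.childNodes])
```

Only the whitespace directly below <list> goes away. The text inside
<item> and <pre> is kept as it is, because those parents are not in the
set.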

> I know that I could possibly remove the TEXT elements by hand, after
> having built the internal structure; but it does not feel right.

There are two technical approaches to normalizing whitespace with cxml's APIs:
  - Do it on the fly, in either a SAX handler or a KLACKS source.
  - Do it after the fact, in the object model or in the application.

The DTD-based thing is implemented as a SAX handler (the first approach);
see cxml/xml/space-normalizer.lisp.
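
As an illustration of the on-the-fly approach in general, here is a sketch
of a SAX filter using Python's xml.sax rather than cxml. It does the
simpler variant b, not the DTD-based normalization, and it is not how
space-normalizer.lisp works. The class and function names are made up.
Character data is buffered because a SAX parser may deliver one text node
in several chunks.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase, XMLGenerator


class WhitespaceOnlyTextFilter(XMLFilterBase):
    """Sits between parser and consumer; drops whitespace-only text runs."""

    def __init__(self, parent=None):
        super().__init__(parent)
        self._chunks = []

    def _flush(self):
        # Forward the buffered text only if it has non-whitespace in it.
        text = "".join(self._chunks)
        self._chunks = []
        if text.strip(" \t\r\n"):
            super().characters(text)

    def characters(self, content):
        self._chunks.append(content)

    def startElement(self, name, attrs):
        self._flush()
        super().startElement(name, attrs)

    def endElement(self, name):
        self._flush()
        super().endElement(name)

    def processingInstruction(self, target, data):
        self._flush()
        super().processingInstruction(target, data)


def strip_on_the_fly(xml_text):
    """Parse xml_text and re-serialize it through the filter."""
    out = io.StringIO()
    flt = WhitespaceOnlyTextFilter(xml.sax.make_parser())
    flt.setContentHandler(XMLGenerator(out, encoding="utf-8"))
    flt.parse(io.StringIO(xml_text))
    return out.getvalue()


print(strip_on_the_fly("<a>\n\t<b>x y</b>\n</a>"))
```

The consumer at the end of the chain (here a serializer, but it could be a
DOM builder) never sees the whitespace-only text, so nothing has to be
removed afterwards.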

XSLT-style normalization is available in Xuriella XSLT, implemented
using STP; see the function STRIP-STYLESHEET in xuriella/space.lisp.

Note that both kinds of implementation listed above are done entirely in
user code.  You don't need to change cxml to implement yet another
variety of whitespace stripping.

Just copy and paste the code and change it to suit your needs -- or rewrite
it.  STRIP-STYLESHEET is only 23 lines of code in total, I think.


d.



