[cxml-devel] seeking suggestion for handling malformed attribute names

Russell Kliese russell at kliese.id.au
Fri Oct 28 12:37:48 UTC 2011


I have been using closure-html and cxml to extract data from web pages.
Up until now this has worked quite well, but I have come across a web
page whose HTML markup is extraordinarily malformed.

I am looking for suggestions on how to continue parsing even when 
attribute names are malformed. Here is a self-contained example 
containing some of the offending markup:

(require :closure-html)
(require :cxml-stp)
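;; Minimal document containing the malformed attribute names
;; ("vertical-align:" and ";") taken from the offending page.
(defparameter *html-str* "<html>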
<body>
<table>
<tr><td vertical-align:=\"\" ;=\"\"></td></tr>
</table>
</body>
</html>
")
(chtml:parse *html-str* (cxml-stp:make-builder))

This example signals CXML:WELL-FORMEDNESS-VIOLATION. I guess I could
write a custom SAX handler to skip attributes with these errors, but is
there a simpler solution? Should the chtml parser be patched to handle
this case? Are there any other ways to make parsing more robust? Or is
this markup beyond hope of repair?

Regards,

Russell



