[cxml-devel] seeking suggestion for handling malformed attribute names
Russell Kliese
russell at kliese.id.au
Fri Oct 28 12:37:48 UTC 2011
I have been using closure-html and cxml to extract data from web pages.
Up until now it has worked quite well, but I have come across a web page
that has quite extraordinarily malformed HTML markup.
I am looking for suggestions on how to continue parsing even when
attribute names are malformed. Here is a self-contained example
containing some of the offending markup:
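;; Load the HTML parser and the STP document builder.
(require :closure-html)
(require :cxml-stp)
;; Minimal markup that reproduces the problem: the td element carries
;; two attributes whose names ("vertical-align:" and ";") are malformed.
(defparameter *html-str* "<html>
<body>
<table>
<tr><td vertical-align:=\"\" ;=\"\"></td></tr>
</table>
</body>
</html>
")
;; Parse the string into an STP document; this is the call that fails.
(chtml:parse *html-str* (cxml-stp:make-builder))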
This example signals CXML:WELL-FORMEDNESS-VIOLATION. I guess I could write
a custom SAX handler to skip attributes with these errors, but is there
a simpler solution? Should the chtml parser be patched to handle
this case? Are there any other ways to make parsing more robust? Or is
this markup beyond hope of repair?
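For reference, the custom SAX handler I have in mind would look roughly
like the sketch below. It is untested and rests on a few assumptions:
that the error is signalled by the STP builder rather than inside chtml
itself, that attribute names arrive as ordinary Lisp strings (as on a
Unicode-capable Lisp), and that a crude name check is good enough. The
names PLAUSIBLE-ATTRIBUTE-NAME-P, ATTRIBUTE-FILTER and PARSE-LENIENTLY are
my own inventions, not part of cxml or closure-html.

```lisp
(require :closure-html)
(require :cxml-stp)

;; Hypothetical helper: a conservative approximation of an XML name.
;; It accepts a letter or underscore followed by letters, digits,
;; "-", "_" or ".", and rejects everything else (including colons).
(defun plausible-attribute-name-p (name)
  (and (plusp (length name))
       (let ((first (char name 0)))
         (or (alpha-char-p first) (char= first #\_)))
       (every (lambda (c) (or (alphanumericp c) (find c "-_.")))
              name)))

;; A SAX proxy that forwards all events to its chained handler.
(defclass attribute-filter (cxml:sax-proxy) ())

;; Drop attributes with implausible names before the element start
;; event reaches the chained handler.
(defmethod sax:start-element ((handler attribute-filter)
                              uri lname qname attrs)
  (call-next-method handler uri lname qname
                    (remove-if-not #'plausible-attribute-name-p attrs
                                   :key #'sax:attribute-qname)))

;; Hypothetical convenience wrapper: parse HTML into an STP document
;; with the filter placed in front of the builder.
(defun parse-leniently (html)
  (chtml:parse html
               (make-instance 'attribute-filter
                              :chained-handler (cxml-stp:make-builder))))
```

With this in place, (parse-leniently *html-str*) would replace the call
to chtml:parse above. The name check is deliberately stricter than the
XML specification, so it would also discard namespaced attributes such
as xml:lang; that seems acceptable for scraping, but it is a trade-off.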
Regards,
Russell