[cxml-devel] seeking suggestion for handling malformed attribute names

Fri Oct 28 13:13:25 UTC 2011

Hi,

Quoting Russell Kliese (russell at kliese.id.au):
> I have been using closure-html and cxml to extract data from web
> pages. Up until now it has worked quite well, but I have come across
> a web page that has quite extraordinarily malformed html markup.
> 
> I am looking for suggestions on how to continue parsing even when
> attribute names are malformed. Here is a self-contained example
> containing some of the offending markup:
[...]
> This example gives CXML:WELL-FORMEDNESS-VIOLATION. I guess I could
> write a custom SAX handler to skip attributes with these errors, but
> is there a simpler solution? Should the chtml parser be patched
> to handle this case? Are there any other ways to make parsing more
> robust? Or is this markup beyond hope of repair?

Thanks for the report.

It should be the chtml parser's responsibility to clean up data
sufficiently that it would be well-formed as XML, and use of an STP
builder as in your code is a pretty good test case to ensure that this
would be the case.

When Closure was written, I think a lot of effort was put into making it
correct errors, but that logic is just not 100% complete.  We'd need
a set of good test cases.

Unfortunately I don't have a ready-to-use patch at this point; do you
have a suggestion?  There are probably some minor dependency issues, in
that some utility functions needed to check whether names are
well-formed are currently part of the XML parser, which the HTML parser
does not depend on in its ASDF system definition.  But I'm sure those
issues would be solvable.
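
To illustrate the kind of check involved, here is a minimal sketch,
written in Python rather than Lisp purely for illustration.  It is not
code from cxml or closure-html: the function names are hypothetical, and
the pattern is an ASCII-only simplification of the XML 1.0 Name
production (the real production also admits many non-ASCII characters).
The idea is simply to drop attributes whose names could never be
well-formed XML before they reach the builder.

```python
import re

# Simplified, ASCII-only approximation of the XML 1.0 Name production.
# Assumption: the real rule allows many more (non-ASCII) characters.
_NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9_:.\-]*")


def is_plausible_xml_name(name):
    """Return True if `name` matches the simplified Name pattern."""
    return _NAME_RE.fullmatch(name) is not None


def drop_malformed_attributes(attributes):
    """Keep only the (name, value) pairs whose name passes the check."""
    return [(name, value) for name, value in attributes
            if is_plausible_xml_name(name)]
```

A cleanup pass along these lines would call something like
drop_malformed_attributes on each element's attribute list.  Whether
silently dropping such attributes is the right policy, as opposed to
repairing their names, is exactly the sort of question the test cases
would have to settle.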

d.