[cxml-devel] seeking suggestion for handling malformed attribute names

David Lichteblau david at lichteblau.com
Sat Oct 29 11:21:26 UTC 2011


Hi,

Quoting Russell Kliese (russell at kliese.id.au):
> It may be possible, however, to solve the problem without having to
> perfect the parser. I found that HTML Tidy can deal with the
> malformed attributes if you use the option that removes proprietary
> attributes. This way only a set of known good attributes are passed
> to the SAX builder. Would this be appropriate for Closure HTML?

I've been using HTML Tidy for several years to clean up HTML.  It's a
very extensive solution, and was helpful to us at the time.  However, I
had to modify its code to fix some of its problems.  The basic issue
seems to be that Tidy is too aggressive: it does not agree with the more
gentle cleanup performed by actual browsers (and as specified by
HTML5).  So if you want to pass W3C validators for HTML 4 Strict, then
Tidy will indeed give output which achieves that goal.  But that output
won't actually look like the input when rendered by an actual
browser...

None of that is an excuse for Closure HTML to be too lenient though;
lack of time has prevented me so far from migrating my HTML Tidy-based
codebase to Closure HTML, which would entail working through tests for
proper HTML5-style cleanup.


d.



