[cxml-devel] seeking suggestion for handling malformed attribute names

Russell Kliese russell at kliese.id.au
Sat Oct 29 10:43:26 UTC 2011


Hi David,

Thanks for your reply.

It sounds like I would need to do some careful reading of the HTML and 
XML specs in order to come up with a patch that gets the parsing right.

It may be possible, however, to solve the problem without having to 
perfect the parser. I found that HTML Tidy can deal with the malformed 
attributes if you use the option that removes proprietary attributes. 
This way only a set of known-good attributes is passed to the SAX 
builder. Would this be appropriate for Closure HTML?

If you care to look at Tidy (I was using HTML Tidy for Linux released on 
25 March 2009), here is an example invocation:

echo '<html>
<body>
<table>
<tr><td vertical-align:="" ;=""></td></tr>
</table>
</body>
</html>
' | tidy -q -asxml --force-output yes --drop-proprietary-attributes yes
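
To make the idea concrete, here is a rough Python sketch of that kind of 
filtering. It is not Tidy's or Closure HTML's actual code: the allowlist 
and the function name are made up for illustration, and the name check is 
a simplified ASCII-only approximation of the XML Name rule that also 
rejects colons.

```python
import re

# Simplified, ASCII-only approximation of an XML Name (assumption: the real
# production in the XML spec also allows colons and many non-ASCII characters).
XML_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_.\-]*\Z")

# Hypothetical allowlist for illustration only; a real one would be much larger.
KNOWN_TD_ATTRIBUTES = {"align", "valign", "colspan", "rowspan", "class", "id", "style"}


def filter_attributes(attrs, known=KNOWN_TD_ATTRIBUTES):
    """Keep only (name, value) pairs whose name is well formed and known."""
    kept = []
    for name, value in attrs:
        if XML_NAME.match(name) and name.lower() in known:
            kept.append((name, value))
    return kept


# The two malformed attributes from the <td> above, plus one good attribute.
attrs = [("vertical-align:", ""), (";", ""), ("colspan", "2")]
print(filter_attributes(attrs))
```

Only attributes that pass both checks would reach the SAX builder, so the 
malformed names never have to be parsed "correctly" at all.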

Regards,

Russell

On 28/10/11 23:13, David Lichteblau wrote:
> Quoting Russell Kliese (russell at kliese.id.au):
>> I am looking for suggestions on how to continue parsing even when
>> attribute names are malformed.
> When Closure was written, I think a lot of effort was put into making it
> correct errors, but that logic is just not 100% complete.  We'd need
> a set of good test cases.
>
> Unfortunately I don't have a ready-to-use patch at this point; do you
> have a suggestion?





