In my application I have to deal with all sorts of garbage that poses as HTML. In this particular case, I had an email that was encoded in Windows-874. The document had the following header:<br><div class="gmail_quote"><div>
<br></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px">
<div><font face="courier new, monospace">&lt;html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"&gt;&lt;head&gt;&lt;meta http-equiv=Content-Type <b><font color="#ff0000">content="text/html; charset=windows-874"</font></b>&gt;&lt;meta name=Generator content="Microsoft Word 12 (filtered medium)"&gt;</font></div>
</blockquote><div><br></div><div>The document then contained some text in Thai. When I attempted to parse this document, Closure-HTML gave me the following error:</div><div><br></div><blockquote style="margin:0 0 0 40px;border:none;padding:0px">
<div><font face="courier new, monospace">Corrupted UTF-8 input (initial byte was #b10111110)</font></div></blockquote><div><br></div><div>This error is to be expected, since the input isn't actually UTF-8. It seems that Closure-HTML assumes UTF-8 regardless of what the document declares. The proper way to handle this case would be to detect the charset given in the "content" attribute of the "meta" tag and reparse the text using the correct encoding.</div>
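<div><br></div><div>To illustrate what I mean, here is a rough sketch of the detect-and-reparse idea. It is written in Python rather than Lisp and has nothing to do with Closure-HTML's actual API. The names sniff_charset, decode_html and ALIASES are made up for this example, and the explicit mapping of "windows-874" to Python's cp874 codec is my own assumption about how to name that encoding:</div>

```python
import codecs
import re

# Hypothetical alias table: maps a declared charset label to a Python codec name.
ALIASES = {"windows-874": "cp874"}

# Matches charset=... inside a <meta> tag. The declaration itself is ASCII,
# so it can be searched for in the raw bytes before any decoding happens.
META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def sniff_charset(raw: bytes, default: str = "utf-8") -> str:
    """Return the codec name declared in a <meta> tag, or the default."""
    match = META_CHARSET.search(raw[:4096])  # only scan the start of the document
    if match is None:
        return default
    label = match.group(1).decode("ascii").lower()
    try:
        return codecs.lookup(ALIASES.get(label, label)).name
    except LookupError:
        return default  # unknown label: fall back instead of failing


def decode_html(raw: bytes) -> str:
    """Decode the document using the declared charset before parsing it."""
    return raw.decode(sniff_charset(raw), errors="replace")
```

<div>With something like this in front of the parser, the byte from the error message (#b10111110, i.e. 0xBE) would be decoded as a Thai character instead of being rejected as corrupted UTF-8.</div>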
<div><br></div><div>Is this something that can be easily done? I haven't looked at the necessary code changes to do this myself (yet).</div><div><br></div><div>Regards,</div><div>Elias</div><div><br></div>
</div><br>