From lokedhs at gmail.com  Tue Sep 11 10:32:51 2012
From: lokedhs at gmail.com (Elias Mårtenson)
Date: Tue, 11 Sep 2012 18:32:51 +0800
Subject: [closure-devel] Parsing Outlook HTML emails

I am currently faced with the task of parsing HTML emails generated by
Outlook. My frustrations with that thing could fill an entire email of
their own, so I won't go into them.

Anyway, one thing it keeps doing is creating lots of non-standard tags
of the form and the like. The problem is that when Closure-HTML parses
these, they end up like this: "#BAD TAGp>".

I worked around the problem by adding the following check to the
function NAME-RUNE-P: (rune= char #/:). This accepts the colon as a
valid character in a node name, and thus causes such nodes to be
ignored in the generated output.

Would it be reasonable to include this fix in an update to
Closure-HTML?

Regards,
Elias

From lokedhs at gmail.com  Tue Sep 11 10:33:42 2012
From: lokedhs at gmail.com (Elias Mårtenson)
Date: Tue, 11 Sep 2012 18:33:42 +0800
Subject: [closure-devel] Dealing with HTML in a non-UTF-8 encoding

In my application I have to deal with all sorts of garbage that poses
as HTML. In this particular case, I had an email that was encoded in
Windows-874. The document had the following header:

The document then contained some text in Thai.

When attempting to parse this document, Closure-HTML gave me the
following error:

    Corrupted UTF-8 input (initial byte was #b10111110)

This error is to be expected, since the input isn't actually in UTF-8.
It seems that Closure-HTML assumes UTF-8 input without checking the
encoding the document declares.

The proper way to handle this case would be to detect the "content"
attribute in the "meta" tag and reparse the text using the correct
encoding. Is this something that can be easily done?
I haven't looked at the necessary code changes to do this myself (yet).

Regards,
Elias
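
For illustration only, the detection step described in the second message could be sketched in plain Common Lisp as below. This is not Closure-HTML code and not a proposed patch: SNIFF-META-CHARSET is a hypothetical helper name, and the sketch assumes that the declaration appears as an ASCII "charset=" token within the first kilobyte of the input and that every octet below 256 maps to a character through CODE-CHAR, which holds on the common Unicode-aware implementations.

```lisp
;; Hypothetical helper, not part of Closure-HTML.
(defun sniff-meta-charset (octets &key (limit 1024))
  "Return the charset name declared in the first LIMIT octets, or NIL."
  (let* ((end (min limit (length octets)))
         ;; A charset declaration is plain ASCII, so mapping each octet to
         ;; the character with the same code is enough to search for it.
         (head (map 'string #'code-char (subseq octets 0 end)))
         (key "charset=")
         (pos (search key head :test #'char-equal)))
    (when pos
      (let* ((start (+ pos (length key)))
             ;; Skip an optional opening quote right after the equals sign.
             (start (if (and (< start (length head))
                             (find (char head start) "\"'"))
                        (1+ start)
                        start))
             ;; The name ends at the first quote, semicolon, space,
             ;; slash or closing angle bracket.
             (stop (or (position-if (lambda (c) (find c "\"'; />"))
                                    head :start start)
                       (length head)))
             (name (subseq head start stop)))
        (when (plusp (length name))
          name)))))
```

A caller would pass the raw octets of the message body, and if a name comes back, decode the octets with whatever transcoding facility the application has before handing the text to the parser. If the function returns NIL, falling back to the current UTF-8 assumption seems the least surprising choice.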