[cxml-devel] Corrupted UTF-8 input

Patrick May patrick.may at mac.com
Wed Oct 16 19:09:22 UTC 2013


On Oct 16, 2013, at 11:55 AM, Raymond Wiker <rwiker at gmail.com> wrote:
> On Oct 16, 2013, at 17:39 , Patrick May <patrick.may at mac.com> wrote:
>> Thanks for the quick reply.
>> 
>> 	I'm crawling websites in the wild, so it's quite likely the error is legitimate.  I call chtml like this:
>> 
>>  (chtml:parse (drakma:http-request url :user-agent :chrome)
>>               (chtml:make-lhtml-builder))
>> 
>> I've only been playing with this for a couple of days, so I'm not sure what other options I have for the builder or if I should just catch the error and skip that url.
>> 
>> 	Restart appears to skip processing that call and move on to the next url.
> 
> Try evaluating just 
> 
>> (drakma:http-request url :user-agent :chrome)
> 
> separately, for one of the urls that fail. That will give you the document, plus a number of other values. One of them, the "headers" (the third return value), may contain information about content type and encoding. Compare this with what is actually returned.
> 
> Other things to try:
> 
> 1) add :want-stream t to the call to drakma:http-request. This may give chtml a better chance to pick up the correct encoding.
> 
> 2) add :force-binary t to the call to drakma:http-request.

	I just wrapped it in a handler-case and logged the bad URLs.  Turns out I was chasing links to PNG files (in <a> tags, not <img>).
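
	For the archive, a minimal sketch of that kind of wrapper, which also checks the Content-Type header (the third value returned by drakma:http-request) before parsing. The names *bad-urls*, html-content-type-p and parse-url-safely are hypothetical and not from my actual crawler, and the sketch assumes the fetch and decoding failures are signalled as subclasses of error.

```lisp
;; Sketch only: assumes Drakma and Closure HTML are already loaded,
;; e.g. via (ql:quickload '(:drakma :closure-html)).
;; *BAD-URLS*, HTML-CONTENT-TYPE-P and PARSE-URL-SAFELY are hypothetical names.

(defvar *bad-urls* '()
  "URLs that could not be fetched or parsed, each paired with a reason string.")

(defun html-content-type-p (headers)
  "True if the Content-Type entry in HEADERS (Drakma's header alist) mentions HTML."
  (let ((content-type (cdr (assoc :content-type headers))))
    (and content-type
         (search "html" content-type :test #'char-equal))))

(defun parse-url-safely (url)
  "Return an LHTML tree for URL, or NIL if the fetch or the parse fails."
  (handler-case
      (multiple-value-bind (body status headers)
          (drakma:http-request url :user-agent :chrome)
        (declare (ignore status))
        ;; Drakma returns a string for text content types and an octet
        ;; vector otherwise, so a PNG behind an <a> link is skipped here.
        (if (and (stringp body) (html-content-type-p headers))
            (chtml:parse body (chtml:make-lhtml-builder))
            (progn
              (push (cons url "not an HTML text response") *bad-urls*)
              nil)))
    (error (condition)
      ;; Record the failure and let the caller move on to the next URL.
      (push (cons url (princ-to-string condition)) *bad-urls*)
      nil)))
```

	Calling (parse-url-safely url) for each crawled URL and inspecting *bad-urls* afterwards should show which links pointed at something other than HTML.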

	Thanks for the input.

Regards,

Patrick
