[cxml-devel] Corrupted UTF-8 input
Patrick May
patrick.may at mac.com
Wed Oct 16 15:39:35 UTC 2013
On Oct 16, 2013, at 11:35 AM, Raymond Wiker <rwiker at gmail.com> wrote:
> On Oct 16, 2013, at 17:12 , Patrick May <patrick.may at mac.com> wrote:
>> Hi,
>>
>> I'm using chtml for a simple experimental web crawler. I'm occasionally getting this error (Slime output):
>>
>> 0: (RUNES-ENCODING::XERROR "Corrupted UTF-8 input (initial byte was #b~8,'0B)" 255)
>> 1: (#<STANDARD-METHOD RUNES-ENCODING:DECODE-SEQUENCE ((EQL :UTF-8) T T T T T ...)> :UTF-8 #(255 216 255 0 0 0 ...) 0 3 #(65535 0 0 0 0 0 ...) 0 8191 NIL)
>> 2: (NIL #<Unknown Arguments>)
>> 3: (#<STANDARD-METHOD RUNES::XSTREAM-UNDERFLOW (RUNES:XSTREAM)> #<RUNES:XSTREAM NIL>)
>> 4: (SGML::READ-TOKEN #<RUNES:XSTREAM NIL> #<SGML::DTD (:PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN")>)
>> 5: (SGML::READ-TOKEN* #<RUNES:XSTREAM NIL> #<SGML::DTD (:PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN")>)
>> 6: (SGML:SGML-PARSE #<SGML::DTD (:PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN")> #<RUNES:XSTREAM NIL>)
>> 7: (CLOSURE-HTML::PARSE-XSTREAM #<RUNES:XSTREAM NIL> #<CLOSURE-HTML:LHTML-BUILDER #x3020032CBF3D>)
>>
>> Choosing the restart continuation seems to get past it, but I'd like to understand what's going on and how to automatically detect and work around it.
>>
>> Any input appreciated.
>
>
> How sure are you that the input is actually in UTF-8 format? What does the "restart" do?
Thanks for the quick reply.
I'm crawling websites in the wild, so it's quite likely the error is legitimate. I call chtml like this:
    (chtml:parse (drakma:http-request url :user-agent :chrome)
                 (chtml:make-lhtml-builder))
I've only been playing with this for a couple of days, so I'm not sure what other options the builder offers, or whether I should just catch the error and skip that URL.
The restart appears to skip processing that call and move on to the next URL.
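One way to detect this automatically, as a minimal sketch rather than a definitive fix: the octets at the start of the failing buffer in frame 1 of the backtrace (255 216 255, i.e. #xFF #xD8 #xFF) match the JPEG start-of-image marker, so the crawler has probably fetched an image rather than an HTML page. The sketch below assumes Drakma and Closure HTML are already loaded (for example via Quicklisp). The function name parse-html-or-skip is hypothetical and not part of either library. It relies on drakma:http-request returning the status code and the response headers (an alist keyed by keywords) as its second and third values, and it treats any error signalled during parsing as a reason to skip the URL.

```lisp
;; Hypothetical helper, not part of chtml or Drakma.
;; Assumes the DRAKMA and CLOSURE-HTML (nickname CHTML) packages are loaded.
(defun parse-html-or-skip (url)
  "Return an LHTML tree for URL, or NIL if the response should be skipped."
  (multiple-value-bind (body status headers)
      (drakma:http-request url :user-agent :chrome)
    (let ((content-type (cdr (assoc :content-type headers))))
      (cond
        ;; Skip anything other than a plain 200 OK.
        ((/= status 200) nil)
        ;; Skip responses whose Content-Type does not mention HTML,
        ;; e.g. image/jpeg, which is what the backtrace suggests.
        ((not (and content-type
                   (search "html" content-type :test #'char-equal)))
         nil)
        (t
         ;; Even pages declared as HTML can carry badly encoded bytes,
         ;; so catch parse errors and move on instead of entering the debugger.
         (handler-case
             (chtml:parse body (chtml:make-lhtml-builder))
           (error (condition)
             (warn "Skipping ~A: ~A" url condition)
             nil)))))))
```

In the crawl loop, a NIL return value would then simply mean "nothing to extract from this URL". Catching the general ERROR class is deliberately broad for a crawler; a narrower handler would need the specific condition type signalled by the decoder.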
Thanks again,
Patrick