[cxml-devel] Corrupted UTF-8 input

Patrick May patrick.may at mac.com
Wed Oct 16 15:39:35 UTC 2013


On Oct 16, 2013, at 11:35 AM, Raymond Wiker <rwiker at gmail.com> wrote:

> On Oct 16, 2013, at 17:12 , Patrick May <patrick.may at mac.com> wrote:
>> Hi,
>> 
>> 	I'm using chtml for a simple experimental web crawler.  I'm occasionally getting this error (Slime output):
>> 
>> 0: (RUNES-ENCODING::XERROR "Corrupted UTF-8 input (initial byte was #b~8,'0B)" 255)
>> 1: (#<STANDARD-METHOD RUNES-ENCODING:DECODE-SEQUENCE ((EQL :UTF-8) T T T T T ...)> :UTF-8 #(255 216 255 0 0 0 ...) 0 3 #(65535 0 0 0 0 0 ...) 0 8191 NIL)
>> 2: (NIL #<Unknown Arguments>)
>> 3: (#<STANDARD-METHOD RUNES::XSTREAM-UNDERFLOW (RUNES:XSTREAM)> #<RUNES:XSTREAM NIL>)
>> 4: (SGML::READ-TOKEN #<RUNES:XSTREAM NIL> #<SGML::DTD (:PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN")>)
>> 5: (SGML::READ-TOKEN* #<RUNES:XSTREAM NIL> #<SGML::DTD (:PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN")>)
>> 6: (SGML:SGML-PARSE #<SGML::DTD (:PUBLIC "-//W3C//DTD HTML 4.0 Frameset//EN")> #<RUNES:XSTREAM NIL>)
>> 7: (CLOSURE-HTML::PARSE-XSTREAM #<RUNES:XSTREAM NIL> #<CLOSURE-HTML:LHTML-BUILDER #x3020032CBF3D>)
>> 
>> Choosing the restart continuation seems to get past it, but I'd like to understand what's going on and how to automatically detect and work around it.
>> 
>> 	Any input appreciated.
> 
> 
> How sure are you that the input is actually in UTF-8 format? What does the "restart" do?

	Thanks for the quick reply.

	I'm crawling websites in the wild, so it's quite likely the error is legitimate.  I call chtml like this:

  (chtml:parse (drakma:http-request url :user-agent :chrome)
               (chtml:make-lhtml-builder))

I've only been playing with this for a couple of days, so I'm not sure what other options I have for the builder, or whether I should just catch the error and skip that URL.

	The restart appears to skip processing that call and move on to the next URL.
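
	One observation on the backtrace: the octets in frame 1 start with 255 216 255 (#xFF #xD8 #xFF), which looks like the JPEG start-of-image marker, so that URL probably served an image rather than HTML. Below is a minimal sketch of the "catch the error and skip" approach. SAFE-PARSE is a hypothetical helper name, not part of chtml or Drakma. It assumes both libraries are already loaded, and it relies on Drakma's documented behaviour of returning a string for text content types and an octet vector for everything else.

```lisp
;; Sketch only: SAFE-PARSE is a hypothetical helper, not a chtml or Drakma API.
;; Assumes drakma and closure-html are already loaded (e.g. via Quicklisp).
(defun safe-parse (url)
  "Return the LHTML tree for URL, or NIL if the body is not text or parsing fails."
  (handler-case
      (let ((body (drakma:http-request url :user-agent :chrome)))
        ;; Drakma returns a string when the Content-Type matches
        ;; DRAKMA:*TEXT-CONTENT-TYPES* and an octet vector otherwise
        ;; (images, archives, ...), so non-text bodies are skipped here.
        (when (stringp body)
          (chtml:parse body (chtml:make-lhtml-builder))))
    ;; Catching ERROR is deliberately broad: it also covers network
    ;; failures, not only the decoding error from the backtrace.
    (error (e)
      (format *error-output* "~&Skipping ~A: ~A~%" url e)
      nil)))
```

	With the default setting of DRAKMA:*TEXT-CONTENT-TYPES*, the STRINGP test would also skip pages served as, for example, application/xhtml+xml, so that list may need extending for a real crawler.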

Thanks again,

Patrick
