From tuxedo at deepsky.com Thu Nov 10 20:13:07 2011
From: tuxedo at deepsky.com (Keith Browne)
Date: Thu, 10 Nov 2011 15:13:07 -0500 (EST)
Subject: [closure-devel] Array access out of bounds in Closure HTML's sgml parser
Message-ID:

We're using Closure HTML and Drakma to extract information from Web
pages. We've run across an intermittent fault with one page in
particular from YouTube.

We had a little difficulty reproducing the bug at first, but we
discovered that YouTube was sending us different contents each time.
We ran our code in a loop and captured several hundred deliveries of
the Web page in question until we got another instance that failed.
I've put a copy of the HTML that trips the bug up at

  http://www.deepsky.com/~tuxedo/youtube-sgml-breaker.html

You can see the problem by loading closure-html and drakma and
evaluating this form:

  (chtml:parse
   (drakma:http-request
    "http://www.deepsky.com/~tuxedo/youtube-sgml-breaker.html")
   (chtml:make-lhtml-builder))

On SBCL 1.0.53, I'm getting this error:

  Index 8192 out of bounds for (SIMPLE-ARRAY CHARACTER (8192)),
  should be nonnegative and <8192.
     [Condition of type SB-INT:INVALID-ARRAY-INDEX-ERROR]

The error is raised in SGML::READ-LITERAL. I only vaguely understand
what's going on in that function. I note that it's raising the error
when it's parsing the big block of flashvar-related stuff on line 244
of the HTML file, and if I delete or add an extra character earlier in
that line, I can make the error go away. I infer that there's
something happening in the character decoding at the point where it
needs to grow the buffer that's making it lose, but I can't figure out
just what it is.

Keith Browne
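
A minimal sketch of the kind of capture loop described above, for anyone
who wants to try the same approach on a page whose contents vary between
requests. It is an illustration only, not the code that was used to find
the failing page. CAPTURE-FIRST-FAILURE, its FETCH and PARSE arguments,
and the *PAGE-URL* variable in the usage comment are all hypothetical
names; only DRAKMA:HTTP-REQUEST, CHTML:PARSE and CHTML:MAKE-LHTML-BUILDER
come from the form shown earlier.

```lisp
;; Hypothetical helper; not part of Closure HTML or Drakma.
(defun capture-first-failure (fetch parse &key (max-tries 500))
  "Call FETCH up to MAX-TRIES times, passing each result to PARSE.
Return the first fetched document for which PARSE signals an ERROR,
along with the condition and the attempt number, as three values.
Return NIL if every parse succeeds."
  (loop for attempt from 1 to max-tries
        do (let ((document (funcall fetch)))
             (handler-case (funcall parse document)
               (error (condition)
                 ;; Keep the exact document that failed so it can be
                 ;; saved and replayed later.
                 (return (values document condition attempt)))))))

;; Possible usage, assuming drakma and closure-html are loaded and
;; *PAGE-URL* (hypothetical) holds the address of the varying page:
;;
;; (capture-first-failure
;;  (lambda () (drakma:http-request *page-url*))
;;  (lambda (html) (chtml:parse html (chtml:make-lhtml-builder))))
```

Passing the fetch and parse steps in as functions keeps the loop itself
free of network and parser dependencies, so it can be exercised with
stand-in functions before pointing it at a live page. The returned
document still has to be written to a file by the caller if it is to be
shared.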