[closure-devel] Array access out of bounds in Closure HTML's sgml parser
Keith Browne
tuxedo at deepsky.com
Thu Nov 10 20:13:07 UTC 2011
We're using Closure HTML and Drakma to extract information from Web pages.
We've run across an intermittent fault with one page in particular from
YouTube. We had a little difficulty reproducing the bug at first, but we
discovered that YouTube was sending us different contents each time. We
ran our code in a loop and captured several hundred deliveries of the Web
page in question until we got another instance that failed.
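The capture loop described above can be sketched roughly as follows. This is a minimal illustration rather than the code we actually ran: CAPTURE-FAILING-INPUT, FETCH and PARSE are hypothetical names, and the fetch and parse steps are passed in as functions so the sketch does not depend on any particular library.

```lisp
;; Sketch only: CAPTURE-FAILING-INPUT, FETCH and PARSE are hypothetical
;; names, not part of Drakma or Closure HTML.
(defun capture-failing-input (fetch parse &key (max-attempts 500))
  "Call FETCH up to MAX-ATTEMPTS times, handing each result to PARSE.
Return the first input on which PARSE signals an error, plus the
condition and the attempt number; return NIL if every attempt parses."
  (loop for attempt from 1 to max-attempts
        do (let ((input (funcall fetch)))
             ;; Any ERROR raised by PARSE ends the loop and hands back
             ;; the exact input that triggered it.
             (handler-case (funcall parse input)
               (error (condition)
                 (return (values input condition attempt)))))))
```

With Drakma and Closure HTML loaded, FETCH could be (lambda () (drakma:http-request url)) and PARSE could be (lambda (page) (chtml:parse page (chtml:make-lhtml-builder))), where url stands for whichever page is misbehaving. Returning the input itself matters here, since the server may never send the same bytes twice.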
I've put a copy of the HTML that triggers the bug at
http://www.deepsky.com/~tuxedo/youtube-sgml-breaker.html
You can see the problem by loading closure-html and drakma and evaluating
this form:
  (chtml:parse
   (drakma:http-request
    "http://www.deepsky.com/~tuxedo/youtube-sgml-breaker.html")
   (chtml:make-lhtml-builder))
On SBCL 1.0.53, I'm getting this error:
  Index 8192 out of bounds for (SIMPLE-ARRAY CHARACTER (8192)), should be
  nonnegative and <8192.
     [Condition of type SB-INT:INVALID-ARRAY-INDEX-ERROR]
The error is raised in SGML::READ-LITERAL. I only vaguely understand
what's going on in that function. The error occurs while it's parsing
the big block of flashvar-related stuff on line 244 of the HTML file, and
if I add or delete a single character earlier in that line, the error
goes away. I infer that something goes wrong in the character decoding at
the point where the buffer needs to grow, but I can't figure out just
what it is.
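For readers unfamiliar with this kind of fault: index 8192 in an 8192-element buffer is what a write exactly at the full mark looks like. The sketch below shows the usual grow-before-write pattern for a character buffer. It is not taken from SGML::READ-LITERAL, and APPEND-CHAR is a hypothetical helper used only for illustration.

```lisp
;; Illustration only: APPEND-CHAR is a hypothetical helper, not code
;; from Closure HTML's SGML parser.
(defun append-char (buffer fill char)
  "Store CHAR at index FILL of the string BUFFER, growing BUFFER first
if it is full. Return the (possibly new) buffer and the next index."
  (when (>= fill (length buffer))
    ;; Double the capacity (at least 1) and copy the old contents across.
    (let ((bigger (make-string (max 1 (* 2 (length buffer))))))
      (replace bigger buffer)
      (setf buffer bigger)))
  (setf (char buffer fill) char)
  (values buffer (1+ fill)))
```

A code path that stores a character without first making this check would fail only when the input happens to land exactly on the buffer boundary. That would be consistent with the error disappearing when a character is added or removed earlier in the line, though it is only a guess about the actual cause.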
Keith Browne