[hunchentoot-devel] Charset assumptions, in particular in POST bodies

Sat Jun 5 18:57:45 UTC 2010

About 6 months ago I got some strange encoding errors with a
Hunchentoot web server.  There are a few places in Hunchentoot
where the +latin-1+ character encoding is used as the external format
regardless of headers received from the client:

- GET-POST-DATA always returns a stream with a +latin-1+ external
format when the WANT-STREAM parameter is true.
- PARSE-MULTIPART-FORM-DATA creates a +latin-1+ stream from the
CONTENT-STREAM of the request.  (relevant RFC: 2388)
- MAYBE-READ-POST-PARAMETERS uses +latin-1+ to process POST bodies
with the "application/x-www-form-urlencoded" content type.
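
To illustrate the kind of symptom this produces, here is a minimal
sketch in Python (not Hunchentoot code; the field name and value are
made up for the example).  It decodes the same urlencoded body once as
UTF-8, which is what the client is assumed to have sent, and once as
latin-1, which is what a hard-coded external format would do:

```python
from urllib.parse import parse_qs

# Hypothetical POST body: the client sent "Zoë" encoded as UTF-8,
# so the "ë" arrives as the two percent-escaped octets C3 AB.
body = "name=Zo%C3%AB"

# Decoding with the charset the client actually used.
as_utf8 = parse_qs(body, encoding="utf-8")["name"][0]

# Decoding with a hard-coded latin-1 external format: each octet
# becomes its own character, so the name comes out garbled.
as_latin1 = parse_qs(body, encoding="latin-1")["name"][0]

print(as_utf8)
print(as_latin1)
```

The second result has one extra character because the two octets of
the UTF-8 sequence are each mapped to a separate latin-1 character.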

In addition, RECOMPUTE-REQUEST-PARAMETERS seems to interpret both the
message body and the query string according to a charset in the
request header.  I thought that Content-Type was only supposed to
affect the message body, not the headers (which are assumed to be in
ASCII).  Then shouldn't the URL and query string always be read as
ASCII?  RFC2047 discusses non-ASCII headers for MIME, but I don't know
if that is relevant except for parsing multipart forms.
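
A small sketch of the distinction I have in mind, again in Python
rather than Lisp and with a made-up request target.  The request line
itself is plain ASCII on the wire; the percent-escapes only yield
octets, and which charset to decode those octets with is a separate
decision that the Content-Type of the body does not obviously answer:

```python
from urllib.parse import urlsplit, unquote_to_bytes

# Hypothetical request target as it appears on the request line.
request_target = "/search?q=caf%C3%A9"

# The raw text is pure ASCII; this would raise if it were not.
request_target.encode("ascii")

# Split off the query string and undo the percent-escapes to get octets.
query = urlsplit(request_target).query
key, _, value = query.partition("=")
octets = unquote_to_bytes(value)

# The octets mean different things depending on the charset chosen.
decoded_utf8 = octets.decode("utf-8")
decoded_latin1 = octets.decode("latin-1")

print(octets, decoded_utf8, decoded_latin1)
```

So reading the query string "as ASCII" only gets as far as the
percent-escapes; some charset still has to be picked for the unescaped
octets, and that choice is an assumption either way.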

I'm not thoroughly versed in the HTTP protocol, but it seems that
these are bugs in Hunchentoot.  I have a half-completed patch but I
want to get some more opinions before I go any further.  There may
also be other lurking encoding issues in Hunchentoot, or I may be
entirely mistaken.

Proposed solution:
- GET-POST-DATA, PARSE-MULTIPART-FORM-DATA, and
MAYBE-READ-POST-PARAMETERS should respect the Content-Type header in
the request and use its charset to define the external format of the
stream used to parse the body
- RECOMPUTE-REQUEST-PARAMETERS should only use the Content-Type
external format to parse the post parameters
- PARSE-MULTIPART-FORM-DATA may need additional review to be in
accordance with RFC2047 and RFC2388
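
As a rough sketch of the first two points (in Python, not the actual
patch): take the charset from the request's Content-Type header, fall
back to latin-1 when none is given or it is unknown, and use the result
only for the POST body.  The function names and the latin-1 fallback
are assumptions made for this example, chosen to keep the current
behaviour when the client says nothing:

```python
import codecs
from email.message import Message
from urllib.parse import parse_qs

# Assumption: keep today's behaviour as the fallback.
DEFAULT_CHARSET = "latin-1"


def charset_from_content_type(header_value, default=DEFAULT_CHARSET):
    """Return the charset named in a Content-Type header, or DEFAULT."""
    if not header_value:
        return default
    msg = Message()
    msg["Content-Type"] = header_value
    charset = msg.get_content_charset()  # lower-cased, or None if absent
    if charset is None:
        return default
    try:
        codecs.lookup(charset)  # reject charsets we cannot decode
    except LookupError:
        return default
    return charset


def parse_post_parameters(raw_body, content_type):
    """Parse an urlencoded POST body using the charset from the header."""
    charset = charset_from_content_type(content_type)
    # An urlencoded body is ASCII on the wire; the charset only
    # matters for the octets hidden behind the percent-escapes.
    return parse_qs(raw_body.decode("ascii"), encoding=charset)


params = parse_post_parameters(
    b"name=Zo%C3%AB",
    "application/x-www-form-urlencoded; charset=UTF-8",
)
print(params)
```

The query string would go through a separate code path with its own
charset choice, so that a charset declared for the body does not
silently change how the URL is read.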

Feedback, please.

Thanks,
Red