[drakma-devel] Problem with Drakma and character encoding

Mathias Dahl mathias.dahl at gmail.com
Sat Jun 2 09:44:04 UTC 2007


Hi!

I have a problem with drakma and character encoding. My goal is to
make a small web utility which first GETs some content from a certain
page on my wiki, then optionally prepends stuff to this content, and
then POSTs the new content back. It works as long as only ASCII
characters are involved but fails when I use characters from the
higher part of Latin-1, in my case the Swedish character "ä".

Below I have simplified the code so that it just GETs the old content
and POSTs it back. What one sees is that the Swedish character is
correctly handled in the GET, I see the "ä" in its full glory, but
after having posted it back, the content gets corrupted and the next
time I run the function I get an error because of the strange
character.

Ok, enough blabbing, here is the code:

;;;;; code starts here

(defvar *boundary* "-------------------------1852275791466338532535335716")

(defconstant +crlf+ #.(format nil "~C~C" #\Return #\Linefeed))

(defun format-field (name value)
  (format nil "--~a~aContent-Disposition: form-data; name=\"~a\"~a~a~a~a"
          *boundary* +crlf+ name +crlf+ +crlf+ value +crlf+))

(defun foo ()
  (let* ((old-content (drakma:http-request
                       "http://klibb.com/cgi-bin/wiki.pl?action=browse;id=2007-05-31;raw=1"))
         (cookie-jar (make-instance 'drakma:cookie-jar))
         (new-content (concatenate 'string
                                   (format-field "title" "2007-05-31")
                                   (format-field "text" old-content)
                                   (format-field "recent_edit" "on")
                                   (format-field "username" "MathiasDahl")
                                   "--" *boundary* "--" +crlf+)))
    (format t "Old content: ~a" old-content)
    (setf (drakma:cookie-jar-cookies cookie-jar)
          (list
           (make-instance 'drakma:cookie :name "pwd"
                                         :value "editeramera"
                                         :expires (+ (get-universal-time) 36000)
                                         :domain "klibb.com")))
    (format t "New content: ~a" new-content)
    (drakma:http-request
     "http://klibb.com/cgi-bin/wiki.pl"
     :method :post
     :cookie-jar cookie-jar
     :content-type (format nil "multipart/form-data; boundary=~a" *boundary*)
     :content new-content)))

;;;;; code ends here

Again, the code is simplified, some parts are hardcoded etc, but the
above is enough to recreate the problem. Note that after running the
code one time, you cannot test it again, because the content on the
page is now changed.

Here is what I get after running the function the first time:

====
* (foo)
Old content: blä
New content: ---------------------------1852275791466338532535335716
Content-Disposition: form-data; name="title"

2007-05-31
---------------------------1852275791466338532535335716
Content-Disposition: form-data; name="text"

blä

---------------------------1852275791466338532535335716
Content-Disposition: form-data; name="recent_edit"

on
---------------------------1852275791466338532535335716
Content-Disposition: form-data; name="username"

MathiasDahl
---------------------------1852275791466338532535335716--
NIL
302
((:DATE . "Sat, 02 Jun 2007 09:30:53 GMT")
 (:SERVER . "Apache/2.2.3 (Mandriva Linux/PREFORK-1mdv2007.0)")
 (:SET-COOKIE
  . "MuuWiki=username%1EMathiasDahl; path=/; expires=Mon, 01-Jun-2009
09:30:53 GMT")
 (:LOCATION . "http://klibb.com/cgi-bin/wiki.pl/2007-05-31")
 (:CONTENT-LENGTH . "0") (:CONNECTION . "close")
 (:CONTENT-TYPE . "application/x-perl"))
#<PURI:URI http://klibb.com/cgi-bin/wiki.pl>
#<FLEXI-STREAMS:FLEXI-IO-STREAM {C5728E1}>
T
====

As you can see, all looks well; the old content ("blä") looks like it
should, and the new content looks the same (it's the data in the form
field "text"). However, when I now run the function again, I get this:

====
* (foo)

debugger invoked on a FLEXI-STREAMS:FLEXI-STREAM-ENCODING-ERROR in
thread #<THREAD "initial thread" {AC14469}>:
  Unexpected value #xA in UTF-8 sequence.

Type HELP for debugger help, or (SB-EXT:QUIT) to exit from SBCL.

restarts (invokable by number or by possibly-abbreviated name):
  0: [USE-VALUE] Specify a character to be used instead.
  1: [ABORT    ] Exit debugger, returning to top level.

(FLEXI-STREAMS::SIGNAL-ENCODING-ERROR
 #<FLEXI-STREAMS:FLEXI-IO-STREAM {C5B37D1}>
 "Unexpected value #x~X in UTF-8 sequence."
 10)
====

It fails because that "ä" is now something else.

When I do the same thing from a browser, i.e. POST the page again and
again, I don't see any problems. I have done some network sniffing
with Wireshark and what I can see is that when the browser POSTs the
content, the "ä" is correctly encoded in UTF-8 as xC3 xA4. In the POST
done by drakma, the character is sent as the single byte xE4 (which IS
the Unicode code point, but not encoded as UTF-8 if I understand
things correctly).
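
To spell out the two encodings, here is a minimal sketch, assuming
SBCL (sb-ext:string-to-octets is SBCL-specific; the helper names
octets-of, utf-8-lead-octet-p and utf-8-continuation-octet-p are made
up for illustration). It prints the octets of "ä" in UTF-8 and in
Latin-1, and checks the bit patterns that would explain why a lone
xE4 followed by a linefeed (#xA) is not a valid UTF-8 sequence, which
seems to match the error above:

```lisp
;; Sketch only; assumes SBCL with Unicode support.
(defun octets-of (string external-format)
  "Return the octets of STRING in EXTERNAL-FORMAT as a list."
  (coerce (sb-ext:string-to-octets string :external-format external-format)
          'list))

(defun utf-8-lead-octet-p (octet)
  "True if OCTET has the bit pattern 11xxxxxx, i.e. it looks like the
start of a multi-octet UTF-8 sequence."
  (= (ldb (byte 2 6) octet) #b11))

(defun utf-8-continuation-octet-p (octet)
  "True if OCTET has the bit pattern 10xxxxxx, i.e. it can continue a
multi-octet UTF-8 sequence."
  (= (ldb (byte 2 6) octet) #b10))

;; Build the string from the code point so that the result does not
;; depend on the encoding of this source text.
(let ((a-umlaut (string (code-char #xE4))))
  (format t "UTF-8:   ~{#x~X~^ ~}~%" (octets-of a-umlaut :utf-8))
  (format t "Latin-1: ~{#x~X~^ ~}~%" (octets-of a-umlaut :latin-1)))

;; A UTF-8 decoder that sees xE4 expects continuation octets next;
;; a linefeed is not one.
(format t "xE4 looks like a lead octet: ~a~%"
        (utf-8-lead-octet-p #xE4))
(format t "x0A is a continuation octet: ~a~%"
        (utf-8-continuation-octet-p #x0A))
```

If that reasoning holds, the body drakma sends would have to contain
the UTF-8 octets rather than one octet per character for the wiki to
store it correctly.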

At first I tried to include the encoding in Content-Type, but when I
saw that it did not make any difference and also saw that Firefox does
not include this, I removed it. Oh, and I should show this as well:

* (sb-impl::default-external-format)

:UTF-8

Just so that we are clear that I DO see the content correctly and
UTF-8 is used.

I also tried with a version where I even hardcoded the content to be
sent to be "blä", and that gives the same problem. Maybe I should have
shortened the code above to that, but what I wanted to show was that
the same content I can GET nicely enough cannot be POSTed without
problems.

Any ideas on how I can continue debugging this? I feel kinda lost. It
feels frustrating to get stuck on a problem like this when I have got
the other logic to work, GETting and POSTing and stuff...

I am running this in SBCL 1.0 under Mandriva GNU/Linux.

Thanks!

/Mathias


