[Bese-devel] UCW and Unicode
Marco Baringer
mb at bese.it
Wed Nov 9 14:14:23 UTC 2005
Jan Rychter <jan at rychter.com> writes:
> Marco:
>> Jan Rychter <jan at rychter.com> writes:
>> > I tried to use UCW in an application with Unicode content. It turns out
>> > that there are some problems.
>> >
>> > First, logging with loglevel +dribble+ won't work, as the logging stream
>> > isn't able to accept utf-8.
>>
>> is adding an :external-format parameter to the stream-log-appender
>> class enough? (this would then get passed to with-open-file)
>
> Not really, because things like image uploads will mess up the stream
> format anyway, resulting in an ugly crash. Basically, logging all
> received content isn't such a great idea.
no, probably not. it does help in debugging certain issues though. if
you think of a decent compromise (something better than dying with
stream errors) i'd love to hear it.
>> > Second, <:as-html escapes non-ascii characters, which really isn't
>> > what you normally want.
>>
>> the html escaping is handled by the WRITE-AS-HTML function in arnesi,
>> can you suggest the changes we need? we already have code which always
>> uses write-char (as opposed to escaping) on unicode sbcl.
>
>> > Third (and perhaps this is a result of the above) textarea interface
>> > element escapes non-ascii input as well, so you get escaped
>> > characters in your lisp strings.
>
> Ok, I have to admit: I was wrong -- it is neither as-html's nor any other
> widget's fault. These things are tough to debug. It turns out that it
> was really the browser that was doing the escaping, confusingly -- only
> sometimes.
>
> After fixing the HTTP headers to include information about the utf-8
> encoding, things look much better. (BTW -- how do I tell a
> template-component what content-type to use in the HTTP headers, apart
> from writing a render :around method and using internal UCW functions?)
option 1 -
  use a <meta http-equiv="Content-Type" content="text/html; charset=utf-8;"/>
  tag in the template.
option 2 -
  grab the latest patch from ucw_dev, make your template component a
  subclass of window-component and add a
  :content-type "text/html; charset=utf-8;" to the default initargs.
> So, I've gotten most things to work, except for one: uploads. These will
> not work, as there is a fundamental assumption within UCW that we can
> assume the stream format for an incoming request.
we don't strictly need to though. we know the http headers are 7 bit
ascii, so we can treat the request as a byte stream and do the
decoding ourselves. if the content-type is
application/x-www-form-urlencoded then we again know that the data is
7 bit ascii (not that all browsers respect this) and we can convert it
ourselves; for multipart/form-data we can also do the right thing.
all of this (while something we really really should do) requires
hacking the various backends.
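
a rough sketch of the "convert it ourselves" step for
application/x-www-form-urlencoded data: percent-decode the 7 bit ascii
text into octets first, then decode those octets as utf-8. this is not
ucw code. both function names are made up for illustration, and it
assumes a unicode-enabled sbcl (for sb-ext:octets-to-string) and a
browser that really sent utf-8.

```lisp
;; Hypothetical helpers, not part of UCW. Assumes a Unicode-enabled SBCL.

(defun percent-decode-to-octets (string)
  "Turn a urlencoded token (assumed to be 7 bit ASCII) into a vector of octets.
Signals an error on a truncated or malformed %XX escape."
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8)
                            :fill-pointer 0)))
    (do ((i 0 (1+ i)))
        ((>= i (length string)) octets)
      (let ((ch (char string i)))
        (cond ((char= ch #\+)
               ;; in form data, + stands for a space
               (vector-push 32 octets))
              ((char= ch #\%)
               ;; %XX is one octet written as two hex digits
               (vector-push (parse-integer string
                                           :start (+ i 1)
                                           :end (+ i 3)
                                           :radix 16)
                            octets)
               (incf i 2))
              (t
               ;; everything else is plain ASCII and maps to its own code
               (vector-push (char-code ch) octets)))))))

(defun decode-form-value (string)
  "Percent-decode STRING, then interpret the resulting octets as UTF-8."
  (sb-ext:octets-to-string
   (coerce (percent-decode-to-octets string)
           '(simple-array (unsigned-byte 8) (*)))
   :external-format :utf-8))

;; (decode-form-value "caf%C3%A9+au+lait") => "café au lait"
```

the point of going through octets is that the character encoding is
applied exactly once, after the transport escaping has been undone.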
> Unfortunately, after we do a:
> (setf (external-format-for :http) :utf-8-unix)
>
> things will break, because browsers will send binary data in multiparts,
> which will be non-UTF-8-conforming and will break things.
>
> Solving this isn't obvious -- we'd need to parse multipart content using
> a "safe" stream format (byte-oriented) and then probably create other
> streams which are utf-8, with request parts. Or do away with the stream
> metaphor altogether and just work on in-memory request data (we store
> all request data anyway in mod-lisp).
so the solution to this issue requires:
1) changing the various backends to use byte-streams and not character
streams. add encoding to/from strings where needed.
2) changing rfc2388 along the same lines.
not a trivial job, but definitely doable.
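
to illustrate point 1, a minimal sketch of the byte-first approach on
in-memory request data: find the end of the header block in the raw
octets, decode only the headers as ascii, and leave the body as octets
so binary multipart parts never pass through a character decoder.
split-head-and-body is a hypothetical helper, not part of ucw or
rfc2388.

```lisp
;; Hypothetical helper, not part of UCW or rfc2388.

(defun split-head-and-body (octets)
  "Split raw request OCTETS at the first blank line (CR LF CR LF).
Returns the header block as an ASCII string and the body as raw octets.
If no blank line is found, the whole input is treated as headers and
the second value is NIL."
  (let ((end (search #(13 10 13 10) octets)))
    (if end
        (values (map 'string #'code-char (subseq octets 0 end))
                (subseq octets (+ end 4)))
        (values (map 'string #'code-char octets) nil))))

;; Example: an ASCII header followed by four bytes that are not valid UTF-8.
(split-head-and-body
 (concatenate '(vector (unsigned-byte 8))
              (map 'list #'char-code "Content-Type: image/png")
              '(13 10 13 10)
              '(137 80 78 71)))
;; first value:  "Content-Type: image/png"
;; second value: the octets 137 80 78 71, untouched
```

text parts of the body would then be decoded separately, each with the
charset that applies to it, while binary parts stay as octets.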
--
-Marco
Ring the bells that still can ring.
Forget the perfect offering.
There is a crack in everything.
That's how the light gets in.
-Leonard Cohen