[Bese-devel] UCW and Unicode

Marco Baringer mb at bese.it
Wed Nov 9 14:14:23 UTC 2005


Jan Rychter <jan at rychter.com> writes:

> Marco:
>> Jan Rychter <jan at rychter.com> writes:
>> > I tried to use UCW in an application with Unicode content. It turns out
>> > that there are some problems.
>> >
>> > First, logging with loglevel +dribble+ won't work, as the logging stream
>> > isn't able to accept utf-8. 
>> 
>> is adding an :external-format parameter to the stream-log-appender
>> class enough? (this would then get passed to with-open-file)
>
> Not really, because things like image uploads will mess up the stream
> format anyway, resulting in an ugly crash. Basically, logging all
> received content isn't such a great idea.

no, probably not. it does help in debugging certain issues though. if
you think of a decent compromise (something better than dying with
stream errors) i'd love to hear it.

>> > Second, <:as-html escapes non-ascii characters, which really isn't
>> > what you normally want. 
>> 
>> the html escaping is handled by the WRITE-AS-HTML function in arnesi,
>> can you suggest the changes we need? we already have code which always
>> uses write-char (as opposed to escaping) on unicode sbcl.
>
>> > Third (and perhaps this is a result of the above) textarea interface
>> > element escapes non-ascii input as well, so you get escaped
>> > characters in your lisp strings.
>
> Ok, I have to admit: I was wrong -- it isn't as-html's nor other
> widget's fault. These things are tough to debug. It turns out that it
> was really the browser that was doing the escaping, confusingly -- only
> sometimes.
>
> After fixing HTTP headers to include information about a utf-8 encoding
> things look much better. (BTW -- how do I tell a template-component
> about what content-type to use in HTTP headers? Apart from a render
> :around and using internal UCW functions?)

option 1 -
   use a <meta http-equiv="Content-Type" content="text/html; charset=utf-8;"/>

option 2 -
   grab the latest patch from ucw_dev

   make your template component a subclass of window-component and add
   a :content-type "text/html; charset=utf-8;" to the default initargs.

> So, I've gotten most things to work, except for one: uploads. These will
> not work, as there is a fundamental assumption within UCW that we can
> assume the stream format for an incoming request.

we don't strictly need to though. we know the http headers are 7 bit
ascii and so we can treat the request as a byte stream and do the
decoding ourselves. if the content-type is
application/x-www-form-urlencoded then we again know that the data is
7 bit ascii (not that all browsers respect this) and we can convert it
ourselves. for multipart/form-data we can also do the right thing.
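
a minimal sketch of the urlencoded case, in portable common lisp and
assuming a unicode-capable implementation. the names
url-decode-to-octets and utf-8-octets-to-string are hypothetical (not
part of ucw or arnesi), the input is assumed to really be 7 bit ascii,
and the utf-8 decoder does no validation:

```lisp
;; Sketch only: hypothetical helpers, not UCW, arnesi or backend code.

(defun url-decode-to-octets (string)
  "Turn an application/x-www-form-urlencoded STRING into a vector of octets.
Assumes STRING is 7-bit ASCII; a malformed %XX escape signals an error."
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8)
                            :fill-pointer 0))
        (i 0))
    (loop while (< i (length string))
          do (let ((char (char string i)))
               (cond ((char= char #\+)
                      ;; + stands for a space in form data.
                      (vector-push 32 octets)
                      (incf i))
                     ((and (char= char #\%) (<= (+ i 3) (length string)))
                      ;; %XX is one raw octet, whatever the charset.
                      (vector-push (parse-integer string
                                                  :start (+ i 1)
                                                  :end (+ i 3)
                                                  :radix 16)
                                   octets)
                      (incf i 3))
                     (t
                      (vector-push (char-code char) octets)
                      (incf i)))))
    octets))

(defun utf-8-octets-to-string (octets)
  "Decode OCTETS as UTF-8. No checks for overlong, truncated or stray bytes."
  (with-output-to-string (out)
    (let ((i 0)
          (n (length octets)))
      (loop while (< i n)
            do (let* ((lead (aref octets i))
                      ;; Number of continuation octets after the lead octet.
                      (extra (cond ((< lead #x80) 0)
                                   ((< lead #xE0) 1)
                                   ((< lead #xF0) 2)
                                   (t 3)))
                      (code (case extra
                              (0 lead)
                              (1 (logand lead #x1F))
                              (2 (logand lead #x0F))
                              (t (logand lead #x07)))))
                 (loop repeat extra
                       do (incf i)
                          (setf code (logior (ash code 6)
                                             (logand (aref octets i) #x3F))))
                 (incf i)
                 (write-char (code-char code) out))))))
```

the point of the two steps is that the stream never has to be opened
with a character encoding: the percent escapes give us octets, and only
then do we decide to read those octets as utf-8, e.g.
(utf-8-octets-to-string (url-decode-to-octets "caf%C3%A9")).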

all of this (while something we really really should do) requires
hacking the various backends.

> Unfortunately, after we do a:
>   (setf (external-format-for :http)  :utf-8-unix)      
>
> things will break, because browsers will send binary data in multiparts,
> which will be non-UTF-8-conforming and will break things.
>
> Solving this isn't obvious -- we'd need to parse multipart content using
> a "safe" stream format (byte-oriented) and then probably create other
> streams which are utf-8, with request parts. Or do away with the stream
> metaphor altogether and just work on in-memory request data (we store
> all request data anyway in mod-lisp).

so the solution to this issue requires:

1) changing the various backends to use byte-streams and not character
   streams. add encoding to/from strings where needed.

2) changing rfc2388 along the same lines.
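
a rough sketch of what byte-oriented multipart handling could look
like, assuming the whole request body is already in memory as a vector
of octets. split-multipart, ascii-octets and octets-to-ascii are
hypothetical names, this is not how the rfc2388 library currently works,
and it ignores preambles, nested multiparts and most malformed input:

```lisp
;; Sketch only: hypothetical helpers, not the rfc2388 library's API.

(defun ascii-octets (string)
  "Convert a 7-bit ASCII STRING to a vector of octets."
  (map '(vector (unsigned-byte 8)) #'char-code string))

(defun octets-to-ascii (octets)
  "Convert OCTETS holding 7-bit ASCII (such as part headers) to a string."
  (map 'string #'code-char octets))

(defun split-multipart (body boundary)
  "Split the octet vector BODY on BOUNDARY (a string, without the leading --).
Returns a list of (HEADER-TEXT . BODY-OCTETS), one entry per part.
Part bodies are never decoded, so binary uploads pass through untouched."
  (let* ((delimiter (ascii-octets (concatenate 'string "--" boundary)))
         (position (search delimiter body))
         (parts '()))
    (loop while position
          do (let* (;; Skip the delimiter and the CRLF that follows it.
                    (part-start (min (+ position (length delimiter) 2)
                                     (length body)))
                    (next (search delimiter body :start2 part-start)))
               (when next
                 ;; Headers end at the first empty line (CR LF CR LF).
                 (let ((header-end (search #(13 10 13 10) body
                                           :start2 part-start :end2 next)))
                   (when header-end
                     (let ((body-start (+ header-end 4)))
                       (push (cons (octets-to-ascii
                                    (subseq body part-start header-end))
                                   ;; Drop the CRLF before the next delimiter.
                                   (subseq body body-start
                                           (max body-start (- next 2))))
                             parts)))))
               (setf position next)))
    (nreverse parts)))
```

with something like this the backend only decodes the parts it knows
are text (using whatever charset applies), and file uploads stay as
octets, so non-utf-8 data in a part can no longer break the stream.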

not a trivial job, but definitely doable.

-- 
-Marco
Ring the bells that still can ring.
Forget the perfect offering.
There is a crack in everything.
That's how the light gets in.
	-Leonard Cohen


