[Bese-devel] UCW and Unicode

Wed Nov 9 13:45:12 UTC 2005

Marco:
> Jan Rychter <jan at rychter.com> writes:
> > I tried to use UCW in an application with Unicode content. It turns out
> > that there are some problems.
> >
> > First, logging with loglevel +dribble+ won't work, as the logging stream
> > isn't able to accept utf-8. 
> 
> is adding an :external-format parameter to the stream-log-appender
> class enough? (this would then get passed to with-open-file)

Not really, because things like image uploads will mess up the stream
format anyway, resulting in an ugly crash. Basically, logging all
received content isn't such a great idea.
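
For reference, the suggestion above would amount to roughly the
following. This is only a sketch: LOG-LINE and its arguments are made up
for illustration and are not part of UCW or arnesi, and the :utf-8
external format name is implementation-dependent (SBCL accepts it). It
handles text fine, but it does nothing for binary upload data.

```lisp
;; Hypothetical helper, not UCW's stream-log-appender.
;; Assumes the implementation understands :external-format :utf-8.
(defun log-line (path line)
  "Append LINE to the file at PATH, encoding it as UTF-8."
  (with-open-file (out path
                       :direction :output
                       :if-exists :append
                       :if-does-not-exist :create
                       :external-format :utf-8)
    (write-line line out)))
```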

> > Second, <:as-html escapes non-ascii characters, which really isn't
> > what you normally want. 
> 
> the html escaping is handled by the WRITE-AS-HTML function in arnesi,
> can you suggest the changes we need? we already have code which always
> uses write-char (as opposed to escaping) on unicode sbcl.

> > Third (and perhaps this is a result of the above) textarea interface
> > element escapes non-ascii input as well, so you get escaped
> > characters in your lisp strings.

OK, I have to admit I was wrong: it is neither as-html's fault nor any
other widget's. These things are tough to debug. It turns out that the
browser was doing the escaping and, confusingly, only some of the time.

After fixing the HTTP headers to declare a utf-8 encoding, things look
much better. (BTW -- how do I tell a template-component what
content-type to use in the HTTP headers, apart from writing a render
:around method and using internal UCW functions?)

So, I've gotten most things to work, except for one: uploads. These will
not work, because UCW fundamentally assumes that an incoming request can
be read with a single, fixed stream format.

Unfortunately, once we do
  (setf (external-format-for :http) :utf-8-unix)

uploads break, because browsers send binary data in multipart bodies,
and that data is generally not valid UTF-8.
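
To make the problem concrete, here is a small sketch of a structural
UTF-8 check on raw octets. UTF-8-WELL-FORMED-P is a name I made up for
illustration, and the check is simplified (it does not reject overlong
three- or four-byte forms or surrogates). The point is only that
typical binary data, such as the first bytes of a JPEG file (FF D8 FF
E0), cannot be read as UTF-8 text.

```lisp
;; Illustrative only; not part of UCW. OCTETS is any vector of integers 0-255.
(defun utf-8-well-formed-p (octets)
  "Return true if OCTETS has a valid UTF-8 lead/continuation byte structure."
  (let ((i 0)
        (n (length octets)))
    (loop while (< i n)
          do (let* ((lead (aref octets i))
                    ;; Number of continuation bytes implied by the lead byte.
                    (extra (cond ((< lead #x80) 0)
                                 ((<= #xC2 lead #xDF) 1)
                                 ((<= #xE0 lead #xEF) 2)
                                 ((<= #xF0 lead #xF4) 3)
                                 ;; Stray continuation byte or a byte that
                                 ;; never occurs in UTF-8 (e.g. #xFF).
                                 (t (return-from utf-8-well-formed-p nil)))))
               ;; All continuation bytes must exist and look like 10xxxxxx.
               (unless (and (< (+ i extra) n)
                            (loop for j from (1+ i) to (+ i extra)
                                  always (<= #x80 (aref octets j) #xBF)))
                 (return-from utf-8-well-formed-p nil))
               (incf i (1+ extra))))
    t))

;; (utf-8-well-formed-p #(#xC5 #x82))           ; a two-byte UTF-8 character
;; (utf-8-well-formed-p #(#xFF #xD8 #xFF #xE0)) ; start of a JPEG file
```

The first call returns true and the second returns NIL, since #xFF is
never a legal byte in UTF-8.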

Solving this isn't obvious -- we'd need to parse the multipart content
using a "safe" (byte-oriented) stream format and then probably create
separate utf-8 streams over the individual request parts. Or do away
with the stream metaphor altogether and just work on in-memory request
data (we store all request data anyway in mod-lisp).
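
A rough sketch of the in-memory variant, to show what I mean. None of
this is UCW code: all function names here are made up, the parsing is
deliberately simplified and is not a complete multipart/form-data
parser, and the decoding step assumes SBCL (sb-ext:octets-to-string).
The request body is split on the boundary as octets, part headers are
read as ASCII, and only parts that look like text are decoded as UTF-8.

```lisp
;; Sketch under the assumptions above. OCTETS must be a
;; (vector (unsigned-byte 8)); BOUNDARY is the boundary string
;; taken from the request's Content-Type header.

(defun ascii-octets (string)
  "Encode an ASCII-only STRING as a vector of octets."
  (map '(vector (unsigned-byte 8)) #'char-code string))

(defun split-octets (octets delimiter)
  "Return the pieces of OCTETS found between occurrences of DELIMITER."
  (loop with start = 0
        for pos = (search delimiter octets :start2 start)
        collect (subseq octets start pos)
        while pos
        do (setf start (+ pos (length delimiter)))))

(defun parse-part (piece)
  "Split PIECE into (header-string . body-octets), or NIL if it is not a part.
PIECE is expected to look like CRLF headers CRLF CRLF body CRLF."
  (let ((pos (and (>= (length piece) 2)
                  (search '(13 10 13 10) piece :start2 2))))
    (when pos
      (cons (map 'string #'code-char (subseq piece 2 pos))
            ;; Drop the CRLF that precedes the next boundary.
            (subseq piece (+ pos 4) (max (+ pos 4) (- (length piece) 2)))))))

(defun text-part-p (headers)
  "True if the part has no Content-Type header or a text/* one."
  (let ((pos (search "content-type:" headers :test #'char-equal)))
    (or (null pos)
        (search "text/" headers :start2 pos :test #'char-equal))))

(defun parse-multipart (octets boundary)
  "Return a list of (headers . body). BODY is a string for text parts
and an octet vector for everything else."
  (loop for piece in (split-octets
                      octets
                      (ascii-octets (concatenate 'string "--" boundary)))
        for part = (parse-part piece)
        when part
          collect (destructuring-bind (headers . body) part
                    (cons headers
                          (if (text-part-p headers)
                              ;; SBCL-specific decoding step.
                              (sb-ext:octets-to-string
                               body :external-format :utf-8)
                              body)))))
```

With something like this, an uploaded image stays a plain octet vector
and never goes through a character decoder, while ordinary form fields
come out as Lisp strings.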

--J.