[Bese-devel] UCW and Unicode

Jan Rychter jan at rychter.com
Wed Nov 9 15:11:30 UTC 2005


> Jan Rychter <jan at rychter.com> writes:
> > Marco:
> >> Jan Rychter <jan at rychter.com> writes:
> >> > I tried to use UCW in an application with Unicode content. It turns out
> >> > that there are some problems.
> >> >
> >> > First, logging with loglevel +dribble+ won't work, as the logging stream
> >> > isn't able to accept utf-8. 
> >> 
> >> is adding an :external-format parameter to the stream-log-appender
> >> class enough? (this would then get passed to with-open-file)
> >
> > Not really, because things like image uploads will mess up the stream
> > format anyway, resulting in an ugly crash. Basically, logging all
> > received content isn't such a great idea.
> 
> no, probably not. it does help in debugging certain issues though. if
> you think of a decent compromise (something better than dying with
> stream errors) i'd love to hear it.

I can't think of anything reasonable -- myself, I'd remove the content
logging and add it as required, when I actually debug something. It
isn't useful to someone who doesn't know UCW internals anyway.

> > After fixing HTTP headers to include information about a utf-8 encoding
> > things look much better. (BTW -- how do I tell a template-component
> > about what content-type to use in HTTP headers? Apart from a render
> > :around and using internal UCW functions?)
> 
> option 1 -
>    use a <meta http-equiv="Content-Type" content="text/html; charset=utf-8;"/>
> 
> option 2 -
>    grab the latest patch from ucw_dev
> 
>    make your template component a subclass of window-component and add
>    a :content-type "text/html; charset=utf-8;" to the default initargs.

I choose... er... option 2.

> > So, I've gotten most things to work, except for one: uploads. These will
> > not work, as there is a fundamental assumption within UCW that we can
> > assume the stream format for an incoming request.
> 
> we don't strictly need to though. we know the http headers are 7 bit
> ascii and so we can treat the request as a byte stream and do the
> encoding ourselves. if the content-type is
> application/x-www-form-urlencoded then we again know that the data is
> 7 bit ascii (not that all browsers respect this) and we can convert it
> ourselves, for multipart/form-data we can also do the right thing.
> 
> all of this (while something we really really should do) requires
> hacking the various backends.
> 
> > Unfortunately, after we do a:
> >   (setf (external-format-for :http)  :utf-8-unix)      
> >
> > things will break, because browsers will send binary data in multiparts,
> > which will be non-UTF-8-conforming and will break things.
> >
> > Solving this isn't obvious -- we'd need to parse multipart content using
> > a "safe" stream format (byte-oriented) and then probably create other
> > streams which are utf-8, with request parts. Or do away with the stream
> > metaphor altogether and just work on in-memory request data (we store
> > all request data anyway in mod-lisp).
> 
> so the solution to this issue requires:
> 
> 1) changing the various backends to use byte-streams and not character
>    streams. add encoding to/from strings where needed.
> 
> 2) changing rfc2388 along the same lines.
> 
> not a trivial job, but definitely doable.

Right.
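
To make step 1 concrete, here is a minimal sketch of handling the request
as octets and decoding only the header block by hand. It assumes the
whole request is already in memory as a vector of octets and that the
Lisp's character codes agree with ASCII below 128. The function names
are hypothetical; none of this is existing UCW or rfc2388 code.

```lisp
;; Hypothetical helpers, not part of UCW or rfc2388.

(defun find-header-end (octets)
  "Return the index just past the first CR LF CR LF in OCTETS, or NIL."
  (let ((pos (search #(13 10 13 10) octets)))
    (and pos (+ pos 4))))

(defun ascii-octets-to-string (octets &key (start 0) (end (length octets)))
  "Decode the 7-bit ASCII octets between START and END into a string.
Assumes CODE-CHAR maps codes below 128 to the ASCII characters."
  (let ((string (make-string (- end start))))
    (loop for i from start below end
          for j from 0
          for octet = (aref octets i)
          do (assert (< octet 128) ()
                     "Non-ASCII octet ~D at index ~D" octet i)
             (setf (char string j) (code-char octet)))
    string))

(defun split-request (octets)
  "Return two values: the header block as a string (including the blank
line that ends it) and the body, still as octets."
  (let ((end (or (find-header-end octets) (length octets))))
    (values (ascii-octets-to-string octets :end end)
            (subseq octets end))))
```

The body stays a vector of octets, so binary multipart data such as an
image upload never passes through a character decoder.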

I'm desperate enough to actually dive into this, but I feel I'm missing
something fundamental about CL -- namely, how to treat byte-data in
memory (say, in a vector) as a utf-8 character stream. I can't find a
way, and yet there should be one.
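
Short of a built-in way, one fallback is to decode by hand and wrap the
result in a string stream. A minimal sketch follows. It assumes a
Unicode-capable Lisp (CODE-CHAR must accept the decoded code points) and
that decoding the whole vector up front is acceptable. The function
names are hypothetical, and it skips the checks for overlong encodings
and surrogates.

```lisp
;; Hypothetical helpers; only standard Common Lisp is used.

(defun utf-8-octets-to-string (octets)
  "Decode OCTETS (a vector of integers 0-255) as UTF-8 into a string.
Signals an error on truncated or malformed sequences. Does not reject
overlong encodings or surrogates (a simplification for this sketch)."
  (with-output-to-string (out)
    (let ((i 0)
          (n (length octets)))
      (flet ((continuation ()
               ;; Consume one 10xxxxxx octet and return its 6 payload bits.
               (when (>= i n)
                 (error "Truncated UTF-8 sequence"))
               (let ((octet (aref octets i)))
                 (unless (= (logand octet #xC0) #x80)
                   (error "Bad continuation octet ~D at index ~D" octet i))
                 (incf i)
                 (logand octet #x3F))))
        (loop while (< i n)
              do (let* ((lead (aref octets i))
                        ;; Number of continuation octets implied by the lead.
                        (extra (cond ((< lead #x80) 0)
                                     ((= (logand lead #xE0) #xC0) 1)
                                     ((= (logand lead #xF0) #xE0) 2)
                                     ((= (logand lead #xF8) #xF0) 3)
                                     (t (error "Bad lead octet ~D at index ~D"
                                               lead i))))
                        ;; Payload bits carried by the lead octet itself.
                        (code (logand lead
                                      (aref #(#x7F #x1F #x0F #x07) extra))))
                   (incf i)
                   (loop repeat extra
                         do (setf code (logior (ash code 6) (continuation))))
                   (write-char (code-char code) out)))))))

(defun make-utf-8-input-stream (octets)
  "Return a character input stream over the UTF-8 text in OCTETS."
  (make-string-input-stream (utf-8-octets-to-string octets)))
```

With this, (read-line (make-utf-8-input-stream part)) reads decoded text
from a single request part, where part is a hypothetical octet vector
holding a text field, while binary parts are simply never decoded.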

--J.


