[Bese-devel] UCW and Unicode
Jan Rychter
jan at rychter.com
Wed Nov 9 15:11:30 UTC 2005
> Jan Rychter <jan at rychter.com> writes:
> > Marco:
> >> Jan Rychter <jan at rychter.com> writes:
> >> > I tried to use UCW in an application with Unicode content. It turns out
> >> > that there are some problems.
> >> >
> >> > First, logging with loglevel +dribble+ won't work, as the logging stream
> >> > isn't able to accept utf-8.
> >>
> >> is adding an :external-format parameter to the stream-log-appender
> >> class enough? (this would then get passed to with-open-file)
> >
> > Not really, because things like image uploads will mess up the stream
> > format anyway, resulting in an ugly crash. Basically, logging all
> > received content isn't such a great idea.
>
> no, probably not. it does help in debugging certain issues though. if
> you think of a decent compromise (something better than dying with
> stream errors) i'd love to hear it.
I can't think of anything reasonable -- myself, I'd remove the content
logging and add it as required, when I actually debug something. It
isn't useful to someone who doesn't know UCW internals anyway.
> > After fixing HTTP headers to include information about a utf-8 encoding
> > things look much better. (BTW -- how do I tell a template-component
> > about what content-type to use in HTTP headers? Apart from a render
> > :around and using internal UCW functions?)
>
> option 1 -
> use a <meta http-equiv="Content-Type" content="text/html; charset=utf-8;"/>
>
> option 2 -
> grab the latest patch from ucw_dev
>
> make your template component a subclass of window-component and add
> a :content-type "text/html; charset=utf-8;" to the default initargs.
I choose... er... option 2.
> > So, I've gotten most things to work, except for one: uploads. These will
> > not work, as there is a fundamental assumption within UCW that we can
> > assume the stream format for an incoming request.
>
> we don't strictly need to though. we know the http headers are 7 bit
> ascii and so we can treat the request as a byte stream and do the
> encoding ourselves. if the content-type is
> application/x-www-form-urlencoded then we again know that the data is
> 7 bit ascii (not that all browsers respect this) and we can convert it
> ourselves, for multipart/form-data we can also do the right thing.
>
> all of this (while something we really really should do) requires
> hacking the various backends.
>
> > Unfortunately, after we do a:
> > (setf (external-format-for :http) :utf-8-unix)
> >
> > things will break, because browsers will send binary data in multiparts,
> > which will be non-UTF-8-conforming and will break things.
> >
> > Solving this isn't obvious -- we'd need to parse multipart content using
> > a "safe" stream format (byte-oriented) and then probably create other
> > streams which are utf-8, with request parts. Or do away with the stream
> > metaphor altogether and just work on in-memory request data (we store
> > all request data anyway in mod-lisp).
>
> so the solution to this issue requires:
>
> 1) changing the various backends to use byte-streams and not character
> streams. add encoding to/from strings where needed.
>
> 2) changing rfc2388 along the same lines.
>
> not a trivial job, but definitely doable.
Right.
I'm desperate enough to actually dive into this, but I feel I'm missing
something fundamental about CL -- namely, how to treat byte-data in
memory (say, in a vector) as a utf-8 character stream. I can't find a
way, and yet there should be one.
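One possible shape for this, as a minimal sketch only: it assumes the flexi-streams library, loaded through Quicklisp, neither of which is mentioned in this thread. The function name utf-8-stream-from-octets is made up for illustration. The idea is to wrap the byte vector in an in-memory binary stream and layer a decoding character stream on top of it, so that binary multipart parts can stay as plain octets and only the text parts get decoded.

```lisp
;; Assumption: flexi-streams is installed and loadable via Quicklisp.
;; Neither is part of UCW or mentioned in this thread.
(ql:quickload "flexi-streams" :silent t)

(defun utf-8-stream-from-octets (octets)
  "Return a character input stream that decodes OCTETS as UTF-8.
UTF-8-STREAM-FROM-OCTETS is a hypothetical helper name."
  (flexi-streams:make-flexi-stream
   ;; Binary stream that reads straight from the in-memory vector.
   (flexi-streams:make-in-memory-input-stream octets)
   ;; Decoding layer: octets in, characters out.
   :external-format :utf-8))

;; The five octets below are the UTF-8 encoding of the four
;; characters c, a, f and e-acute (U+00E9).
(defparameter *sample-octets*
  (coerce #(99 97 102 #xC3 #xA9) '(vector (unsigned-byte 8))))

;; Read through the stream interface ...
(defparameter *line*
  (read-line (utf-8-stream-from-octets *sample-octets*)))

;; ... or decode the whole vector in one step when no stream is needed.
(defparameter *decoded*
  (flexi-streams:octets-to-string *sample-octets* :external-format :utf-8))
```

Both *line* and *decoded* should then hold the same four-character string. Whether this fits the backends and rfc2388 as they stand is a separate question; the sketch only covers the in-memory decoding step.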
--J.