[Bese-devel] Re: character issues. aka: http is a binary protocol, get over it.

Maciek Pasternacki maciekp at japhy.fnord.org
Thu Dec 15 17:38:53 UTC 2005


On Prickle-Prickle, The Aftermath 57, 3171 YOLD, Marco Baringer wrote:

> what do various browsers send when the name of a field contains non
> ascii chars? (i ask only about the name because i'm completely ignoring
> how to handle the data for now). i'm interested in both GET and POST
> (with application/x-www-form-urlencoded and multipart/form-data
> encoding).
>
> I'm pretty sure that application/x-www-form-urlencoded (GET and
> regular POST) only allows latin-1 characters, but 1) i don't know what
> happens if we try to do it anyway, 2) there's always
> multipart/form-data which would allow it via the =?utf-16?Q?=00=DF?=
> syntax. this would unfortunately require that rfc2388 know about
> character sets and encodings (which is something i'm trying hard to
> avoid).

As for latin-1 characters: in practice it lets a page use any octets,
which are passed around as opaque `cookies' with no meaning and no
charset attached.  In Latin-1 all octets are legal, so when I write
<input name="Здравствуйте"> in utf-8-encoded source, the field name,
as far as the RFC is concerned, is just the Latin-1 string
"ÐÐ´ÑÐ°Ð²ÑÑÐ²ÑÐ¹ÑÐµ" (plus a few unprintable control characters).
Browsers will just send a straight byte-for-byte copy of the field
name as seen in the HTML source, and everything will be correct.
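
To make the byte-for-byte point concrete, here is a minimal sketch.
Python is used purely for illustration, and the variable names are
hypothetical.  It assumes the page source is UTF-8, as in the example
above, and checks that viewing the octets as Latin-1 loses nothing:

```python
# Field name from the example above, as typed in a UTF-8 page source.
name = "Здравствуйте"

# The octets that actually sit in the HTML source (assumed UTF-8).
wire_bytes = name.encode("utf-8")

# What RFC-level code sees if it treats those octets as Latin-1:
# one character per octet, every octet value is legal.
as_latin1 = wire_bytes.decode("iso-8859-1")

# Going back to octets is lossless, so the original name is recoverable.
round_tripped = as_latin1.encode("iso-8859-1")
recovered = round_tripped.decode("utf-8")

print(len(name), len(wire_bytes), len(as_latin1))
print(recovered == name)
```

Each Cyrillic letter takes two octets in UTF-8, so the Latin-1 view is
twice as long as the original name, but nothing is lost.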

I would be more worried about the charset in which the browser sends
input field values, but it seems to be the charset in which the page
is viewed.  I haven't actually tested sending text with characters
outside the page charset; for now I just use UTF-8 and don't worry.

What is wrong with my solution of treating the stream as an
iso-8859-1-encoded string (which is completely equivalent to a binary
stream), and recoding it when I expect text to be in another charset?
On one hand it's a kind of hack; OTOH we follow the RFCs and work with
Latin-1 text, as they state (and which is easier to debug than parsing
byte arrays).  Only after parsing, once all the protocol-related work
is done, do we re-encode the Latin-1 text to the encoding we expect
(or turn it back into byte arrays).  Encoding issues thus come up only
at the point where they are actually relevant, and not earlier, where
they would cause trouble.  The same goes for encoding the reply to
send out: the app works with Unicode text, and when that text has to
be encoded in some way, it is recoded to transparent Latin-1 so as not
to bother the RFC-related code.
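
A rough sketch of that flow, again in Python purely for illustration.
The helper names parse_form_latin1, recode and to_wire are made up for
this example, and it assumes an application/x-www-form-urlencoded body
whose text the application expects to be UTF-8:

```python
from urllib.parse import parse_qsl, quote


def parse_form_latin1(body: bytes) -> list:
    """Protocol-level parsing: treat every octet as one Latin-1 character."""
    text = body.decode("iso-8859-1")
    # Percent-escapes are also decoded as Latin-1, so no charset is assumed yet.
    return parse_qsl(text, keep_blank_values=True, encoding="iso-8859-1")


def recode(value: str, charset: str = "utf-8") -> str:
    """After parsing: turn transparent Latin-1 text into real text."""
    return value.encode("iso-8859-1").decode(charset)


def to_wire(text: str, charset: str = "utf-8") -> str:
    """Before sending: encode real text, then view the octets as Latin-1."""
    return text.encode(charset).decode("iso-8859-1")


# A body such as a browser might send for a UTF-8 page (assumption).
body = ("name=" + quote("Здравствуйте", encoding="utf-8")).encode("ascii")

pairs = parse_form_latin1(body)      # still charset-agnostic
field, raw_value = pairs[0]
value = recode(raw_value)            # charset applied only here

print(field, value)
print(to_wire(value) == raw_value)
```

The parsing step never needs to know the charset; only recode and
to_wire do, and they sit at the edges of the application.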

-- 
__    Maciek Pasternacki <maciekp at japhy.fnord.org> [ http://japhy.fnord.org/ ]
`| _   |_\  / { ...a good traveller has no fixed plans,
,|{-}|}| }\/                             and is not intent on arriving... }
\/   |____/                                                  ( Lao Tzu )  -><-



