[Bese-devel] Re: character issues. aka: http is a binary protocol, get over it.

Thu Dec 15 23:07:44 UTC 2005

On Thu, Dec 15, 2005 at 07:18:57PM +0100, Marco Baringer wrote:

>> in utf-8-encoded source, the input field name as far as RFC is
>> concerned will just be a Latin-1 string "????´????°????????????????????????".
>> Browsers will just send straight byte-to-byte copies of field name as
>> seen in HTML source and everything will be correct.
>
>this is good to know. however, what happens when one of the utf-8
>encoded characters, when viewed as a byte sequence, conflicts with the
>standard application/x-www-form-urlencoded markers? the utf-8 sequence
>0626 (arabic yeh with hamza) would parse as the control sequence 06
>plus a #\& character, this would confuse my current parser greatly.

0626 is the Unicode code point number, not the UTF-8 byte sequence. In UTF-8,
every non-ASCII character is encoded as a sequence of bytes that all have the
eighth bit set (values greater than 127), so none of them can be mistaken for
& or any other ASCII marker. It could be a problem if a browser sent us a
binary (i.e. Content-Transfer-Encoding: 8bit, not Base64 or Quoted-Printable)
stream encoded in UCS-2 or UCS-4, but I have never seen such a perversion in
my life, and I don't think we should worry about it.
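
To make that concrete, here is a quick sketch in Python (picked only for
illustration; the helper name has_ascii_byte is made up for this example and
is not part of any parser discussed here). It encodes U+0626 both ways and
checks whether any resulting byte falls in the ASCII range:

```python
# U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE, the character from the question.
ch = "\u0626"

utf8 = ch.encode("utf-8")       # what a browser sends for a UTF-8 page
ucs2 = ch.encode("utf-16-be")   # same as big-endian UCS-2 for this code point


def has_ascii_byte(data: bytes) -> bool:
    """Return True if any byte is below 0x80, i.e. could look like ASCII."""
    return any(b < 0x80 for b in data)


print(utf8.hex(), has_ascii_byte(utf8))   # d8a6 False
print(ucs2.hex(), has_ascii_byte(ucs2))   # 0626 True
print(b"&" in utf8, b"&" in ucs2)         # False True
```

Only the raw UCS-2 form contains the byte 0x26, which is the & character; the
UTF-8 form is D8 A6 and cannot be confused with a field separator.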

-- 
-><- This signature intentionally left blank. -><-