[babel-devel] Unicode issues, esp security

Mon Apr 13 21:55:46 UTC 2009

Dan Weinreb <dlw at itasoftware.com> writes:

> http://www.unicode.org/reports/tr36/

Thanks for that link.

> Cases like this, in which an illegal sequence is explicitly
> transformed into another illegal sequence, would meet with a lot of
> resistance from folks who care about security.

Assuming you're referring to UTF-8B, it should be pointed out (as James
already did) that it's not specified by Unicode, and I would add that it
certainly isn't a general-purpose encoding.

James also points out that UTF-8B in fact follows the guidelines put
forward by TR36. That's not so surprising, since UTF-8B was, after all,
proposed by a Unicode expert.
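
For readers unfamiliar with the scheme, here is a minimal sketch of the
idea. It is written in Python rather than Lisp, and it assumes that
Python's "surrogateescape" error handler is an acceptable stand-in for
UTF-8B: it maps each undecodable byte 0xNN to the code point U+DCNN and
reverses that mapping on output. Nothing here is Babel's or CCL's API.

```python
# Sketch only: Python's "surrogateescape" handler used as a stand-in
# for the UTF-8B mapping discussed in this thread.

raw = b"abc\xff\x80def"  # 0xFF and 0x80 are not valid UTF-8 here

# Invalid bytes become lone low surrogates instead of raising an error.
decoded = raw.decode("utf-8", errors="surrogateescape")

# Collect the code points that stand in for the invalid bytes.
escapes = [ord(c) for c in decoded if 0xDC80 <= ord(c) <= 0xDCFF]

# Encoding with the same handler restores the original bytes exactly.
roundtrip = decoded.encode("utf-8", errors="surrogateescape")

# A strict encoder still refuses the escaped string, so the invalid
# input cannot silently leak out as if it were well-formed UTF-8.
try:
    decoded.encode("utf-8")
    strict_rejects = False
except UnicodeEncodeError:
    strict_rejects = True
```

The point of the round trip is that the invalid input stays recognizably
invalid: the escapes are lone surrogates, which a strict UTF-8 encoder
rejects.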

> It's important not to do anything outside the definition.  Your
> objection to CODE-CHAR returning NIL is incompatible with the Unicode
> concept of "Noncharacters".  See the Unicode report section 16.7.

Well, that section says that the "Unicode Standard sets aside 66
noncharacter code points", and proceeds to specify them. CCL's CODE-CHAR
returns *non-NIL* for all of those codes -- at least in the oldish
version I have installed. A few comments about that:

    1. Though Gary has hinted that he would like CCL to return NIL for
       these codes, it's probably a good thing that CODE-CHAR currently
       returns non-NIL for noncharacters. In the next paragraph from
       that section, the standard says that "applications are free to
       use any of these noncharacter code points internally".

    2. Surrogate code points are not "noncharacters". The extra code
       points used by UTF-8B to represent invalid bytes are a subset of
       the surrogate code points. This distinction is probably not very
       useful, though.
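
To make the two points above concrete, the following sketch (in Python,
not Lisp) enumerates the 66 noncharacter code points and checks that
they do not overlap with the surrogate range. The names are
illustrative only, and the escape range U+DC80..U+DCFF is an assumption
about UTF-8B, namely that it maps the invalid bytes 0x80..0xFF to those
code points.

```python
# Sketch only: names are hypothetical, not taken from Babel or CCL.

def noncharacters():
    """Return the 66 noncharacter code points as a list.

    These are U+FDD0..U+FDEF (32 code points) plus the last two code
    points of each of the 17 planes (34 code points).
    """
    codes = list(range(0xFDD0, 0xFDF0))
    for plane in range(17):
        base = plane * 0x10000
        codes += [base + 0xFFFE, base + 0xFFFF]
    return codes

NONCHARS = set(noncharacters())

# Surrogate code points: U+D800..U+DFFF.
SURROGATES = set(range(0xD800, 0xE000))

# Assumed UTF-8B escape range: one code point per byte 0x80..0xFF.
UTF8B_ESCAPES = set(range(0xDC80, 0xDD00))

count = len(NONCHARS)
disjoint = NONCHARS.isdisjoint(SURROGATES)
escapes_are_surrogates = UTF8B_ESCAPES <= SURROGATES
```

Under these assumptions the escape code points are all surrogates and
none of them is a noncharacter, so section 16.7 does not apply to them
directly.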

-- 
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/