[armedbear-devel] Unicode support vs spec conformance

Erik Huelsmann ehuels at gmail.com
Sun Apr 4 12:38:40 UTC 2010


This started with Douglas Miles remarking that 2 tests in the ANSI
test suite have been failing ever since we increased CHAR-CODE-LIMIT
to #x10000. That change was part of our intended support for
Unicode.

As it turns out, Unicode defines characters which conflict with a
requirement from the CLHS: characters that have 'case' must come in
exact pairs, with uppercase and lowercase characters in one-to-one
correspondence. This means that the functions char-upcase and
char-downcase can always be used to retrieve a character's
"other case" equivalent, and to get back again.

However, in Unicode there are characters which don't have that
property: for example, LATIN SMALL LETTER DOTLESS I maps to LATIN
CAPITAL LETTER I, just as LATIN SMALL LETTER I does. Obviously, the
capital can't map back to both. This is where the conflict with CLHS
compliance emerges.
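
To make the round-trip problem concrete, here is a small sketch. It is
in Python rather than Lisp, chosen only because Python's str.upper and
str.lower follow the Unicode case mappings; the variable names are
mine and purely illustrative.

```python
# LATIN SMALL LETTER DOTLESS I, code point U+0131
DOTLESS_I = "\u0131"

# Both small letters upcase to the same capital, U+0049.
upper_of_dotless = DOTLESS_I.upper()
upper_of_plain = "i".upper()

# Downcasing the capital can only pick one of them: the dotted i.
back = upper_of_dotless.lower()

# So upcase followed by downcase does not return the original character.
round_trips = (back == DOTLESS_I)

print(upper_of_dotless == upper_of_plain)  # True
print(round_trips)                         # False
```

The capital I has two lowercase candidates, so no char-downcase can
undo char-upcase for both of them.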

Three possible solutions have come up:

1a. Be CLHS compliant, but not Unicode compliant.
1b. Same as (1a), but provide specific Unicode up/down casing functions.
2. Be Unicode compliant, but not CLHS compliant.

SBCL (and, judging from Sam Steingold's remark, CLISP too) chooses 1a:
it defines the uppercase of the dotless i to be itself (caseless), and
I haven't found a function in it to do Unicode up/down casing. This
solution results in CLHS compliance, but in my opinion it isn't the
solution of least surprise: if you decide you want to upcase a string,
without in-depth awareness of the issue, you're suddenly faced with a
string which is upcased, except for a number of characters.
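
A rough sketch of what option 1a means for string upcasing follows. It
is written in Python (only because its str.upper follows Unicode), and
it is not how SBCL or CLISP actually implement this: pair_upcase and
pair_string_upcase are hypothetical helpers that keep a case mapping
only when it round-trips, which is my reading of the exact-pairs rule.

```python
def pair_upcase(ch: str) -> str:
    """Upcase ch only if it forms an exact upper/lower pair (hypothetical helper)."""
    up = ch.upper()
    # Keep the mapping only when it is a single character that maps straight back.
    if len(up) == 1 and up != ch and up.lower() == ch:
        return up
    # Otherwise treat the character as caseless, as option 1a does.
    return ch


def pair_string_upcase(s: str) -> str:
    """Apply pair_upcase to every character of s."""
    return "".join(pair_upcase(c) for c in s)


word = "k\u0131z"                       # contains a dotless i
unicode_result = word.upper()           # Unicode mapping (option 2)
pair_result = pair_string_upcase(word)  # exact-pairs mapping (option 1a)

print(unicode_result)  # KIZ
print(pair_result)     # K, then the unchanged dotless i, then Z
```

Under the exact-pairs rule the dotless i is left alone in the middle of
an otherwise upcased string, which is the surprise described above.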

I would propose that we diverge from the CLHS on this issue, and
document that we do: we follow Unicode, and the upper case version of
the dotless i is just the capital i. From a user perspective, this
seems like the solution of least surprise: I would expect people who
use characters which can't be round-tripped to be aware of that.

Sam suggests there's an issue with symbol i/o. He's probably referring
to READTABLE-CASE and *PRINT-CASE*. My reasoning here too is that
people using characters like these in their symbols can be expected to
be familiar with both the casing behaviour of the Common Lisp
reader/printer and the behaviour of their letters in such
circumstances: if a string were uppercased in a certain way, wouldn't
it be extremely weird if your symbols weren't too, given that your
Common Lisp claims Unicode support?
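
One way the symbol i/o concern could show up is sketched below, again
in Python only because its str.upper follows Unicode. The function
read_symbol_name is a hypothetical stand-in for a reader that upcases
tokens, not ABCL's actual reader, and I am assuming this is the kind
of case Sam has in mind.

```python
def read_symbol_name(token: str) -> str:
    """Hypothetical stand-in for a reader that upcases symbol tokens."""
    return token.upper()


plain = read_symbol_name("file")          # spelled with a dotted i
dotless = read_symbol_name("f\u0131le")   # spelled with a dotless i

# With Unicode upcasing, both spellings yield the same symbol name.
same_symbol = (plain == dotless)

print(plain)        # FILE
print(same_symbol)  # True
```

With Unicode casing the two spellings read as one symbol, which matches
what upcasing the same text as a string would give.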


So, my proposal here is to diverge. What are your opinions?


Bye,


Erik.



