[armedbear-devel] Unicode support vs spec conformance

Mon Apr 5 17:10:22 UTC 2010

I believe compliance trumps everything else, and that's how ABCL distinguishes itself from any other JVM-based implementation of CL.

From this point of view 1b seems the most reasonable.

There are two types of users:

- a programmer writing code, and he/she shouldn't be surprised by anything;

- a client of the programmer, and he/she shouldn't see any difference one way or the other.
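
To make option 1b concrete, here is a minimal sketch of the idea. It is in Python only because its str.upper and str.lower follow the Unicode case mappings; the names strict_upcase and unicode_upcase are hypothetical and are not ABCL, SBCL or CLHS functions. The strict function plays the role of a CLHS-style char-upcase that only maps characters which round-trip, and the second is the kind of separate Unicode casing function 1b would add next to it.

```python
# Illustration only: Python's str.upper()/str.lower() stand in for the
# Unicode case mappings. Both function names are hypothetical.

DOTLESS_I = "\u0131"  # LATIN SMALL LETTER DOTLESS I


def unicode_upcase(ch: str) -> str:
    """Unicode-style upcasing of a single character."""
    up = ch.upper()
    # Some characters upcase to more than one code point (for example
    # LATIN SMALL LETTER SHARP S); a character-level function can only
    # return one character, so such characters are left unchanged here.
    return up if len(up) == 1 else ch


def strict_upcase(ch: str) -> str:
    """Upcase only when downcasing the result leads back to ch."""
    up = unicode_upcase(ch)
    return up if up.lower() == ch else ch


# The ordinary small i round-trips, so both functions agree on it.
assert strict_upcase("i") == "I"
assert unicode_upcase("i") == "I"

# The dotless i does not round-trip: the strict function treats it as
# caseless, while the Unicode function maps it to the capital I.
assert strict_upcase(DOTLESS_I) == DOTLESS_I
assert unicode_upcase(DOTLESS_I) == "I"
```

With this split, code that relies on exact case pairs keeps working, and code that wants Unicode behaviour has to ask for it explicitly.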

> Date: Sun, 4 Apr 2010 14:38:40 +0200
> From: ehuels at gmail.com
> To: armedbear-devel at common-lisp.net
> Subject: [armedbear-devel] Unicode support vs spec conformance
> 
> This started with Douglas Miles remarking that 2 tests in the ansi
> test suite have been failing ever since we increased CHAR-CODE-LIMIT
> to #x10000. That change was associated with our intended support for
> Unicode.
> 
> As it turns out, Unicode has defined characters which conflict with a
> requirement from the CLHS: CLHS requires that characters be defined in
> exact pairs, if they have 'case'. This means that the functions
> char-upcase and char-downcase can be used to retrieve the character's
> "other case" equivalent.
> 
> However, in Unicode there are characters which don't have that
> property: for example LATIN SMALL LETTER DOTLESS I maps to the LATIN
> CAPITAL LETTER I, just as LATIN SMALL LETTER I does. Obviously, the
> capital can't map back to both. This is where the issue with CLHS
> compliance emerges.
> 
> Three possible solutions have come up:
> 
> 1a. Be CLHS compliant, but not Unicode
> 1b. Same as (1a), but provide specific Unicode up/down casing functions
> 2. Be Unicode compliant and not CLHS.
> 
> SBCL (and from Sam Steingold's remark CLISP too) chooses 1a: I haven't
> found a function to do Unicode up/down casing; it defines the
> uppercase of the dotless i to be itself (caseless). This solution
> results in CLHS compliance, but in my opinion, isn't the solution with
> the least surprise: if you decide you want to upcase a string -
> without in-depth awareness of the issue - you're suddenly faced with a
> string which is upcased, except for a number of characters.
> 
> I would propose we - documentedly - diverge from the CLHS on this
> issue: we follow Unicode and the upper case version of the dotless i
> is just the capital i. From a user perspective, this seems like the
> solution of least surprise: I would expect people who use characters
> which can't be round-tripped to understand about that.
> 
> Sam suggests there's an issue with symbol i/o. He's probably referring
> to readtable-case and *print-case*. My reasoning here too is that
> people using characters like these in their symbols can be expected to be
> familiar with both the casing behaviour of the Common Lisp
> reader/printer and the behaviour of their letters in such
> circumstances: if a string were uppercased in a certain way, wouldn't
> it be extremely weird if your symbols wouldn't too - given that your
> Common Lisp claims Unicode support?
> 
> 
> So, my proposal here is to diverge. What are your opinions?
> 
> 
> Bye,
> 
> 
> Erik.