[ansi-test-devel] Unicode, CHAR-UPCASE/CHAR-DOWNCASE and char-upcase.1/char-upcase.2

Mon Apr 19 13:08:36 UTC 2010

On 4/15/10 5:18 PM, Erik Huelsmann wrote:
> On Thu, Apr 15, 2010 at 3:15 AM, Raymond Toy <toy.raymond at gmail.com> wrote:
>   
>> On 4/3/10 7:00 PM, Erik Huelsmann wrote:
>>     
>>> Ever since ABCL raised its CHAR-CODE-LIMIT from 256 to #x10000, 2
>>> tests started failing: char-upcase.1 and char-upcase.2.
>>>
>>> These 2 tests iterate through all integers between 0 and
>>> CHAR-CODE-LIMIT. While doing so, they test for the property that
>>> upcasing and downcasing returns the same character again
>>> ("round-tripping"). This property of characters is specified in
>>> section 13.1.4.3
>>> (http://www.lispworks.com/documentation/lw51/CLHS/Body/13_adc.htm)
>>> "Characters with case". In short: characters with case are defined in
>>> pairs; additional characters with case have to be defined in pairs
>>> too.
>>>
>>>       
>> But doesn't 13.1.4.3 also say characters with case are a subset of
>> alphabetic characters, and the glossary says alphabetic characters are
>> A-Z and a-z or any other implementation-defined character with case or
>> other graphic character defined by the implementation to be alphabetic?
>> So doesn't this mean the implementation can define the dotless-i
>> character as non-alphabetic?  I guess that would also imply that
>> alpha-char-p returns NIL for such characters.
>>     
> Right. You can do that, but then it can't have case anymore, meaning
> that CHAR-UPCASE should return the same value. Along the same lines of
>   
Why can't the character have case?  13.1.4.3.4 says an implementation
can define graphic characters with case.  But it also says there has to
be a one-to-one correspondence, so the round-tripping is required, no
matter what.

That seems to settle the issue.
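
To make the one-to-one requirement concrete, here is a minimal sketch of
a check over a range of codes.  The name CASE-ROUND-TRIP-FAILURES and its
START/END parameters are hypothetical, chosen only for illustration; the
sketch relies on nothing beyond the standard BOTH-CASE-P, UPPER-CASE-P,
CHAR-UPCASE and CHAR-DOWNCASE.

```lisp
(defun case-round-trip-failures (&optional (start 0) (end char-code-limit))
  "Return the codes in [START, END) whose characters have case but do not
round-trip through the opposite case.  Hypothetical helper for illustration."
  (loop for i from start below end
        for c = (code-char i)
        ;; CODE-CHAR may return NIL for codes that have no character.
        when (and c
                  (both-case-p c)
                  (not (eql c (if (upper-case-p c)
                                  (char-upcase (char-downcase c))
                                  (char-downcase (char-upcase c))))))
          collect i))
```

Restricted to the standard characters, e.g. (case-round-trip-failures 0 128),
this should come back empty, since A-Z and a-z are defined in pairs.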
> definition of STRING-UPCASE, that would mean that so should
> STRING-UPCASE...
>   
Right.  Except with cmucl, string-upcase uses char-upcase on each
character (taking into account the surrogate pairs).  But cmucl also
allows some options for upcasing according to simple or full Unicode casing.
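
As a rough illustration of the per-character approach, here is a minimal
sketch.  PER-CHAR-STRING-UPCASE is a hypothetical name, not a cmucl
function, and unlike what is described above it makes no attempt to
handle surrogate pairs.

```lisp
(defun per-char-string-upcase (string)
  "Upcase STRING one character at a time with CHAR-UPCASE.
Hypothetical helper for illustration; surrogate pairs are not handled."
  (map 'string #'char-upcase string))
```

The result always has the same length as the input, which is why a
per-character mapping can only express simple casing: a full-casing
rule such as sharp s to "SS" changes the length of the string.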

For the record, here is some code that finds characters that don't round
trip on cmucl.  Perhaps this would be a useful test.

(loop for i from 0 below char-code-limit
      for c = (code-char i)
      ;; CODE-CHAR may return NIL for some codes, so check C first.
      when (and c
                (lower-case-p c)
                (not (eql (char-downcase (char-upcase c)) c)))
        collect (list i (char-name c)))
->
((181 "Micro_Sign") (305 "Latin_Small_Letter_Dotless_I")
 (383 "Latin_Small_Letter_Long_S") (962 "Greek_Small_Letter_Final_Sigma")
 (976 "Greek_Beta_Symbol") (977 "Greek_Theta_Symbol") (981 "Greek_Phi_Symbol")
 (982 "Greek_Pi_Symbol") (1008 "Greek_Kappa_Symbol") (1009 "Greek_Rho_Symbol")
 (1013 "Greek_Lunate_Epsilon_Symbol")
 (7835 "Latin_Small_Letter_Long_S_With_Dot_Above")
 (8126 "Greek_Prosgegrammeni"))

(loop for i from 0 below char-code-limit
      for c = (code-char i)
      ;; Same check in the other direction: uppercase -> lowercase -> uppercase.
      when (and c
                (upper-case-p c)
                (not (eql (char-upcase (char-downcase c)) c)))
        collect (list i (char-name c)))
->
((304 "Latin_Capital_Letter_I_With_Dot_Above")
 (1012 "Greek_Capital_Theta_Symbol") (7838 "Latin_Capital_Letter_Sharp_S")
 (8486 "Ohm_Sign") (8490 "Kelvin_Sign") (8491 "Angstrom_Sign"))

Ray