[Ecls-list] Re: base-string patch

Sat May 20 07:47:05 UTC 2006

Brian Spilsbury wrote:
>> I hope I do not seem too picky. My point is the following: I do not like
>> the dispatch mechanism to be based on replacing functions and I want to
>> have the core lisp library available as directly callable C functions.
>> The reason is efficiency: you do not want to go through the lisp
>> calling process for each character and at the same time you want the
>> functions in the core library to have all these standard functions
>> available for them.
>>
>> At the same time, Unicode prescribes that the first 256 code points
>> are those of Latin-1. That would mean the effect of the lisp functions
>> should be predictable at least for base-char and base-string. That would
>> be really desirable because of performance issues: the core gives
>> predictable results on base strings and the case of base strings can be
>> optimized. But perhaps I am wrong in my assumption that base
>> strings are "easy".
>
> Unicode avoids the issue of language, delegating it to the application
> level. What I mean by script is 'a context for the interpretation of text
> attributes', which goes beyond Unicode in that it incorporates
> language.

Yes, this I understand. But Unicode already carries a lot of
language-related information in its database. Properties such as the
lowercase, uppercase and titlecase mappings are there, as well as rules to
determine how to uppercase a character given its context in a string.

SBCL uses this information, though it leaves some things out, such as
context-sensitive ordering, string normalization and character upcasing
when the output involves more than one character (like ß -> SS in German).
All of this comes from just skimming the code and might be wrong.
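As an illustration of the multi-character case, here is a minimal sketch.
It is written in Python only because Python's str.upper applies the full
Unicode case mappings; it is not ECL or SBCL code, and the names
simple_upcase and full_upcase are hypothetical. It shows the difference
between an upcase that must return exactly one character and one that may
lengthen the string.

```python
def simple_upcase(ch: str) -> str:
    """Character-level upcase that must return exactly one character.

    Hypothetical helper: if the full Unicode mapping expands to several
    characters (as for the German sharp s), the input is returned unchanged.
    """
    if len(ch) != 1:
        raise ValueError("expected a single character")
    up = ch.upper()
    return up if len(up) == 1 else ch


def full_upcase(s: str) -> str:
    """String-level upcase using the full mappings; the length may grow."""
    return s.upper()


if __name__ == "__main__":
    word = "straße"
    # Per-character mapping keeps the length, so the sharp s survives.
    print("".join(simple_upcase(c) for c in word))
    # Full mapping expands the sharp s to two letters.
    print(full_upcase(word))
```

A character-level function cannot express the second result at all, which
is presumably why an implementation would leave that mapping out.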

Then several functions in the Common Lisp core library require string
upcasing and downcasing, as well as some kind of ordering. The character
case is defined for the Latin alphabet and it must be there for symbols to
work properly, as they are internally uppercased. The order is defined in
terms of the character code, and it should be reproducible.
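A small sketch of what ordering by character code means in practice. It
is again in Python and purely illustrative; code_point_less is a
hypothetical name. The comparison looks only at code points, so the result
is the same everywhere, but it is not what a dictionary would give.

```python
def code_points(s: str) -> list:
    """Return the list of code points of the characters in s."""
    return [ord(c) for c in s]


def code_point_less(a: str, b: str) -> bool:
    """Compare two strings lexicographically by code point only."""
    return code_points(a) < code_points(b)


if __name__ == "__main__":
    words = ["zebra", "Äpfel", "apple", "Zebra"]
    # Uppercase ASCII sorts before lowercase ASCII, and both sort before
    # the accented letter, regardless of any locale.
    print(sorted(words, key=code_points))
```

The order is reproducible precisely because no language-specific rule is
consulted.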

How about doing the following:

1) Provide an implementation of char-<, char-up/downcase, string-< and
string-up/downcase which is SBCL compatible. The rules should be
documented and based on the Unicode database.

2) Provide some macrology to create packages which encapsulate versions of
the previous functions for other languages or particular needs. Naming
should be something like SCRIPT.LANGUAGE and they should all inherit from
SCRIPT.ECL, which reexports the default versions.

3) The core routines will use SCRIPT.ECL. This means that the behavior of
EQUALP, EQUALP hashtables and interning of symbols will be hardcoded. This
does not seem too disturbing -- correct me if I am wrong.

4) We need to have functions to convert to and from the locale. It is
worth investigating the issues with pathnames: the operating system might
assume some locale for the strings passed to open, write, mkdir, etc.

5) Optionally, one might define SCRIPT.DYNAMIC, where the previous
functions have dispatching versions that go to the package named in a
global variable, like *UNICODE-SCRIPT*.
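Points 2, 3 and 5 can be sketched roughly as follows. This is only a model
of the idea in Python, not a proposal for the actual implementation: the
Script class, the registry, SCRIPT.TURKISH and its upcasing rule are all
assumptions made for the example, and a module-level variable stands in
for *UNICODE-SCRIPT*.

```python
from typing import Callable, Dict, Optional


class Script:
    """A named table of text operations that falls back to a parent."""

    def __init__(self, name: str, parent: Optional["Script"] = None,
                 **ops: Callable[[str], str]) -> None:
        self.name = name
        self.parent = parent
        self.ops: Dict[str, Callable[[str], str]] = dict(ops)

    def lookup(self, op: str) -> Callable[[str], str]:
        """Find op here or in the nearest ancestor that defines it."""
        script: Optional[Script] = self
        while script is not None:
            if op in script.ops:
                return script.ops[op]
            script = script.parent
        raise KeyError(op)


def turkish_upcase(s: str) -> str:
    """Hypothetical language rule: dotted i upcases to dotted capital I."""
    return s.replace("i", "\u0130").upper()


# Default versions; everything else inherits from this one (point 2).
CORE = Script("SCRIPT.ECL", string_upcase=str.upper, string_downcase=str.lower)
TURKISH = Script("SCRIPT.TURKISH", parent=CORE, string_upcase=turkish_upcase)
REGISTRY = {s.name: s for s in (CORE, TURKISH)}

# Stand-in for the global variable *UNICODE-SCRIPT* (point 5).
_current_script = "SCRIPT.ECL"


def use_script(name: str) -> None:
    """Select the script that the dispatching versions will use."""
    global _current_script
    if name not in REGISTRY:
        raise KeyError(name)
    _current_script = name


def core_string_upcase(s: str) -> str:
    """What the core routines call: always SCRIPT.ECL (point 3)."""
    return CORE.lookup("string_upcase")(s)


def dynamic_string_upcase(s: str) -> str:
    """Dispatching version: goes through the currently selected script."""
    return REGISTRY[_current_script].lookup("string_upcase")(s)


if __name__ == "__main__":
    use_script("SCRIPT.TURKISH")
    print(dynamic_string_upcase("istanbul"))  # language-specific rule
    print(core_string_upcase("istanbul"))     # hardcoded default
```

The point of the split is that the core never pays for the lookup: only
callers that ask for the dispatching versions go through the variable.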

I have tried to think of multithreading issues, but I see none.

Juanjo