[Ecls-list] Unicode

Brian Spilsbury brian.spilsbury at gmail.com
Thu May 18 07:48:01 UTC 2006


> detecting character type (i.e. alphanumeric or numeric). This could be
> implemented as a set of library routines
> 	ecl_schar()
> 	ecl_alpha_char_p()
> 	ecl_alphanumericp()
> 	ecl_char_upcase()
> 	ecl_string_upcase()
> 	ecl_string_downcase()
> 	...

One problem is that you have a great number of different scripts,
almost all of which you'll never use, and most of which have their own
case transforms and their own notions of digits and alphabetic
characters.

Some of them will also have operations which CL doesn't understand,
such as simplification, title-case, etc.

I strongly suggest that you don't try to solve this in one place.

Instead, think of a script as providing a character repertoire, and a
character repertoire as informing the above predicates (and others
besides).

People rarely switch scripts within a document. This means that if you
remember the last script used, then it will probably be the correct
script to use, and if it isn't, then you can take a bit longer to look
up the one which is correct.
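
The "remember the last script" idea can be sketched in C roughly as
follows. This is only an illustration under stated assumptions: the
`script` struct, `find_script()` and `sketch_alpha_char_p()` are
hypothetical names, not part of ECL, and the two repertoires are
deliberately tiny (ASCII letters and the basic Greek letters only).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical script descriptor: a repertoire test plus one predicate. */
typedef struct script {
    const char *name;
    bool (*contains)(uint32_t ch);      /* is ch in this repertoire? */
    bool (*alpha_char_p)(uint32_t ch);  /* script-specific predicate */
} script;

static bool latin_contains(uint32_t ch) { return ch < 0x80; }
static bool latin_alpha(uint32_t ch) {
    return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
}

/* Simplification: only the unaccented Greek letters are handled. */
static bool greek_contains(uint32_t ch) { return ch >= 0x0370 && ch <= 0x03FF; }
static bool greek_alpha(uint32_t ch) {
    return (ch >= 0x0391 && ch <= 0x03A9 && ch != 0x03A2) ||
           (ch >= 0x03B1 && ch <= 0x03C9);
}

static const script known_scripts[] = {
    { "latin", latin_contains, latin_alpha },
    { "greek", greek_contains, greek_alpha },
};

static const script *last_script = NULL; /* one-entry cache */
static unsigned slow_lookups = 0;        /* counts cache misses */

static const script *find_script(uint32_t ch) {
    /* Fast path: the last script used is probably still the right one. */
    if (last_script != NULL && last_script->contains(ch))
        return last_script;

    /* Slow path: take a bit longer and search every known script. */
    slow_lookups++;
    for (size_t i = 0; i < sizeof known_scripts / sizeof known_scripts[0]; i++) {
        if (known_scripts[i].contains(ch)) {
            last_script = &known_scripts[i];
            return last_script;
        }
    }
    return NULL; /* no known script claims this character */
}

bool sketch_alpha_char_p(uint32_t ch) {
    const script *s = find_script(ch);
    return s != NULL && s->alpha_char_p(ch);
}
```

For a run of characters from one script, only the first call takes
the slow path; every later call is a single repertoire test.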

Once you do that, you can then defer the problem to the interested
parties, and they can produce appropriate implementations - be they in
C, CL or something else.

If the selection happens on a per-operator basis, then it is easy for
it to work with as-yet undefined operators, and one simple method
would be to replace the symbol-function slot with the current
script's implementation.
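
A rough C analogue of replacing the symbol-function slot is a function
pointer that always holds the current script's implementation. The
sketch below is an assumption about how this might look, not ECL code:
`alpha_char_p_slot`, `alpha_miss()` and the two implementations are
hypothetical, and the repertoires are reduced to simple ranges purely
to keep the example short.

```c
#include <stdbool.h>
#include <stdint.h>

typedef bool (*char_pred)(uint32_t ch);

static bool latin_alpha_char_p(uint32_t ch);
static bool greek_alpha_char_p(uint32_t ch);

/* The "slot": callers always go through this pointer. */
char_pred alpha_char_p_slot = latin_alpha_char_p;

/* Called by an implementation when ch is outside its repertoire.
   It installs the right implementation in the slot and retries. */
static bool alpha_miss(uint32_t ch) {
    if (ch >= 0x0370 && ch <= 0x03FF) {
        alpha_char_p_slot = greek_alpha_char_p;
        return greek_alpha_char_p(ch);
    }
    if (ch < 0x80) {
        alpha_char_p_slot = latin_alpha_char_p;
        return latin_alpha_char_p(ch);
    }
    return false; /* no implementation known; leave the slot alone */
}

static bool latin_alpha_char_p(uint32_t ch) {
    if (ch >= 0x80)
        return alpha_miss(ch);
    return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z');
}

static bool greek_alpha_char_p(uint32_t ch) {
    if (ch < 0x0370 || ch > 0x03FF)
        return alpha_miss(ch);
    /* Simplification: only the unaccented Greek letters are handled. */
    return (ch >= 0x0391 && ch <= 0x03A9 && ch != 0x03A2) ||
           (ch >= 0x03B1 && ch <= 0x03C9);
}
```

A caller just writes `alpha_char_p_slot(ch)`. Each operator has its own
slot, so an operator added later needs no change to the others.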

The only issue then would be something to locate the correct operator
for a character upon a cache miss, and that could be handled with a
stack of scripts in some preferential order.
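
The stack of scripts might look something like the C sketch below.
Everything here is hypothetical (`push_script()`, `locate_script()`,
the fixed-size stack), and the Japanese repertoire is approximated by
ASCII plus the Hiragana, Katakana and CJK Unified Ideographs blocks.
The point it tries to show is that repertoires may overlap, so the
order of the stack decides which script answers for a character.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_SCRIPTS 8

typedef struct script {
    const char *name;
    bool (*contains)(uint32_t ch); /* is ch in this repertoire? */
} script;

static bool latin_contains(uint32_t ch) { return ch < 0x80; }

/* Rough approximation: ASCII, kana and the main kanji block. */
static bool japanese_contains(uint32_t ch) {
    return ch < 0x80 ||
           (ch >= 0x3040 && ch <= 0x30FF) ||
           (ch >= 0x4E00 && ch <= 0x9FFF);
}

static const script latin_script    = { "latin",    latin_contains };
static const script japanese_script = { "japanese", japanese_contains };

static const script *script_stack[MAX_SCRIPTS];
static size_t script_depth = 0;

/* Push a script; the most recently pushed one is preferred. */
bool push_script(const script *s) {
    if (script_depth == MAX_SCRIPTS)
        return false;
    script_stack[script_depth++] = s;
    return true;
}

/* On a cache miss, walk the stack from most to least preferred. */
const script *locate_script(uint32_t ch) {
    for (size_t i = script_depth; i-- > 0;) {
        if (script_stack[i]->contains(ch))
            return script_stack[i];
    }
    return NULL;
}
```

With only the Latin script pushed, 'a' resolves to Latin; once the
Japanese script is pushed on top, the same 'a' resolves to Japanese.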

One key point here is that a script is not a particular Unicode range
-- it is just a set of characters which people tend to use together to
write in a given language. A Japanese script, for example, would tend
to include Roman characters and Arabic numerals, as well as
kanamajiri.

This is ignoring the unpleasant fact that vectors of characters are a
really bad way to represent text, which will reduce the utility of
text operations based on CL strings -- you cannot even do
string-upcase properly.
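
One commonly cited case (an illustration, not necessarily the one
meant above) is the German sharp s, U+00DF, whose uppercase form in
Unicode is the two characters "SS". A character-by-character upcase
that keeps the string length cannot express that. The C sketch below
uses a hypothetical `sketch_full_upcase()` that handles only ASCII and
U+00DF, to show that the output length has to be allowed to differ
from the input length.

```c
#include <stddef.h>
#include <stdint.h>

/* Upcase in[0..in_len) into out, writing at most out_cap code points.
   Returns the number of code points the full result needs, which may
   be larger than in_len. Only ASCII and U+00DF are handled here. */
size_t sketch_full_upcase(const uint32_t *in, size_t in_len,
                          uint32_t *out, size_t out_cap) {
    size_t n = 0;
    for (size_t i = 0; i < in_len; i++) {
        uint32_t ch = in[i];
        if (ch == 0x00DF) {
            /* Sharp s expands to two characters: "SS". */
            if (n < out_cap) out[n] = 'S';
            n++;
            if (n < out_cap) out[n] = 'S';
            n++;
        } else {
            if (ch >= 'a' && ch <= 'z')
                ch -= 'a' - 'A';
            if (n < out_cap) out[n] = ch;
            n++;
        }
    }
    return n;
}
```

Upcasing the six code points of "stra" + U+00DF + "e" needs seven
code points of output, which a fixed-length vector of characters
updated in place has no room for.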

Anyhow, it should be reasonably easy to provide limited support for CL
strings, and more general support for some future 'text' type which
can do the job properly.

Regards,
Brian.



