[Ecls-list] make-load-form patch

Thu May 18 05:10:03 UTC 2006

On Thu, 2006-05-18 at 19:19 +0900, Brian Spilsbury wrote:
> On 5/18/06, Juan Jose Garcia Ripoll <lisp at arrakis.es> wrote:
> > On Wed, 2006-05-17 at 11:31 +0900, Brian Spilsbury wrote:
> > > My plan is to add three additional stream types which use iconv for
> > > encoding translation, to extend the permitted range of character, and
> > > discriminate base-char from character by the value, then separate
> > > character and base-char strings.
> >
> > What would be the range for characters? I assume 16 bits is not enough
> > for Chinese, is it? In any case, the 21-bit encoding of UTF-32 will
> > definitely fit in the cl_fixnum type.
> 
> There are 1,114,112 Unicode code points in the full set (U+0000 through
> U+10FFFF), but the official (traditional and simplified) Chinese
> characters all fit into the basic 16 bit set.
> If I read correctly, the current immediate character representation
> has 30 data bits and 2 tag bits, so that shouldn't require
> modification.

No, it should not. And I would also be happier if we keep a single
character type and, as you mention, separate BASE-CHAR and CHARACTER only
by the range.
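
As a rough illustration of a single immediate character type where
BASE-CHAR is told apart from CHARACTER only by the code, here is a minimal
sketch in C. Every name in it (lisp_word, CHAR_TAG, BASE_CHAR_LIMIT and the
helper functions) is hypothetical and not taken from the ECL sources; the
tag pattern and the 8-bit cutoff for BASE-CHAR are assumptions made for the
example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* All names below are hypothetical, not ECL's actual definitions. */
#define CHAR_TAG_BITS   2u
#define CHAR_TAG_MASK   ((1u << CHAR_TAG_BITS) - 1u)
#define CHAR_TAG        0x2u        /* assumed tag pattern for characters */
#define CHAR_CODE_LIMIT 0x110000u   /* code points run from 0 to 0x10FFFF */
#define BASE_CHAR_LIMIT 0x100u      /* assumed cutoff: 8-bit codes are BASE-CHAR */

typedef uintptr_t lisp_word;

/* Pack a code point into an immediate word: code in the high bits, tag in the low two. */
static lisp_word make_char(uint32_t code)
{
    assert(code < CHAR_CODE_LIMIT);
    return ((lisp_word)code << CHAR_TAG_BITS) | CHAR_TAG;
}

/* Recover the code point from an immediate character word. */
static uint32_t char_code(lisp_word w)
{
    return (uint32_t)(w >> CHAR_TAG_BITS);
}

static bool is_char(lisp_word w)
{
    return (w & CHAR_TAG_MASK) == CHAR_TAG;
}

/* BASE-CHAR is not a separate representation, only a sub-range of the codes. */
static bool is_base_char(lisp_word w)
{
    return is_char(w) && char_code(w) < BASE_CHAR_LIMIT;
}
```

With 30 data bits available, the largest code point (0x10FFFF, which needs
21 bits) fits without changing the word layout.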

> Additional string types will be necessary though. I'd suggest 8, and
> 24 bit elements, possibly adding 16 bit later.

24 bit looks problematic and it is not used in any library I know of. If
we use cl_fixnum arrays we can reuse the code for (ARRAY T *).

> > A more controversial issue is how to handle isalpha(), and similar
> > macros. This will depend on the size of the character type. For Windows,
> > I think wchar_t is 16 bit while for Linux wchar_t is 32 bit and there
> > would be no problem translating the character to wchar_t and using the
> > library functions.
> 
> The library functions are problematic in that the structure of wchar_t
> is implementation and locale specific, but you could dispatch that
> way.
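
A minimal sketch of dispatching to the wide-character C library routines
might look like the following. The function name sketch_alpha_via_wchar is
hypothetical. The sketch assumes that a wchar_t value equals the Unicode
code point, which the C standard does not guarantee, and the answer for
non-ASCII codes depends on the current locale.

```c
#include <stdint.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: classify a character code through the C library.
 * Assumes wchar_t values coincide with Unicode code points, which is
 * implementation specific. */
static int sketch_alpha_via_wchar(uint32_t code)
{
    /* Where wchar_t is 16 bits wide, larger codes cannot be passed on. */
    if (code > WCHAR_MAX)
        return 0;
    return iswalpha((wint_t)(wchar_t)code) != 0;
}
```

The range check is what makes the 16 bit case safe: codes that do not fit
in wchar_t are reported as non-alphabetic instead of being truncated.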

We need support for character comparisons, character case conversion and
detecting character type (i.e. alphanumeric or numeric). This could be
implemented as a set of library routines
	ecl_schar()
	ecl_alpha_char_p()
	ecl_alphanumericp()
	ecl_char_upcase()
	ecl_string_upcase()
	ecl_string_downcase()
	...
Having these basic routines, you can remove all references to string
data (such as string.self[...]). This need not be slow: conditionally on
the inclusion of Unicode support they can be redefined as macros that
use the C library routines.
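
To make the idea concrete, here is a minimal sketch in C of how such
routines could hide the string representation. The routine names come from
the list above, but their signatures, the SKETCH_UNICODE flag, the
sketch_char and sketch_string types and the ecl_schar_set helper are all
assumptions made for the example. The Unicode branch is only a placeholder
that knows ASCII; a real build would consult a character database there.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <stdint.h>

#ifdef SKETCH_UNICODE   /* hypothetical flag, not ECL's real configuration macro */
typedef uint32_t sketch_char;       /* wide string elements */
/* Placeholder bodies: ASCII only, everything else passes through unchanged. */
static int ecl_alpha_char_p(sketch_char c)  { return c < 128 && isalpha((int)c) != 0; }
static int ecl_alphanumericp(sketch_char c) { return c < 128 && isalnum((int)c) != 0; }
static sketch_char ecl_char_upcase(sketch_char c)
{
    return c < 128 ? (sketch_char)toupper((int)c) : c;
}
#else
typedef unsigned char sketch_char;  /* 8-bit string elements */
/* Without Unicode the routines collapse into macros over the C library. */
#define ecl_alpha_char_p(c)  (isalpha((unsigned char)(c)) != 0)
#define ecl_alphanumericp(c) (isalnum((unsigned char)(c)) != 0)
#define ecl_char_upcase(c)   ((sketch_char)toupper((unsigned char)(c)))
#endif

/* Hypothetical string object; callers never index self[] directly. */
typedef struct {
    size_t length;
    sketch_char *self;
} sketch_string;

static sketch_char ecl_schar(const sketch_string *s, size_t i)
{
    assert(i < s->length);
    return s->self[i];
}

static void ecl_schar_set(sketch_string *s, size_t i, sketch_char c)
{
    assert(i < s->length);
    s->self[i] = c;
}

/* Written only in terms of the accessors, so it works for either element width. */
static void ecl_string_upcase(sketch_string *s)
{
    for (size_t i = 0; i < s->length; i++)
        ecl_schar_set(s, i, ecl_char_upcase(ecl_schar(s, i)));
}
```

Because ecl_string_upcase() goes through ecl_schar() and ecl_schar_set(),
changing the element type touches only the typedef and the accessors.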

I have been looking around for portable libraries that are small and
support Unicode. The C library itself depends on the locale for all
character operations and it probably cannot be used. The libunicode
library does depend on the C library (!?!) and does not qualify either.
iconv is limited to transformations between encodings, and does not
include character properties. Finally there is ICU and, though it is
heavy (8MB!), it contains everything and can be compiled as a shared
library.

In the Lisp world, CLISP uses its own C routines for Unicode but their
license is not acceptable for ECL. SBCL opted to write the Unicode
database itself and do all the lookup and transformation in Lisp. Note
that they use 32 and 8 bit characters internally. One may learn a lot
about how to use Unicode from Lisp by looking at this implementation.

Regards,

Juanjo