[Ecls-list] Re: base-string patch

Brian Spilsbury brian.spilsbury at gmail.com
Fri May 19 19:20:01 UTC 2006


> I hope I do not seem too picky. My point is the following: I do not like the dispatch mechanism to be based on replacing functions and I want to have the core lisp library available as directly callable C functions.
>  The reason is efficiency: you do not want to go through the lisp calling process for each character and at the same time you want the functions in the core library to have all these standard functions available for them.
>
> At the same time, Unicode prescribes that the 255 first characters are for the Latin script. That would mean the effect of the lisp functions should be predictable at least for base-char and base-string. That would be really desirable because of performance issues: the core gives predictable results on base strings and the case of base strings can be optimized. But perhaps here I lay wrong with my assumption that base strings are "easy".

Unicode avoids the issue of language, delegating it to the application level.

I haven't been clear about what I mean when I say script.

What I mean by script is 'a context for the interpretation of text
attributes', which goes beyond Unicode in that it incorporates
language.

I'd suggest that we consider a 'Common Lisp' script to be the default,
which assumes standard CL conventions, and uses operators which
default arbitrarily for extended-chars.

> 1) Handling of scripts could be done, in my opinion, with an object that encapsulates all functions and which is set in a global variable, say *unicode-script*. This is more dynamical than function redefinition.

It is hard to efficiently encapsulate all functions when you don't
know what the set of functions will be in the future -- are you
suggesting using a hash-table, etc, for every extended-char predicate?
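
To make the hash-table question concrete, here is a minimal sketch of
the "one object holding all functions" approach, written in Python
purely for illustration. The names Script, current_script and
script_call are hypothetical, and current_script only stands in for
the proposed *unicode-script* variable. It shows that every predicate
call becomes a table lookup, and that a predicate added later is
simply missing from scripts written before it existed.

```python
class Script:
    """Hypothetical 'script' object: a named table of character operations."""

    def __init__(self, name, ops=None):
        self.name = name
        self.ops = dict(ops or {})

    def call(self, op, ch):
        # Every call pays for a table lookup before reaching the operation.
        try:
            fn = self.ops[op]
        except KeyError:
            raise LookupError(
                "script %r does not define %r" % (self.name, op))
        return fn(ch)


# Stand-in for the proposed *unicode-script* global (an assumption).
current_script = Script("common-lisp", {
    "upper-case-p": str.isupper,
    "alpha-char-p": str.isalpha,
})


def script_call(op, ch):
    """Dispatch a named character operation through the current script."""
    return current_script.call(op, ch)
```

Asking this script for an operation it was never given, such as
"digit-char-p", raises LookupError, which is the open-ended-set
problem in miniature.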

Another issue is that the use-patterns are highly localized.

A similar problem is font-selection within a font-set in a CJK encoding.

Japanese fonts only encode a few thousand characters, and some of
these are locally simplified forms of traditional Chinese characters.
The Japanese and Chinese characters for the same code-point are also
written with stylistic differences.

When we are rendering Japanese text, we need to prefer the Japanese
style font, except where it cannot render the character, in which case
we need to use a less preferable font.

There are two kinds of case where it cannot render the character.

The first is where there is a conceptual 'hole' in its character set.
For example, it does not have a particular Chinese character, but
we're not switching languages -- in this case, we just go and hunt for
this character as a special case, and maybe cache it.

The second is where there is a conceptual 'switch' between languages.
For example, it does not have any hangeul characters, and these are
disjoint from Chinese and Japanese characters. In this case it can
transfer control back to the selector, which can then pick out a
hangeul font to take control. The hangeul font will then transfer
control back to the selector when it hits a non-hangeul character,
such as a Chinese or Japanese character.

This has two effects -- it makes it much more efficient to render in
the usual case, and it also allows for a reduction of the font
mangling problem -- the hangeul font can decide how to render Roman
characters and Arabic numerals.
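
The hole/switch distinction above can be sketched roughly as follows,
in Python for illustration only. Everything here is an assumption made
for the example: the Font class, the assign_fonts function, and the
crude script_of classifier, which uses the Unicode block ranges for
Hangul syllables, kana and CJK unified ideographs and treats
everything else as "latin". A real font set would consult actual glyph
coverage.

```python
def script_of(ch):
    """Very rough script classifier by code-point range (illustrative)."""
    cp = ord(ch)
    if 0xAC00 <= cp <= 0xD7A3:
        return "hangul"
    if 0x3040 <= cp <= 0x30FF:
        return "kana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    return "latin"


class Font:
    """Hypothetical font: the scripts it claims, minus known holes."""

    def __init__(self, name, scripts, missing=()):
        self.name = name
        self.scripts = set(scripts)
        self.missing = set(missing)

    def claims(self, ch):
        return script_of(ch) in self.scripts

    def has_glyph(self, ch):
        return self.claims(ch) and ch not in self.missing


def assign_fonts(text, fonts):
    """Pair each character with a font name; fonts is in preference order."""
    hole_cache = {}
    current = None
    out = []
    for ch in text:
        if current is None or not current.claims(ch):
            # 'Switch': control returns to the selector, which picks the
            # most preferred font claiming this character's script.
            current = next((f for f in fonts if f.claims(ch)), None)
            if current is None:
                out.append((ch, None))
                continue
        if current.has_glyph(ch):
            out.append((ch, current.name))
        else:
            # 'Hole': hunt for this one character elsewhere, cache the
            # answer, and leave the current font in control.
            if ch not in hole_cache:
                hole_cache[ch] = next(
                    (f.name for f in fonts if f.has_glyph(ch)), None)
            out.append((ch, hole_cache[ch]))
    return out
```

With a Japanese font listed first, a Chinese font second and a hangeul
font third, a run of Japanese text stays in the Japanese font, a
single missing character is borrowed from the Chinese font, and a
hangeul character hands control to the hangeul font, which then also
renders any Roman characters that follow it.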

The simple solution has some issues which need control sequences to
delimit, but that's not so relevant to our actual problem.

In our case, we lack the issue of style, so it's easy for us to select
(for example) a Vietnamese script which handles the capitalization of
diacritically marked characters as well as base Latin characters.

> 2) If you want to have scripts redefine functions then I would vote for having a set of functions like ext:unicode-string-upcase, ext:unicode-alpha-num-p, etc, which are the ones that are redefined and which are called by the core library (string-upcase, etc) when the input is not a base-string or a base-char.

Certainly, except that I'd suggest script:string-upcase,
script:alpha-num-p, etc.

Since these functions need to handle the base-char cases too, users
can just make a package which shadows the CL versions if they want.
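
A rough sketch of that arrangement, in Python for illustration only:
the core string_upcase keeps a fast path for base strings and hands
anything else to whichever script function is currently installed. All
names here are hypothetical, and BASE_LIMIT = 128 is an assumption
chosen to keep the example small, not a statement about ECL's
base-char. The Vietnamese script simply borrows Python's own Unicode
case mapping as a stand-in.

```python
BASE_LIMIT = 128  # assumption: code points below this count as base chars


def base_upcase(s):
    """Fast path: only a-z change, so no script lookup is needed."""
    return "".join(chr(ord(c) - 32) if "a" <= c <= "z" else c for c in s)


def default_script_upcase(s):
    """'Common Lisp' default script: upcase base chars, leave the rest."""
    return "".join(base_upcase(c) if ord(c) < BASE_LIMIT else c for c in s)


def vietnamese_script_upcase(s):
    """Stand-in Vietnamese script using Python's Unicode case mapping."""
    return s.upper()


# The replaceable script-level function (hypothetical script:string-upcase).
_hooks = {"string-upcase": default_script_upcase}


def install_script(upcase_fn):
    """Select a script by installing its string-upcase implementation."""
    _hooks["string-upcase"] = upcase_fn


def string_upcase(s):
    """Core entry point: base strings never reach the script function."""
    if all(ord(c) < BASE_LIMIT for c in s):
        return base_upcase(s)
    return _hooks["string-upcase"](s)
```

Because each script function also handles base characters correctly,
it could be called directly in place of the core function, which is
what shadowing the CL versions in a package would amount to.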

Unfortunately threading may complicate this scheme -- I'll have to
investigate the current model and think about it some more.

Regards,
Brian.



