[pro] [Q] unicode support

Nikodemus Siivola nikodemus at random-state.net
Sat Sep 29 12:28:34 UTC 2012


On 26 September 2012 20:23, Robert Smith <quad at symbo1ics.com> wrote:
> I think it might be worthwhile to look at Unicode beyond just seeing
> if files can be encoded as UTF-8.

> The concept of "unicode support" is pretty loaded. What does it mean?
> Does unicode support mean that one can operate on strings stored in a
> particular fashion? Does it mean functions like LENGTH handle
> overlaying characters correctly (e.g., any character plus a circumflex
> overlaying character... does that have length 1 or 2?)? Do the
> printers support stuff like right-to-left printing?

I think the CL standard is pretty clear on what LENGTH does: it counts
the elements of the sequence, so Unicode doesn't come into it, /unless/
you happen to be on an implementation that supports custom sequence
types and have defined one that understands combining characters.
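
To make that concrete, here is a minimal sketch. It assumes the
implementation's char codes are Unicode code points and that
CHAR-CODE-LIMIT is above #x302; the function name is made up for
illustration.

```lisp
;; Sketch: LENGTH counts CHARACTER objects, not user-perceived characters.
;; Assumption: code #x302 is U+0302 COMBINING CIRCUMFLEX ACCENT, which holds
;; only where char codes are Unicode code points.
(defun combining-length-demo ()
  "Return the LENGTH of \"e\" followed by a combining circumflex, or NIL
if this implementation has no character with code #x302."
  (let ((circumflex (and (< #x302 char-code-limit)
                         (code-char #x302))))
    (when circumflex
      ;; Two elements go in, so LENGTH reports two, even though the pair
      ;; would normally be displayed as a single accented letter.
      (length (coerce (list #\e circumflex) 'string)))))
```

Where the character exists, this returns 2, because the string has two
elements regardless of how they are rendered.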

The only place where the standard really hooks into Unicode is external
formats. Most (all?) of the tricky Unicode stuff should IMO be
separate functions, instead of introducing subtleties to the standard
ones.
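
As a rough sketch of where external formats show up: the same string
can have a different size in octets than in characters. The function
name below is hypothetical, and :UTF-8 as an external format designator
is an assumption, since the standard only guarantees :DEFAULT (several
implementations do accept :UTF-8).

```lisp
;; Sketch: count the octets a string occupies on disk when written as UTF-8.
;; Assumption: the implementation accepts :UTF-8 for :EXTERNAL-FORMAT.
(defun utf-8-octet-count (string pathname)
  "Write STRING to PATHNAME as UTF-8 and return the file size in octets."
  (with-open-file (out pathname :direction :output
                                :if-exists :supersede
                                :external-format :utf-8)
    (write-string string out))
  ;; Reopen as a binary stream so FILE-LENGTH is measured in octets.
  (with-open-file (in pathname :element-type '(unsigned-byte 8))
    (file-length in)))
```

A one-character string holding U+00E9 has LENGTH 1 but needs two
octets in UTF-8, so the returned count should exceed the LENGTH of the
string. Whether a BOM is also written depends on the implementation.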

I think some crucial questions are:

* What is CHAR-CODE-LIMIT?

* Are there holes in the char-code range?

* Which external formats are supported?

* Can strings contain arbitrary codepoints, or only things that
represent fully-fledged characters? (Can UTF-8b be supported?)

* Can users define new external formats?

* Are multiple line-ending conventions supported?

* How is the BOM (byte order mark) handled?

* Are the character names there?

* Is the Unicode database, which the implementation needs to have
anyway, accessible via a documented API?

* Is everything that should be O(1) O(1), or are some things O(N) with Unicode?

* Are there multiple string representations? (E.g., one for the 0-255
range, one for the full char-code range.)
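
A few of these questions can be probed with standard operators alone.
The following is only a sketch: the function name and plist keys are
invented here, and using #xD800 (a UTF-16 surrogate) as a sample "hole"
assumes that char codes are Unicode code points.

```lisp
;; Sketch: inspect some character and string properties portably.
(defun unicode-probe ()
  "Return a plist describing a few Unicode-related implementation traits."
  (let ((surrogate (and (< #xD800 char-code-limit) (code-char #xD800)))
        (e-acute (and (< #xE9 char-code-limit) (code-char #xE9))))
    (list :char-code-limit char-code-limit
          ;; NIL means no character exists for the surrogate code, which
          ;; suggests holes in the range (or a limit at or below #xD800).
          :surrogate-is-character (and surrogate t)
          ;; CHAR-NAME may return NIL for graphic characters; the names
          ;; themselves are implementation-defined.
          :name-of-e-acute (and e-acute (char-name e-acute))
          ;; If CHARACTER is a subtype of BASE-CHAR, the two types
          ;; coincide and a single string representation is likely.
          :character-is-base-char (subtypep 'character 'base-char)
          :base-string-type (type-of (make-string 1 :element-type 'base-char
                                                    :initial-element #\a))
          :full-string-type (type-of (make-string 1 :element-type 'character
                                                    :initial-element #\a)))))
```

Comparing the last two entries gives a hint about whether strings of
BASE-CHAR and strings of CHARACTER are stored differently. Questions
about external formats, line endings, and O(1) access still need the
implementation's documentation.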

Cheers,

 -- Nikodemus



