[pro] CDR for UTF* and UNICODE handling in :element-type (Re: write-char vs. 8-bit bytes)

Steve Haflich shaflich at gmail.com
Sat Apr 12 04:12:51 UTC 2014


There is very little for a substandard to specify without overreaching (or
unnecessarily duplicating) other, more universal specifications.

Back when X3J13 spent a lot of time considering I18N, Unicode didn't yet
exist.  Unicode has been a big success in dealing with a very difficult
problem, and had existed, X3J13 probably would have specified it (if an
implementation chooses to support more than the set of basic chars) just as
Java did years later.  Further, UTF-8 didn't yet exist, but if it had,
implementations and perhaps even X3J13 would have adopted it as a default.
 But that's the history of a different universe, and there exist some
historical implementations that have character code points different from
Unicode.

Every Common Lisp implementation implements a character set and maps those
characters onto nonnegative integer code points.  That mapping is not
specified, although Unicode (or perhaps just its intersection with
ASCII) would be the sane choice in modern times.  But this has nothing to
do with external formats.  Unicode does not define externalization formats
-- it defines _only_ the mapping of zillions of characters (most not
yet assigned) onto nonnegative integer code points in the range of 21 bits.
 It can and does do this without the blessing of the Lisp community.
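
For illustration only (the exact values beyond the standard characters are
implementation-dependent, and nothing here is required by the ANS), that
mapping is what CHAR-CODE and CODE-CHAR expose; on a Unicode-based
implementation the numbers happen to be Unicode code points:

  (char-code #\A)      ; => 65 on any ASCII/Unicode-based implementation
  (code-char 955)      ; => a character for U+03BB on a Unicode-based Lisp,
                       ;    possibly NIL or something else elsewhere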

UTF-8 defines a mapping of Unicode code points onto a sequence of octets.
 It was originally defined to encode arbitrary 31-bit
nonnegative integers as sequences of 1 to 6 octets, but it was
subsequently tied more closely to Unicode: it is now defined only over the
21-bit Unicode range, and certain code points (e.g. the surrogates)
are defined to be errors.  (Much of this is explained understandably
on the Wikipedia UTF-8 page.)  So, UTF-8 is well defined and can work
without the blessing of the Lisp community.
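
As a rough sketch of what that definition amounts to (mine alone, not
anything an implementation is obliged to provide under this or any name),
here is a function that turns one code point into UTF-8 octets, rejecting
the surrogate range and anything above the 21-bit limit:

  (defun code-point-to-utf-8 (cp)
    "Return a list of octets encoding the Unicode code point CP in UTF-8.
  Signals an error for surrogates and values above #x10FFFF, per the
  current (RFC 3629) definition of UTF-8."
    (cond ((<= #xD800 cp #xDFFF)
           (error "Surrogate code point ~X is not encodable in UTF-8." cp))
          ((> cp #x10FFFF)
           (error "Code point ~X is outside the 21-bit Unicode range." cp))
          ((< cp #x80) (list cp))                         ; 1 octet: plain ASCII
          ((< cp #x800)                                   ; 2 octets
           (list (logior #xC0 (ash cp -6))
                 (logior #x80 (logand cp #x3F))))
          ((< cp #x10000)                                 ; 3 octets
           (list (logior #xE0 (ash cp -12))
                 (logior #x80 (logand (ash cp -6) #x3F))
                 (logior #x80 (logand cp #x3F))))
          (t                                              ; 4 octets
           (list (logior #xF0 (ash cp -18))
                 (logior #x80 (logand (ash cp -12) #x3F))
                 (logior #x80 (logand (ash cp -6) #x3F))
                 (logior #x80 (logand cp #x3F))))))

  ;; (code-point-to-utf-8 (char-code #\A))  => (65)
  ;; (code-point-to-utf-8 #x20AC)           => (226 130 172)  ; EURO SIGN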

So, if an implementation supports UTF-8 as an external format, it ought
translate whatever it uses for its internal code points into UTF-8 (which
represents, of course, Unicode code points).  Those internal code points
are not the business of any specification, and the UTF-8 translation is
already well defined by the Unicode and UTF-8 standards.

What's left?  Well, there is a little that could still be productively
substandardificated.  Specifically, the ANS punts nearly completely on what
can be used as the value of an :external-format argument.  So
quasi-portable code can't know what to specify if it wants to join the
modern computing community and read/write UTF-8.  I think the obvious
answer is to draft a substandard for a convention of :keyword names which
an implementation ought support for portability.  (Allegro does this, and
I'd be happy to provide a list of the many encodings and ef names that have
been supported for decades.)  The most important one is of course :UTF-8,
but it would be worth semistandardizing it along with the many ISO8859-nn
encodings plus the several traditionally popular Japanese and Chinese
encodings.  All these
encodings are rapidly falling out of usage, but there are historical web
pages and other sources that Common Lisp ought be able to internalize (for
those implementations that think this is important).
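
To make the gap concrete, here is the sort of call portable code would like
to be able to write (the file name is made up, and today both the :UTF-8
keyword and the set of accepted :external-format values are whatever the
implementation chooses -- which is exactly what such a substandard would pin
down):

  (with-open-file (in "legacy-page.html"
                      :direction :input
                      :external-format :utf-8)  ; name is implementation-chosen today
    (read-line in))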

Other than external format naming, I can't think of anything that Common
Lisp needs to standardize.  Yes, the language would have been a better
programming-ecology citizen if code points were defined as Unicode, but
that would be backward incompatible.