[Ecls-list] Unicode: uncomfortable situation

Tue Jan 25 14:59:10 UTC 2011

On Mon, Jan 24, 2011 at 11:52 AM, Matthew Mondor
<mm_lists at pulsar-zone.net> wrote:

> I guess that another possibility (which could offer a less complex and
> more efficient interface than the SBCL way) would be for ECL to
> automatically and transparently return invalid UTF-8 sequence octets as
> remapped in an unassigned range, and also at UTF-8 output transparently
> output characters of that range back as the literal octets,
> while still letting the application potentially deal with that range of
> "characters" as it wants, as long as it's documented...

This sounds very much like the "UTF-8B" encoding frequently proposed
to deal with these problems, although I'm unable to locate a formal
specification of it on the web. Offhand, I'm not aware of any
implementation supporting UTF-8B as an external format (although Babel
can convert to/from it). This is unfortunate, as it would be the
preferred default external format for many uses (terminal I/O,
directory listings, reading text files...) were it available. The
essential property of any such scheme is that decoding and then
re-encoding arbitrary binary data be an identity function.
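
As a rough illustration of that round-trip property (not ECL code, and
not a formal definition of UTF-8B): Python's "surrogateescape" error
handler is one existing scheme of this kind. It maps each undecodable
octet to a lone surrogate in U+DC80..U+DCFF and writes it back as the
original octet on output. The helper names decode_utf8b and
encode_utf8b are made up for this sketch.

```python
import os

def decode_utf8b(data: bytes) -> str:
    # Valid UTF-8 decodes normally; every invalid octet 0xNN becomes
    # the lone surrogate U+DCNN instead of signalling an error.
    return data.decode("utf-8", errors="surrogateescape")

def encode_utf8b(text: str) -> bytes:
    # Inverse mapping: escaped surrogates are written back as the
    # original raw octets, everything else as ordinary UTF-8.
    return text.encode("utf-8", errors="surrogateescape")

# Mixed input: valid UTF-8 ("café"), two stray octets, and a
# truncated three-octet sequence at the end.
raw = b"caf\xc3\xa9 \xff\xfe broken \xe2\x82"
text = decode_utf8b(raw)

assert text[:4] == "café"          # valid part is ordinary text
assert text[5] == "\udcff"         # stray octet 0xFF was remapped
assert encode_utf8b(text) == raw   # decode + encode is the identity

# The same holds for arbitrary binary data.
for _ in range(1000):
    blob = os.urandom(64)
    assert encode_utf8b(decode_utf8b(blob)) == blob
```

The application still sees the remapped range as ordinary (if odd)
characters and can treat them however it likes.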

On the other hand, I've elsewhere taken to abusing Latin-1 to store
arbitrarily encoded text. Plain old (unsigned-byte 8) vectors would be
better still (and often more compact, due to certain implementation
choices regarding BASE-CHARs), but aren't convenient to mix with
character output on existing CL streams. In these applications, I'm
not certain I'd prefer UTF-8B even if it were available - the
decoding/encoding effort would be redundant, and I wouldn't tolerate
any measurable performance degradation when there's no user-visible
benefit. It's unfortunate that manipulating text in its encoded form
is incompatible with CL string semantics (for some encodings, anyway).
A modern text processing library for CL might prefer operating on
streams (wrapping raw encoded data) and concatenations of streams,
extending the sequence functions (as supported by SBCL) to manipulate
them, but that's drifting off topic.
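
For comparison, the Latin-1 trick described above can be sketched as
follows (again in Python rather than CL, with hypothetical helper
names). Latin-1 maps octets 0..255 one-to-one onto code points 0..255,
so any octet sequence survives a round trip unchanged and no real
decoding work is done; the price is that the resulting "string" holds
encoded data, not text.

```python
def bytes_as_latin1(data: bytes) -> str:
    # Each octet becomes the character with the same code; never fails.
    return data.decode("latin-1")

def latin1_as_bytes(carrier: str) -> bytes:
    # Exact inverse, provided the string only holds codes 0..255.
    return carrier.encode("latin-1")

raw = "naïve".encode("utf-8") + b"\xff"   # UTF-8 text plus a stray octet
carrier = bytes_as_latin1(raw)

assert len(carrier) == len(raw)           # one character per octet
assert [ord(c) for c in carrier] == list(raw)
assert latin1_as_bytes(carrier) == raw    # lossless round trip

# The carrier is still *encoded*: string operations see octets, so the
# two-octet UTF-8 form of "ï" counts as two characters here.
assert len(bytes_as_latin1("naïve".encode("utf-8"))) == 6
```

The last assertion shows the incompatibility with string semantics
mentioned above: lengths and indices refer to octets, not characters.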

For plain UTF-8 external formats, it sounds like ECL should support the
condition/restart approach that other systems provide for handling
invalid byte sequences, but this is a very low-level solution and
personally I'd hope never to find myself using it. Most real-world
data shouldn't be expected to satisfy a strict UTF-8 decoder, and it'd
be unfortunate if every program were forced to improvise its own
solution to the problem.
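
To make the "every program improvises its own solution" point
concrete, here is a loose analogy in Python. Python has no restarts,
but its codec error handlers are similarly called at the point of
failure and tell the decoder how to resume. The handler name
"hex-marker" and its replacement policy are invented for this sketch
and do not correspond to any ECL or SBCL interface.

```python
import codecs

def replace_with_hex(err):
    # Called by the decoder when it hits an invalid sequence. Returns
    # the replacement text and the position at which to resume.
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(f"<{b:02X}>" for b in bad), err.end

codecs.register_error("hex-marker", replace_with_hex)

raw = b"ok \xff end"

# A strict decoder simply refuses the input...
strict_failed = False
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    strict_failed = True

# ...so each program ends up choosing its own recovery policy.
lenient = raw.decode("utf-8", errors="hex-marker")

assert strict_failed
assert lenient == "ok <FF> end"
```

Unlike the round-tripping scheme discussed earlier, a policy like this
one is lossy: the original octets cannot be recovered reliably from
the decoded string.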