[babel-devel] Changes

Tue Apr 14 15:14:40 UTC 2009

On Tue, Apr 14, 2009 at 10:17 AM, David Lichteblau <david at lichteblau.com> wrote:
> In Allegro CL and LispWorks, the situation is very different.  They use
> UTF-16 to represent Lisp strings in memory, so surrogates aren't just
> forbidden in Lisp strings, user code actually needs to work with
> surrogates to be able to use all of Unicode.

My understanding was that they use UCS-2, i.e., they are limited to
the BMP. AFAICT, their external formats don't produce surrogates in
Lisp strings. (ACL doesn't. Didn't test Lispworks but its
documentation specifically mentions the BMP and UCS-2.) They don't
seem to have functions to deal with surrogates either.
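
For anyone who wants the UCS-2 vs. UTF-16 distinction spelled out, here is a rough sketch in Python (not Lisp, and not taken from any of the implementations discussed). The helper name `to_utf16_units` is made up for illustration. It shows that a code point outside the BMP needs two 16-bit code units in UTF-16, which a UCS-2 string simply has no way to express.

```python
def to_utf16_units(code_point):
    """Return the 16-bit code units UTF-16 uses for one code point.

    Hypothetical helper for illustration only. It does not reject
    code points that are themselves surrogates (U+D800..U+DFFF).
    """
    if code_point < 0x10000:
        # Inside the BMP: one code unit, identical to UCS-2.
        return [code_point]
    # Outside the BMP: split the 20-bit offset into a surrogate pair.
    offset = code_point - 0x10000
    high = 0xD800 | (offset >> 10)    # high (leading) surrogate
    low = 0xDC00 | (offset & 0x3FF)   # low (trailing) surrogate
    return [high, low]


bmp_units = to_utf16_units(0x00E9)      # U+00E9, inside the BMP
astral_units = to_utf16_units(0x1D11E)  # U+1D11E, outside the BMP
```

A Lisp whose strings hold raw UTF-16 code units would expose both halves of the pair to user code as separate "characters"; a UCS-2 Lisp can only represent the first case.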

> SBCL has 21 bit characters like CCL and currently has characters for the
> surrogate code points.  But I am not aware of any consensus that this is
> the right thing to do.  Personally I think it's a bug and SBCL should be
> changed to do it like CCL.

I don't feel that strongly about either option. (I feel it's more
important that the various Lisps agree on one of them.) That said, so
far I haven't heard any compelling arguments in favor of having
CODE-CHAR return NIL for such code points.

> As far as I understand, the only Lisp with 21 bit characters whose
> author thinks that SBCL's behaviour is correct is ECL, but I failed to
> understand the reasoning behind that when it was being discussed
> on comp.lang.lisp.

[Assuming you're referring to the "Unicode and Common Lisp" thread.
<ab100561-286c-4570-aabc-72fd877f22ae at v18g2000pro.googlegroups.com>
<http://groups.google.com/group/comp.lang.lisp/browse_thread/thread/97ff103aee76ada2>]

I couldn't find where Juanjo argued for or against CODE-CHAR returning
NIL for surrogates. His (somewhat unrelated) main point, IIUC, is that
you shouldn't try to support full Unicode when all you have is 16-bit
characters and instead restrict Unicode handling to the BMP. (U+0000
through U+FFFF.)

> (As a side note, I find it a huge hassle to write code portable between
> the Lisp implementations with Unicode support.  For CXML, I needed read
> time conditionals checking for UTF-16 Lisps.  And it still doesn't
> actually work, because most of the other free libraries like Babel,
> CL-Unicode, and in turn, CL-PPCRE, expect 21 bit characters and are
> effectively broken on Allegro and LispWorks.)

I would argue that you're trying too hard. Would you support UTF-8 on
CMUCL too? (Yes, I am aware that Unicode support for CMUCL is imminent.
Using actual UTF-16... With surrogates... *sigh*) Just ignore/replace
code points equal to or greater than CHAR-CODE-LIMIT.
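
A minimal sketch of the "replace" option, in Python rather than Lisp and only as an illustration of the idea. The function name and the `char_code_limit` parameter are assumptions standing in for whatever CHAR-CODE-LIMIT is on a given Lisp; this is not how Babel or CXML actually do it.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def clamp_to_char_code_limit(text, char_code_limit=0x10000,
                             replacement=REPLACEMENT):
    """Replace every character whose code point is >= char_code_limit.

    char_code_limit plays the role of CHAR-CODE-LIMIT: 0x10000 models a
    Lisp with 16-bit characters, 0x100 one with 8-bit characters.
    """
    return "".join(ch if ord(ch) < char_code_limit else replacement
                   for ch in text)


# 'a', U+1D11E (outside the BMP), 'b', decoded from UTF-8.
decoded = b"a\xf0\x9d\x84\x9eb".decode("utf-8")
clamped = clamp_to_char_code_limit(decoded)
```

The caller loses the out-of-range character but gets a visible marker in its place, and the rest of the string is untouched.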

Does that mean CXML won't pass the test suites for Allegro and
Lispworks? So be it. If enough Allegro or Lispworks customers have the
need to deal with characters outside the BMP, they'll complain and
it'll be fixed. Duane Rettig hints in that c.l.l thread that Allegro
might support full 21-bit Unicode characters in the future. (But
perhaps I'm being too optimistic. Perhaps you really really need to
use CXML+Lispworks/Allegro along with characters outside the BMP? Then
ignore what I said above, I guess.)

My plan for Babel was not to assume 21-bit characters, but to punt on
characters above the CHAR-CODE-LIMIT. (It doesn't do that properly at
the moment. That's definitely a bug that will have to be fixed.)

Would you argue that it'd be better for Babel to instead use UTF-16 on
Lispworks/Allegro? (Not a rhetorical question; if that turns out to be
a good idea I'd change Babel in that direction.) What about UTF-8 for
Lisps with 8-bit characters? I suspect that restricting oneself to a
subset of Unicode is more robust and more manageable for portable
programs.

> While I have no ideas regarding UTF-8b, I think it worth pointing out
> that for the important use case of file names, there is a different way of
> achieving a round-trip involving file names in "might be UTF-8" format.
>
> The idea is to interpret invalid UTF-8 bytes in Latin 1, but prefix them
> with the code point 0.
>
> On encoding back to a file name, such null characters would be stripped
> again.

That sounds like a worse hack than UTF-8B because if you convert such
a string into another encoding you'll get bogus characters with no
indication of error instead of, say, replacement characters. (That
seems to be a big advantage of representing invalid bytes as invalid
characters. Doesn't that make sense?)
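
To illustrate the difference, here is a rough Python sketch. Python's built-in "surrogateescape" error handler is its own take on UTF-8b (each undecodable byte 0xNN becomes the lone surrogate U+DCNN). The "nul-latin1" handler below is a hypothetical stand-in for the NUL-prefix scheme, written from the description above and not from any existing implementation.

```python
import codecs


def _nul_latin1(err):
    """Hypothetical handler: NUL-prefix each invalid byte, read as Latin-1."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join("\x00" + chr(b) for b in bad), err.end


codecs.register_error("nul-latin1", _nul_latin1)

raw = b"caf\xe9"  # Latin-1 bytes; the trailing 0xE9 is invalid as UTF-8

utf8b = raw.decode("utf-8", "surrogateescape")  # invalid byte -> U+DCE9
nul_hack = raw.decode("utf-8", "nul-latin1")    # invalid byte -> NUL + U+00E9

# Both schemes can get the original bytes back; shown here for UTF-8b.
round_trip = utf8b.encode("utf-8", "surrogateescape")


def converts_silently(text, encoding="utf-16-le"):
    """True if strict conversion to another encoding raises no error."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


utf8b_silent = converts_silently(utf8b)        # lone surrogate is rejected
nul_hack_silent = converts_silently(nul_hack)  # NUL + e-acute passes through
utf8b_replaced = utf8b.encode("utf-16-le", "replace").decode("utf-16-le")
```

With the surrogate representation, a strict conversion fails loudly and a lenient one yields a replacement character. With the NUL-prefix representation, the converted output contains a stray NUL and a guessed Latin-1 character, and nothing marks either as suspect.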

-- 
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/