[babel-devel] [Openmcl-devel] Changes

Luís Oliveira luismbo at gmail.com
Sun Apr 12 21:46:19 UTC 2009


On Sun, Apr 12, 2009 at 9:35 PM, Gary Byers <gb at clozure.com> wrote:
> I wasn't able to find a formal definition of UTF-8B anywhere; the
> informal descriptions that I saw suggested that it's a way of
> embedding binary data in UTF-8-encoded character data, with the binary
> data encoded in the low 8 bits of 16-bit codes whose high 8 bits
> contained #xdc.

IIUC, UTF-8B is meant as a way of converting arbitrary bytes that are
*probably* UTF-8 into a Unicode string in such a way that the original
byte sequence can be reconstructed exactly later on.

The "spec" for UTF-8B is in this email message from Markus Kuhn:
  <http://mail.nl.linux.org/linux-utf8/2000-07/msg00040.html>
which I should have mentioned when I first brought this up. (Sorry, again.)
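
To make the mapping concrete, here is a minimal sketch of how I read
that message: each byte that isn't part of well-formed UTF-8 (so it is
always in the range #x80..#xFF) is turned into the code #xDC00 + byte.
ESCAPE-OCTET and ESCAPED-OCTET are names I made up for illustration;
they aren't part of Babel or of any implementation.

```lisp
;; Hypothetical helpers, for illustration only.

(defun escape-octet (octet)
  "Map an undecodable OCTET (#x80..#xFF) to a code in #xDC80..#xDCFF."
  (logior #xDC00 octet))

(defun escaped-octet (code)
  "Return the original octet if CODE is an escaped byte, else NIL."
  (when (<= #xDC80 code #xDCFF)
    (ldb (byte 8 0) code)))
```

Since well-formed UTF-8 never decodes to a code in that range, the two
directions can't be confused with each other.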


> (defun decode-utf-8b-stream (stream)
>  (do* ((code (read-utf-8-code stream) (read-utf-8-code stream)))
>       ((null code)) ; eof
>    (if (eql *utf-8b-binary-marker-value*
>             (ldb *utf-8b-binary-marker-byte* code))
>       (process-binary (ldb *utf-8b-binary-data-byte* code))
>       (process-character (or (code-char code) *replacement-character*)))))
>
> Isn't that the basic idea, whether the details/parameters are right
> or not ?

That would work. But it certainly seems much more convenient to use
Lisp strings directly. I'll try to illustrate that with a concrete
example.

These days, Unix pathnames are usually encoded in UTF-8, but IIUC they
can really be almost any sequence of bytes -- or at least that seems
to be the case on Linux.

Suppose I were implementing a directory browser in Lisp. If I could use
UTF-8B to convert unix pathnames into Lisp strings, it'd be
straightforward to use Lisp pathnames, pass them around, manipulate
them with the standard string and pathname functions, and still be
able to access the respective files through syscalls later on. In this
scenario, my program wouldn't have trouble handling badly formed UTF-8
or other binary junk.

The same applies to environment variables, command line arguments, and so on.
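
Here is a rough sketch of the round trip I have in mind, under a few
assumptions. All three function names are hypothetical. The decoder
returns a list of integer codes rather than a Lisp string, because an
implementation may not have characters for the codes in #xDC80..#xDCFF.
Bytes that don't belong to a well-formed UTF-8 sequence are escaped as
#xDC00 + byte, and the encoder turns exactly those codes back into the
original bytes.

```lisp
;;; Sketch only: UTF-8-SEQUENCE-AT, UTF-8B-DECODE and UTF-8B-ENCODE are
;;; hypothetical names, not part of Babel or of any implementation.

(defun utf-8-sequence-at (octets i)
  "Return the code point and length of the well-formed UTF-8 sequence
starting at index I of OCTETS, or NIL if there isn't one."
  (let ((b0 (aref octets i)))
    (multiple-value-bind (extra code min)
        (cond ((< b0 #x80)       (values 0 b0 0))
              ((<= #xC2 b0 #xDF) (values 1 (ldb (byte 5 0) b0) #x80))
              ((<= #xE0 b0 #xEF) (values 2 (ldb (byte 4 0) b0) #x800))
              ((<= #xF0 b0 #xF4) (values 3 (ldb (byte 3 0) b0) #x10000))
              (t (values nil nil nil)))
      (when (and extra (< (+ i extra) (length octets)))
        ;; Fold in the continuation bytes (each must look like 10xxxxxx).
        (loop for j from (1+ i) to (+ i extra)
              for b = (aref octets j)
              do (if (= (ldb (byte 2 6) b) #b10)
                     (setf code (logior (ash code 6) (ldb (byte 6 0) b)))
                     (return-from utf-8-sequence-at nil)))
        ;; Reject overlong forms, encoded surrogates and codes > #x10FFFF.
        (when (and (<= min code #x10FFFF)
                   (not (<= #xD800 code #xDFFF)))
          (values code (1+ extra)))))))

(defun utf-8b-decode (octets)
  "Decode OCTETS into a list of codes, escaping undecodable bytes."
  (let ((codes '())
        (i 0))
    (loop while (< i (length octets))
          do (multiple-value-bind (code len) (utf-8-sequence-at octets i)
               (cond (code (push code codes)
                           (incf i len))
                     (t (push (+ #xDC00 (aref octets i)) codes)
                        (incf i)))))
    (nreverse codes)))

(defun utf-8b-encode (codes)
  "Encode CODES back into octets, restoring escaped bytes as they were."
  (let ((out '()))
    (dolist (code codes)
      (cond ((<= #xDC80 code #xDCFF)      ; escaped byte: emit it raw
             (push (ldb (byte 8 0) code) out))
            ((< code #x80)
             (push code out))
            ((< code #x800)
             (push (logior #xC0 (ash code -6)) out)
             (push (logior #x80 (ldb (byte 6 0) code)) out))
            ((< code #x10000)
             (push (logior #xE0 (ash code -12)) out)
             (push (logior #x80 (ldb (byte 6 6) code)) out)
             (push (logior #x80 (ldb (byte 6 0) code)) out))
            (t
             (push (logior #xF0 (ash code -18)) out)
             (push (logior #x80 (ldb (byte 6 12) code)) out)
             (push (logior #x80 (ldb (byte 6 6) code)) out)
             (push (logior #x80 (ldb (byte 6 0) code)) out))))
    (coerce (nreverse out) '(vector (unsigned-byte 8)))))

;; A file name "caf?.txt" where the ? is the lone Latin-1 byte #xE9,
;; which is not valid UTF-8 on its own.
(let ((name (coerce '(99 97 102 #xE9 46 116 120 116)
                    '(vector (unsigned-byte 8)))))
  ;; The #xE9 byte decodes to the code #xDCE9, and encoding the
  ;; decoded codes gives back the same eight bytes.
  (equalp (utf-8b-encode (utf-8b-decode name)) name))
```

Rejecting UTF-8-encoded surrogates in the decoder is what keeps the
round trip exact: a code in #xDC80..#xDCFF can then only have come
from an escaped byte.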

Does any of that make sense?

-- 
Luís Oliveira
http://student.dei.uc.pt/~lmoliv/



