[babel-devel] [Openmcl-devel] Changes
Gary Byers
gb at clozure.com
Sun Apr 12 20:35:38 UTC 2009
On Sun, 12 Apr 2009, Luís Oliveira wrote:
>
> Again, I'm curious how UTF-8B might be implemented when CODE-CHAR
> returns NIL for #xDC80 through #xDCFF.
>
Let's assume that we have something that reads a sequence of 1 or more
UTF-8-encoded bytes from a stream (and that we have variants that do
the same for bytes in foreign memory, a lisp vector, etc.) If it
gets an EOF while trying to read the first byte of a sequence, it
returns NIL; otherwise, it returns an unsigned integer less than
#x110000. If it can tell that a sequence is malformed (overlong,
whatever), it returns the CHAR-CODE of the Unicode replacement
character (#xfffd); it does not reject encoded values that correspond
to UTF-16 surrogate code points or other non-character code points.
(defun read-utf-8-code (stream)
  "Try to read 1 or more octets from STREAM.  Return NIL if EOF is
encountered when reading the first octet; otherwise, return an
unsigned integer less than #x110000.  If a malformed UTF-8 sequence
is detected, return the character code of #\Replacement_Character;
otherwise, return the encoded value."
  (let* ((b0 (read-byte stream nil nil)))
    (when b0
      (if (< b0 #x80)
        b0
        (if (< b0 #xc2)
          (char-code #\Replacement_Character)
          (let* ((b1 (read-byte stream nil nil)))
            (if (null b1)
              (char-code #\Replacement_Character)
              ;; [Lots of other details to get right, not shown]
              )))))))
This (or something very much like it) has to exist in order to support
UTF-8; the elided details are surprisingly complicated (if we want
to reject malformed sequences).
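For what it's worth, here's a sketch (not CCL's actual code) of how the
elided two-byte case might be filled in; DECODE-TWO-BYTE-TAIL is a
hypothetical helper name, and the three- and four-byte cases, overlong
and surrogate checks, etc. are more of the same:

```lisp
;; Hypothetical helper: decode the rest of a two-byte sequence whose
;; lead byte B0 is in #xc2-#xdf.  B1 is the next byte read from the
;; stream, or NIL on EOF.  Continuation bytes must be in #x80-#xbf.
(defun decode-two-byte-tail (b0 b1)
  (if (or (null b1)                  ; EOF in mid-sequence
          (not (<= #x80 b1 #xbf)))   ; not a continuation byte
    (char-code #\Replacement_Character)
    ;; combine the 5 low bits of B0 with the 6 low bits of B1
    (logior (ash (ldb (byte 5 0) b0) 6)
            (ldb (byte 6 0) b1))))
```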
I wasn't able to find a formal definition of UTF-8B anywhere; the
informal descriptions that I saw suggest that it's a way of
embedding binary data in UTF-8-encoded character data, with each
binary byte encoded in the low 8 bits of a 16-bit code whose high 8
bits contain #xdc. If the binary data is in fact embedded in the low
7 bits of codes in the range #xdc80-#xdcff, or something else, then
the following parameters would need to change:
(defparameter *utf-8b-binary-data-byte* (byte 8 0))
(defparameter *utf-8b-binary-marker-byte* (byte 13 8))
(defparameter *utf-8b-binary-marker-value* #xdc)
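With those byte specifiers, the marker test and the data extraction are
each a single LDB (values here assume the parameters as given above):

```lisp
(defparameter *utf-8b-binary-data-byte* (byte 8 0))
(defparameter *utf-8b-binary-marker-byte* (byte 13 8))
(defparameter *utf-8b-binary-marker-value* #xdc)

;; #xdc80 carries the binary byte #x80:
(ldb *utf-8b-binary-marker-byte* #xdc80)           ; => #xdc
(ldb *utf-8b-binary-data-byte* #xdc80)             ; => #x80
;; ordinary character codes don't match the marker:
(ldb *utf-8b-binary-marker-byte* (char-code #\A))  ; => 0
```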
PROCESS-BINARY and PROCESS-CHARACTER do whatever it is that you
want to do with a byte of binary data or a character. A real
decoder might want to take these functions - or a single function
that processes either a byte or a character - as arguments.
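For concreteness, trivial stand-ins (hypothetical, just so the decoder
below has something to call) might push their arguments onto lists:

```lisp
;; Hypothetical stand-ins for PROCESS-BINARY and PROCESS-CHARACTER:
;; collect decoded octets and characters for later inspection.
(defvar *decoded-bytes* '())
(defvar *decoded-characters* '())

(defun process-binary (octet)
  (push octet *decoded-bytes*))

(defun process-character (char)
  (push char *decoded-characters*))
```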
This is just #\Replacement_Character in CCL:
(defparameter *replacement-character* (code-char #xfffd))
(defun decode-utf-8b-stream (stream)
  (do* ((code (read-utf-8-code stream) (read-utf-8-code stream)))
       ((null code))                    ; eof
    (if (eql *utf-8b-binary-marker-value*
             (ldb *utf-8b-binary-marker-byte* code))
      (process-binary (ldb *utf-8b-binary-data-byte* code))
      (process-character (or (code-char code) *replacement-character*)))))
Isn't that the basic idea, whether the details/parameters are right
or not ?
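The variant that reads from a lisp vector (mentioned at the top) would
look much the same; here's a sketch that takes the processing function
as an argument, as suggested above. DECODE-UTF-8B-VECTOR is a made-up
name, and only ASCII and 3-byte sequences with lead byte #xed - the
range that contains the #xdcxx codes - are decoded; everything else
falls through to the replacement character:

```lisp
;; Sketch of a vector variant (hypothetical, not part of CCL).
;; PROCESS-FN is called with either (:binary octet) or (:character char).
;; Only ASCII and 3-byte sequences with lead byte #xed (U+D000-U+DFFF)
;; are decoded; anything else becomes #\Replacement_Character.
(defun decode-utf-8b-vector (octets process-fn)
  (let ((i 0)
        (n (length octets)))
    (loop while (< i n)
          do (let ((b0 (aref octets i)))
               (cond ((< b0 #x80)       ; single-byte (ASCII)
                      (funcall process-fn :character (code-char b0))
                      (incf i))
                     ((and (eql b0 #xed) (< (+ i 2) n))
                      ;; combine 4 + 6 + 6 bits of the 3-byte sequence
                      (let ((code (logior
                                   (ash (ldb (byte 4 0) b0) 12)
                                   (ash (ldb (byte 6 0) (aref octets (+ i 1))) 6)
                                   (ldb (byte 6 0) (aref octets (+ i 2))))))
                        (if (eql #xdc (ldb (byte 13 8) code))
                          (funcall process-fn :binary (ldb (byte 8 0) code))
                          (funcall process-fn :character
                                   (or (code-char code) (code-char #xfffd))))
                        (incf i 3)))
                     (t
                      (funcall process-fn :character (code-char #xfffd))
                      (incf i)))))))
```

E.g., decoding the octets #x41 #xed #xb2 #x80 (an ASCII "A" followed by
the UTF-8 encoding of #xdc80) should yield the character #\A and then
the binary byte #x80.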