[babel-devel] [Openmcl-devel] Changes
Gary Byers
gb at clozure.com
Sun Apr 12 20:35:38 UTC 2009
On Sun, 12 Apr 2009, Luís Oliveira wrote:
>
> Again, I'm curious how UTF-8B might be implemented when CODE-CHAR
> returns NIL for #xDC80 through #xDCFF.
>
Let's assume that we have something that reads a sequence of 1 or more
UTF-8-encoded bytes from a stream (and that we have variants that do
the same for bytes in foreign memory, a lisp vector, etc.) If it
gets an EOF while trying to read the first byte of a sequence, it
returns NIL; otherwise, it returns an unsigned integer less than
#x110000. If it can tell that a sequence is malformed (overlong,
whatever), it returns the CHAR-CODE of the Unicode replacement
character (#xfffd); it does not reject encoded values that correspond
to UTF-16 surrogate code points or other non-character code points.
(defun read-utf-8-code (stream)
  "Try to read 1 or more octets from STREAM.  Return NIL if EOF is
encountered when reading the first octet; otherwise, return an
unsigned integer less than #x110000.  If a malformed UTF-8 sequence
is detected, return the character code of #\Replacement_Character;
otherwise, return the encoded value."
  (let* ((b0 (read-byte stream nil nil)))
    (when b0
      (if (< b0 #x80)
        b0
        (if (< b0 #xc2)
          (char-code #\Replacement_Character)
          (let* ((b1 (read-byte stream nil nil)))
            (if (null b1)
              (char-code #\Replacement_Character)
              ;; [Lots of other details to get right, not shown]
              )))))))
This (or something very much like it) has to exist in order to support
UTF-8; the elided details are surprisingly complicated (if we want
to reject malformed sequences).
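For what it's worth, here's a sketch (not CCL's actual code) of how the
elided two-byte case might be filled in; DECODE-TWO-BYTE-TAIL is a
hypothetical helper name, and the three- and four-byte cases, overlong
and surrogate checks, etc. are more of the same:

```lisp
;; Hypothetical helper: decode the rest of a two-byte sequence whose
;; lead byte B0 is in #xc2-#xdf.  B1 is the next byte read from the
;; stream, or NIL on EOF.  Continuation bytes must be in #x80-#xbf.
(defun decode-two-byte-tail (b0 b1)
  (if (or (null b1)                  ; EOF in mid-sequence
          (not (<= #x80 b1 #xbf)))   ; not a continuation byte
    (char-code #\Replacement_Character)
    ;; combine the 5 low bits of B0 with the 6 low bits of B1
    (logior (ash (ldb (byte 5 0) b0) 6)
            (ldb (byte 6 0) b1))))
```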
I wasn't able to find a formal definition of UTF-8B anywhere; the
informal descriptions that I saw suggest that it's a way of
embedding binary data in UTF-8-encoded character data, with each
binary byte encoded in the low 8 bits of a 16-bit code whose high 8
bits contain #xdc. If the binary data is in fact embedded in the low
7 bits of codes in the range #xdc80-#xdcff, or something else, then
the following parameters would need to change:
(defparameter *utf-8b-binary-data-byte* (byte 8 0))
(defparameter *utf-8b-binary-marker-byte* (byte 13 8))
(defparameter *utf-8b-binary-marker-value* #xdc)
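With those byte specifiers, the marker test and the data extraction are
each a single LDB (values here assume the parameters as given above):

```lisp
(defparameter *utf-8b-binary-data-byte* (byte 8 0))
(defparameter *utf-8b-binary-marker-byte* (byte 13 8))
(defparameter *utf-8b-binary-marker-value* #xdc)

;; #xdc80 carries the binary byte #x80:
(ldb *utf-8b-binary-marker-byte* #xdc80)           ; => #xdc
(ldb *utf-8b-binary-data-byte* #xdc80)             ; => #x80
;; ordinary character codes don't match the marker:
(ldb *utf-8b-binary-marker-byte* (char-code #\A))  ; => 0
```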
PROCESS-BINARY and PROCESS-CHARACTER do whatever it is that you
want to do with a byte of binary data or a character. A real
decoder might want to take these functions - or a single function
that processes either a byte or a character - as arguments.
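For concreteness, trivial stand-ins (hypothetical, just so the decoder
below has something to call) might push their arguments onto lists:

```lisp
;; Hypothetical stand-ins for PROCESS-BINARY and PROCESS-CHARACTER:
;; collect decoded octets and characters for later inspection.
(defvar *decoded-bytes* '())
(defvar *decoded-characters* '())

(defun process-binary (octet)
  (push octet *decoded-bytes*))

(defun process-character (char)
  (push char *decoded-characters*))
```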
This is just #\Replacement_Character in CCL:
(defparameter *replacement-character* (code-char #xfffd))
(defun decode-utf-8b-stream (stream)
  (do* ((code (read-utf-8-code stream) (read-utf-8-code stream)))
       ((null code))                    ; eof
    (if (eql *utf-8b-binary-marker-value*
             (ldb *utf-8b-binary-marker-byte* code))
      (process-binary (ldb *utf-8b-binary-data-byte* code))
      (process-character (or (code-char code) *replacement-character*)))))
Isn't that the basic idea, whether the details/parameters are right
or not ?
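The variant that reads from a lisp vector (mentioned at the top) would
look much the same; here's a sketch that takes the processing function
as an argument, as suggested above. DECODE-UTF-8B-VECTOR is a made-up
name, and only ASCII and 3-byte sequences with lead byte #xed - the
range that contains the #xdcxx codes - are decoded; everything else
falls through to the replacement character:

```lisp
;; Sketch of a vector variant (hypothetical, not part of CCL).
;; PROCESS-FN is called with either (:binary octet) or (:character char).
;; Only ASCII and 3-byte sequences with lead byte #xed (U+D000-U+DFFF)
;; are decoded; anything else becomes #\Replacement_Character.
(defun decode-utf-8b-vector (octets process-fn)
  (let ((i 0)
        (n (length octets)))
    (loop while (< i n)
          do (let ((b0 (aref octets i)))
               (cond ((< b0 #x80)       ; single-byte (ASCII)
                      (funcall process-fn :character (code-char b0))
                      (incf i))
                     ((and (eql b0 #xed) (< (+ i 2) n))
                      ;; combine 4 + 6 + 6 bits of the 3-byte sequence
                      (let ((code (logior
                                   (ash (ldb (byte 4 0) b0) 12)
                                   (ash (ldb (byte 6 0) (aref octets (+ i 1))) 6)
                                   (ldb (byte 6 0) (aref octets (+ i 2))))))
                        (if (eql #xdc (ldb (byte 13 8) code))
                          (funcall process-fn :binary (ldb (byte 8 0) code))
                          (funcall process-fn :character
                                   (or (code-char code) (code-char #xfffd))))
                        (incf i 3)))
                     (t
                      (funcall process-fn :character (code-char #xfffd))
                      (incf i)))))))
```

E.g., decoding the octets #x41 #xed #xb2 #x80 (an ASCII "A" followed by
the UTF-8 encoding of #xdc80) should yield the character #\A and then
the binary byte #x80.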