[Ecls-list] mapping invalid octet sequence to unicode private areas or others.
mm_lists at pulsar-zone.net
Fri Oct 25 08:50:29 UTC 2013
On Mon, 21 Oct 2013 15:46:48 +0200
"Pascal J. Bourguignon" <pjb at informatimago.com> wrote:
> Matthew Mondor <mm_lists at pulsar-zone.net>
> > If you also mean that CLisp can also optionally do such conversions
> > transparently on request (or that its interface allows user code to do
> > this more efficiently), that's a good thing to know and I should look
> > at its implementation for ideas on the way it presents that interface.
> > I've added to my notes the link above, thanks a lot for your answer.
> That's not the case, but it could be an additional option to the
> :input-error-action parameter of make-encoding. In clisp the current
> options are:
> The :INPUT-ERROR-ACTION argument specifies what happens when an
> invalid byte sequence is encountered while converting bytes to
> characters. Its value can be :ERROR, :IGNORE or a character to be
> used instead. The UNICODE character #\uFFFD is typically used to
> indicate an error in the input sequence.
Sorry for not replying earlier, I wanted to make sure I had taken the
time to read the message properly.
In a way, the CLISP options still seem like good ideas, because they
make it possible to ignore invalid sequences without the overhead of the
condition system when the code doesn't need to care...
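To make the three behaviours concrete, here is a small portable sketch.
It is not CLISP's implementation: DECODE-ASCII is a hypothetical helper,
and to stay short it treats only 7-bit ASCII as valid, so every octet
>= 128 counts as "invalid". It also assumes char-codes below 128 are
ASCII, which holds in the implementations discussed here.

```lisp
;; Hypothetical helper for illustration; not part of CLISP or ECL.
(defun decode-ascii (octets &key (input-error-action :error))
  "Decode OCTETS as 7-bit ASCII; octets >= 128 are treated as invalid.
INPUT-ERROR-ACTION is :ERROR, :IGNORE, or a replacement character."
  (with-output-to-string (out)
    (loop for octet across octets
          do (cond ((< octet 128)
                    (write-char (code-char octet) out))
                   ;; :IGNORE drops the octet, no condition is signaled.
                   ((eq input-error-action :ignore))
                   ;; A character is written in place of the bad octet.
                   ((characterp input-error-action)
                    (write-char input-error-action out))
                   ;; Anything else (including :ERROR) signals an error.
                   (t (error "Invalid octet #x~2,'0X in input" octet))))))
```

With this, (decode-ascii #(72 105 255 33) :input-error-action :ignore)
skips the #xFF octet, while passing #\? substitutes a question mark for it.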
> There are several unicode private use areas:
> So we could add `(:invalid-map ,first-character-or-code) as possible
> value for :input-error-action, with the constraint that
>   (<= 0
>       (typecase first-character-or-code
>         (character (char-code first-character-or-code))
>         (unsigned-byte first-character-or-code)
>         (t -1))
>       (- char-code-limit 256))
> (There could be some additional constraints on the range, if there were
> missing characters, as in e.g. ccl:
>   clall -r '(length (loop for i below char-code-limit
>                           unless (ignore-errors (code-char i)) collect i))'
> Armed Bear Common Lisp --> 0
> Clozure Common Lisp --> 2050
> CLISP --> 0
> CMU Common Lisp --> 0
> ECL --> 0
> SBCL --> 0
> To use private areas, the user could restrict first-character-or-code to
> be of type:
>   `(or (integer #xe000 ,(- #xf8ff -1 256))
>        (integer #xf0000 ,(- #xffffd -1 256))
>        (integer #x100000 ,(- #x10fffd -1 256)))
> but we can allow any character range. 0 would be useful to consider
> invalid byte sequences as encoded in iso-8859-1.
> An additional :output-invalid-map parameter would specify the same for
> the reverse on output.
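As a note to self, the constraint quoted above can be turned into a pair
of small functions. This is only a sketch: INVALID-MAP-BASE and
MAP-INVALID-OCTET are hypothetical names, and the result is kept as an
integer code because CODE-CHAR may return NIL for some codes on some
implementations (as the ccl count above shows).

```lisp
;; Hypothetical helpers for illustration; not an existing API.
(defun invalid-map-base (first-character-or-code)
  "Return the first code of the 256-code block, or signal an error
if the block would not fit below CHAR-CODE-LIMIT."
  (let ((base (typecase first-character-or-code
                (character (char-code first-character-or-code))
                (unsigned-byte first-character-or-code)
                (t -1))))
    (unless (<= 0 base (- char-code-limit 256))
      (error "~S is not a usable :invalid-map base" first-character-or-code))
    base))

(defun map-invalid-octet (octet base)
  "Map an invalid OCTET (0..255) to a character code in the block at BASE."
  (+ base octet))
```

With a base of 0, (map-invalid-octet #xE9 0) is #xE9, i.e. the octet is
read as if it were iso-8859-1; with a base of #xE000 the same octet lands
at #xE0E9 in the first private use area.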
I will probably have to come back to the above again; I'm keeping it in
my notes. If I remember correctly, for "UTF-8B" a UTF-16 low-surrogate
range (#xDCxx) has usually been used to store individual octets
considered invalid.
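A minimal sketch of that surrogate mapping, assuming the usual
convention where an invalid octet #xNN (always >= #x80, since lower
octets are valid ASCII) becomes the code #xDCNN. The names are
hypothetical, and integers are used instead of characters because some
implementations have no character objects for surrogate codes.

```lisp
;; Hypothetical names; the #xDC00 base follows the UTF-8B convention.
(defconstant +utf-8b-base+ #xDC00)

(defun octet-to-escape-code (octet)
  "Map an invalid input octet (#x80..#xFF) to a code in #xDC80..#xDCFF."
  (check-type octet (integer #x80 #xFF))
  (+ +utf-8b-base+ octet))

(defun escape-code-to-octet (code)
  "Inverse mapping for output: return the original octet if CODE is an
escape code in #xDC80..#xDCFF, otherwise NIL."
  (when (<= #xDC80 code #xDCFF)
    (- code +utf-8b-base+)))
```

Because well-formed UTF-8 never decodes to a surrogate, the escape codes
cannot collide with real input, which is what makes the round trip
back to the original octets possible on output.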
> Finally, another option to :input-error-action could be to take a
> (function (input-stream octet-vector) (values &optional (or character
> string))) octet-vector being a vector of invalid bytes read so far from
> the stream. The function could further read bytes, and either build and
> return a character or string, return no value (ignoring the read bytes),
> or signal an error.
> :output-invalid-map could take a
> (function (output-stream unencodable-character))
> that could do whatever it wants to encode, replace or ignore the
> unencodable-character on the output-stream.
> So I guess :utf-8b could be defined as designating:
> (make-encoding :charset :utf-8 :input-error-action '(:invalid-map 0))
> or equivalently:
> (make-encoding :charset :utf-8
>                :input-error-action (lambda (stream octet-vector)
>                                      (map 'string (function code-char) octet-vector)))
This is interesting as it provides an alternative to conditions to
provide custom handling, while also allowing them for legacy code (or
code which structures better using them).
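A rough sketch of how a function-valued handler could plug into a
decoder, without involving conditions. This is not CLISP's or ECL's
code: DECODE-UTF-8 and its helpers are hypothetical, it works on an
octet vector rather than a stream, and the handler receives only a
one-element vector holding the offending octet (simpler than the
(stream octet-vector) signature proposed above). The handler may return
a character, a string, or NIL to ignore the octet.

```lisp
(defun utf-8-sequence-length (lead)
  "Expected sequence length for lead octet LEAD, or NIL if LEAD is invalid."
  (cond ((< lead #x80) 1)
        ((<= #xC2 lead #xDF) 2)
        ((<= #xE0 lead #xEF) 3)
        ((<= #xF0 lead #xF4) 4)
        (t nil)))

(defun decode-sequence (octets start len)
  "Return the code point of the LEN-octet sequence at START, or NIL if
it is malformed (bad continuation octet, overlong form, surrogate)."
  (let ((code (if (= len 1)
                  (aref octets start)
                  ;; Keep only the payload bits of the lead octet.
                  (logand (aref octets start) (ash #xFF (- (1+ len)))))))
    (loop for k from 1 below len
          for c = (aref octets (+ start k))
          do (if (= (logand c #xC0) #x80)
                 (setf code (logior (ash code 6) (logand c #x3F)))
                 (return-from decode-sequence nil)))
    (and (>= code (aref #(0 0 #x80 #x800 #x10000) len))
         (not (<= #xD800 code #xDFFF))
         (<= code #x10FFFF)
         code)))

(defun decode-utf-8 (octets handler)
  "Decode OCTETS as UTF-8, calling HANDLER on each invalid octet."
  (with-output-to-string (out)
    (let ((i 0) (end (length octets)))
      (loop while (< i end)
            do (let* ((len (utf-8-sequence-length (aref octets i)))
                      (code (and len
                                 (<= (+ i len) end)
                                 (decode-sequence octets i len)))
                      ;; CODE-CHAR may return NIL on some implementations.
                      (char (and code (code-char code))))
                 (cond (char
                        (write-char char out)
                        (incf i len))
                       (t
                        (let ((replacement
                                (funcall handler (subseq octets i (1+ i)))))
                          (typecase replacement
                            (character (write-char replacement out))
                            (string (write-string replacement out))))
                        (incf i)))))))))
```

For instance, passing (lambda (bytes) (map 'string #'code-char bytes))
as the handler gives the iso-8859-1 fallback from the quoted example,
and (constantly nil) gives the equivalent of :IGNORE.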