[Ecls-list] mapping invalid octet sequence to unicode private areas or others.

Fri Oct 25 08:50:29 UTC 2013

On Mon, 21 Oct 2013 15:46:48 +0200
"Pascal J. Bourguignon" <pjb at informatimago.com> wrote:

> Matthew Mondor <mm_lists at pulsar-zone.net>
> writes:
> > If you also mean that CLisp can also optionally do such conversions
> > transparently on request (or that its interface allows user code to do
> > this more efficiently), that's a good thing to know and I should look
> > at its implementation for ideas on the way it presents that interface.
> > I've added to my notes the link above, thanks a lot for your answer.
> 
> That's not the case, but it could be an additional option to the
> :input-error-action parameter of make-encoding.  In clisp the current
> options are:
> 
>     The :INPUT-ERROR-ACTION argument specifies what happens when an
>     invalid byte sequence is encountered while converting bytes to
>     characters. Its value can be :ERROR, :IGNORE or a character to be
>     used instead. The UNICODE character #\uFFFD is typically used to
>     indicate an error in the input sequence.

Sorry for not replying earlier, I wanted to make sure I had taken the
time to read the message properly.

In a way, CLisp's options still seem like good ideas, because they allow
invalid sequences to be ignored without the overhead of the condition
system when the code doesn't need to care...
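
For reference, a minimal sketch of how those existing options might be
exercised.  It assumes CLISP only: EXT:MAKE-ENCODING,
EXT:CONVERT-STRING-FROM-BYTES and CHARSET:UTF-8 are CLISP extensions, and
DECODE-WITH-ACTION is a hypothetical helper name used here purely for
illustration.

```lisp
;; CLISP only; the form is skipped by the reader on other implementations.
;; DECODE-WITH-ACTION is a hypothetical helper, not part of CLISP.
#+clisp
(defun decode-with-action (octets action)
  "Decode OCTETS as UTF-8, using ACTION as the :INPUT-ERROR-ACTION."
  (ext:convert-string-from-bytes
   (coerce octets '(vector (unsigned-byte 8)))
   (ext:make-encoding :charset charset:utf-8
                      :input-error-action action)))

;; Possible calls; #xFF can never appear in well-formed UTF-8:
;;   (decode-with-action #(97 255 98) :ignore)
;;   (decode-with-action #(97 255 98) (code-char #xFFFD))
;;   (decode-with-action #(97 255 98) :error)
```

Neither :IGNORE nor a replacement character needs a handler to be
established around the call, which is the point made above.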

> There are several unicode private use areas:
> http://en.wikipedia.org/wiki/Private_Use_%28Unicode%29#Private_Use_Areas
> 
> So we could add `(:invalid-map ,first-character-or-code) as possible
> value for :input-error-action, with the constraint that
> 
>  (<= 0 
>      (typecase first-character-or-code
>         (character      (char-code first-character-or-code))
>         (unsigned-byte  first-character-or-code)
>         (t              -1))
>      (- char-code-limit 256))
> 
> 
> (There could be some additional constraints on the range, if there were
>  missing characters like in eg. ccl:
> 
>     clall -r '(length (loop for i below char-code-limit
>                           unless (ignore-errors (code-char i)) collect i))'
> 
>     Armed Bear Common Lisp         --> 0
>     Clozure Common Lisp            --> 2050
>     CLISP                          --> 0
>     CMU Common Lisp                --> 0
>     ECL                            --> 0
>     SBCL                           --> 0
> ).
> 
> To use private areas, the user could restrict first-character-or-code to
> be of type:
> 
>          `(or (integer #xe000   ,(- #xf8ff -1 256))
>               (integer #xf0000  ,(- #xffffd -1 256))
>               (integer #x100000 ,(- #x10fffd -1 256)))
> 
> but we can allow any character range.  0 would be useful to consider
> invalid byte sequences as encoded in iso-8859-1.
> 
> 
> An additional :output-invalid-map parameter would specify the same for
> the reverse on output.
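
To make the range constraint above concrete, here is a small portable
sketch of it.  The names INVALID-MAP-BASE-CODE, VALID-INVALID-MAP-BASE-P
and PRIVATE-USE-BASE are hypothetical (nothing like them exists in CLISP
or ECL as far as I know), and the optional LIMIT argument is only there so
the check can be tried against a limit other than the running
implementation's CHAR-CODE-LIMIT.

```lisp
;; Hypothetical helpers illustrating the proposed constraint.
(defun invalid-map-base-code (first-character-or-code)
  "Return the base as an integer code, or -1 if it is neither a
character nor a non-negative integer."
  (typecase first-character-or-code
    (character     (char-code first-character-or-code))
    (unsigned-byte first-character-or-code)
    (t             -1)))

(defun valid-invalid-map-base-p (first-character-or-code
                                 &optional (limit char-code-limit))
  "True when the 256 codes starting at the base all lie below LIMIT."
  (<= 0
      (invalid-map-base-code first-character-or-code)
      (- limit 256)))

;; Bases for which all 256 mapped codes stay inside one private use area.
;; (- end 255) is the same bound as (- end -1 256) in the quoted form.
(deftype private-use-base ()
  `(or (integer #xE000   ,(- #xF8FF   255))
       (integer #xF0000  ,(- #xFFFFD  255))
       (integer #x100000 ,(- #x10FFFD 255))))
```

For example, (typep #xF800 'private-use-base) holds because #xF800 + 255
is #xF8FF, the last code of the first private use area, while #xF801 is
rejected.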

I will probably have to consult the above again, so I'm keeping it in
my notes.  If I remember correctly, for "UTF-8b" a UTF-16 surrogate range
(#xDCxx) has usually been used to store the individual octets of invalid
UTF-8 sequences.
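
A rough sketch of that octet-to-surrogate mapping, as I understand it.
The function names and the constant are hypothetical.  It works on integer
codes rather than characters on purpose: an implementation may have no
character objects for surrogate codes (presumably part of what the
non-zero count for Clozure CL in the quoted list reflects), so CODE-CHAR
could return NIL there.

```lisp
;; Hypothetical names; a sketch of the UTF-8b style mapping only.
(defconstant +utf-8b-base+ #xDC00
  "Start of the low surrogate range used to hold invalid octets.")

(defun octet-to-escape-code (octet)
  "Map an invalid octet to the code #xDC00 + OCTET.
Octets below #x80 are plain ASCII and are never invalid on their own,
so only #x80..#xFF are accepted."
  (check-type octet (integer #x80 #xFF))
  (+ +utf-8b-base+ octet))

(defun escape-code-to-octet (code)
  "Return the original octet if CODE is an escape code, else NIL."
  (when (<= #xDC80 code #xDCFF)
    (- code +utf-8b-base+)))
```

With this, the octet #xE9 becomes the code #xDCE9 on input and can be
turned back into #xE9 on output, so the round trip is lossless.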

> Finally, another option to :input-error-action could be to take a
> (function (input-stream octet-vector) (values &optional (or character
> string))) octet-vector being a vector of invalid bytes read so far from
> the stream.  The function could further read bytes, and either build and
> return a character or string, return no value (ignoring the read bytes),
> or signal an error.
> 
> :output-invalid-map could take a 
> (function (output-stream unencodable-character))
> that could do whatever it wants to encode, replace or ignore the
> unencodable-character on the output-stream.
> 
> 
> 
> So I guess :utf-8b could be defined as designating:
> 
> (make-encoding :charset :utf-8 :input-error-action '(:invalid-map 0))
> 
> or equivalently:
> 
> (make-encoding :charset :utf-8
>    :input-error-action (lambda (stream octet-vector)
>                           (map 'string (function code-char) octet-vector)))

This is interesting, as it offers an alternative to conditions for
custom handling, while still allowing conditions for legacy code (or
code that is better structured around them).
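
To check my understanding, here is a portable sketch of a decoder whose
error action can be :ERROR, :IGNORE, a character, or a function.  It is
not CLISP's or ECL's implementation: DECODE-UTF-8, DECODE-AT,
SEQUENCE-LENGTH and HANDLE-INVALID are hypothetical names, the handler is
simplified to take only the vector of invalid octets (no stream
argument), each invalid octet is reported on its own, and the
implementation is assumed to have a character for every decoded code.

```lisp
(defun sequence-length (lead)
  "Length of the UTF-8 sequence that LEAD starts, or NIL if it starts none."
  (cond ((< lead #x80) 1)
        ((<= #xC2 lead #xDF) 2)
        ((<= #xE0 lead #xEF) 3)
        ((<= #xF0 lead #xF4) 4)
        (t nil)))

(defun decode-at (octets start)
  "Return (values code length) for a well-formed sequence at START, else NIL."
  (let* ((lead (aref octets start))
         (len (sequence-length lead)))
    (when (and len (<= (+ start len) (length octets)))
      (if (= len 1)
          (values lead 1)
          (let ((code (ldb (byte (- 7 len) 0) lead)))
            (loop for i from (1+ start) below (+ start len)
                  for octet = (aref octets i)
                  do (if (= (logand octet #xC0) #x80)
                         (setf code (logior (ash code 6) (logand octet #x3F)))
                         (return-from decode-at nil)))
            ;; Reject overlong forms, surrogates and codes above #x10FFFF.
            (when (and (>= code (aref #(0 0 #x80 #x800 #x10000) len))
                       (not (<= #xD800 code #xDFFF))
                       (<= code #x10FFFF))
              (values code len)))))))

(defun handle-invalid (invalid action out)
  "Apply ACTION to the vector of INVALID octets, writing any result to OUT."
  (etypecase action
    ((eql :error)  (error "Invalid UTF-8 octet #x~2,'0X" (aref invalid 0)))
    ((eql :ignore) nil)
    (character     (write-char action out))
    (function
     ;; The handler may return a character, a string, or nothing.
     (let ((replacement (funcall action invalid)))
       (etypecase replacement
         (null      nil)
         (character (write-char replacement out))
         (string    (write-string replacement out)))))))

(defun decode-utf-8 (octets &key (input-error-action :error))
  "Decode the vector OCTETS, delegating invalid octets to INPUT-ERROR-ACTION."
  (with-output-to-string (out)
    (let ((i 0))
      (loop while (< i (length octets))
            do (multiple-value-bind (code len) (decode-at octets i)
                 (cond (code
                        (write-char (code-char code) out)
                        (incf i len))
                       (t
                        (handle-invalid (subseq octets i (1+ i))
                                        input-error-action out)
                        (incf i))))))))
```

Passing (lambda (octets) (map 'string #'code-char octets)) as the action
then reads stray octets as if they were ISO-8859-1, like the second
MAKE-ENCODING form quoted above, and no handler has to be established
unless the action is :ERROR.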
-- 
Matt