[Ecls-list] mapping invalid octet sequence to unicode private areas or others.

Mon Oct 21 13:46:48 UTC 2013

Matthew Mondor <mm_lists at pulsar-zone.net>
writes:

> On Mon, 21 Oct 2013 12:24:50 +0200
> "Pascal J. Bourguignon" <pjb at informatimago.com> wrote:
>
>> When reading utf-8 or other unicode streams, invalid byte sequences can
>> signal errors, be substituted by a given character, or be encoded into
>> application-reserved code points to be able to transparently transmit the
>> invalid byte sequence.  Cf. clisp :INPUT-ERROR-ACTION parameter of
>> ext:make-encoding (clisp encodings are external-format values).
>> http://clisp.org/impnotes/encoding.html#make-encoding
>
> I agree with the above, and it's currently possible in ECL to handle
> UTF-8 decoding errors (ext:stream-decoding-error) with access to the
> octets of the invalid sequence (ext:character-decoding-error-octets),
> with an available restart (invoke-restart 'use-value ...).  Thus an
> application is free to also recode the invalid octets to LATIN or to
> implement "UTF8-B" at its discretion, if it implements its own input
> and output.
>
> The advantage of native modes such as UTF8-B or UTF8-LATIN-1 etc would
> be performance and simplicity in cases where this is wanted, but the
> default UTF-8 streams would continue to explicitly signal decoding
> errors, definitely.
>
> If you mean that CLisp can also optionally do such conversions
> transparently on request (or that its interface allows user code to do
> this more efficiently), that's a good thing to know and I should look
> at its implementation for ideas on the way it presents that interface.
> I've added to my notes the link above, thanks a lot for your answer.

That's not the case, but it could be an additional option to the
:input-error-action parameter of make-encoding.  In clisp the current
options are:

    The :INPUT-ERROR-ACTION argument specifies what happens when an
    invalid byte sequence is encountered while converting bytes to
    characters. Its value can be :ERROR, :IGNORE or a character to be
    used instead. The UNICODE character #\uFFFD is typically used to
    indicate an error in the input sequence.

There are several unicode private use areas:
http://en.wikipedia.org/wiki/Private_Use_%28Unicode%29#Private_Use_Areas

So we could add `(:invalid-map ,first-character-or-code) as a possible
value for :input-error-action, with the constraint that

 (<= 0 
     (typecase first-character-or-code
        (character      (char-code first-character-or-code))
        (unsigned-byte  first-character-or-code)
        (t              -1))
     (- char-code-limit 256))

(There could be some additional constraints on the range if some codes
 had no corresponding character, as in e.g. ccl:

    clall -r '(length (loop for i below char-code-limit
                          unless (ignore-errors (code-char i)) collect i))'

    Armed Bear Common Lisp         --> 0
    Clozure Common Lisp            --> 2050
    CLISP                          --> 0
    CMU Common Lisp                --> 0
    ECL                            --> 0
    SBCL                           --> 0
).
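
The constraint above can be wrapped in a predicate.  This is only a
minimal sketch in portable Common Lisp; the name invalid-map-base-p is
hypothetical and is not part of clisp or ECL:

```lisp
;; Hypothetical helper, not an existing clisp or ECL function.
(defun invalid-map-base-p (first-character-or-code)
  "True when the 256 codes starting at FIRST-CHARACTER-OR-CODE
all lie below CHAR-CODE-LIMIT."
  (<= 0
      (typecase first-character-or-code
        (character     (char-code first-character-or-code))
        (unsigned-byte first-character-or-code)
        (t             -1))          ; any other object is rejected
      (- char-code-limit 256)))
```

It accepts either a character or a non-negative integer, and rejects
anything else by mapping it to -1.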

To use private areas, the user could restrict first-character-or-code to
be of type:

         `(or (integer #xe000   ,(- #xf8ff -1 256))
              (integer #xf0000  ,(- #xffffd -1 256))
              (integer #x100000 ,(- #x10fffd -1 256)))

but we can allow any character range.  0 would be useful to consider
invalid byte sequences as encoded in iso-8859-1.
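
As a sketch of how the mapping itself might look, assuming a
Unicode-capable implementation: the type name private-use-map-base and
the function map-invalid-octet are hypothetical names used only for
illustration.

```lisp
;; Hypothetical type: bases whose 256-code window fits in one private use area.
(deftype private-use-map-base ()
  `(or (integer #xe000   ,(- #xf8ff -1 256))
       (integer #xf0000  ,(- #xffffd -1 256))
       (integer #x100000 ,(- #x10fffd -1 256))))

;; Hypothetical helper: the character standing for one invalid octet.
(defun map-invalid-octet (base octet)
  "Return the character whose code is BASE plus OCTET."
  (check-type octet (unsigned-byte 8))
  (code-char (+ base octet)))
```

With a base of 0 the invalid octet #xE9 simply becomes the character
with code #xE9, which is the iso-8859-1 reading mentioned above.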

An additional :output-invalid-map parameter would specify the same for
the reverse on output.
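
The reverse direction might be sketched as follows, under the same
assumptions; unmap-invalid-character is a hypothetical name, and how an
encoder would call it is left open here.

```lisp
;; Hypothetical helper: recover the original octet on output.
(defun unmap-invalid-character (base character)
  "Return the octet CHARACTER stands for under BASE,
or NIL when CHARACTER lies outside the 256-code window."
  (let ((offset (- (char-code character) base)))
    (when (<= 0 offset 255)
      offset)))
```

An encoder would write the returned octet as-is, and encode the
character normally when the result is NIL.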

Finally, another option to :input-error-action could be to take a
(function (input-stream octet-vector) (values &optional (or character
string))) octet-vector being a vector of invalid bytes read so far from
the stream.  The function could further read bytes, and either build and
return a character or string, return no value (ignoring the read bytes),
or signal an error.
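
To illustrate the calling convention only, here is a toy decoder that
treats every octet above 127 as invalid (plain ASCII rather than UTF-8,
to keep it short).  The name decode-ascii is hypothetical, and since
this sketch has no real input stream it passes NIL in its place.

```lisp
;; Hypothetical toy decoder; real code would work on a stream.
(defun decode-ascii (octets handler)
  "Decode OCTETS as ASCII, calling HANDLER on each invalid octet."
  (with-output-to-string (out)
    (loop for octet across octets
          do (if (< octet 128)
                 (write-char (code-char octet) out)
                 ;; Collect the values so that "no value" can be told
                 ;; apart from a returned character or string.
                 (let ((result (multiple-value-list
                                (funcall handler nil (vector octet)))))
                   (when result
                     ;; PRINC writes a character or a string unescaped.
                     (princ (first result) out)))))))
```

A handler returning (values) drops the invalid octet, one returning a
character substitutes it, and one calling ERROR aborts the decoding.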

:output-invalid-map could take a 
(function (output-stream unencodable-character))
that could do whatever it wants to encode, replace or ignore the
unencodable-character on the output-stream.

So I guess :utf-8b could be defined as designating:

(make-encoding :charset :utf-8 :input-error-action '(:invalid-map 0))

or equivalently:

(make-encoding :charset :utf-8
   :input-error-action (lambda (stream octet-vector)
                          (declare (ignore stream))
                          (map 'string (function code-char) octet-vector)))
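
The equivalence of the two forms can be sketched by turning a base code
into a handler function.  The name invalid-map-handler is hypothetical;
it only shows how an implementation might reduce the first form to the
second.

```lisp
;; Hypothetical helper: build a handler function from a base code.
(defun invalid-map-handler (base)
  "Return a function mapping each invalid octet to the character
with code BASE plus that octet."
  (lambda (stream octet-vector)
    (declare (ignore stream))
    (map 'string
         (lambda (octet) (code-char (+ base octet)))
         octet-vector)))
```

With a base of 0, the returned function gives the same string as the
lambda above for any octet vector.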

-- 
__Pascal Bourguignon__
http://www.informatimago.com/