[Ecls-list] mapping invalid octet sequence to unicode private areas or others.
Pascal J. Bourguignon
pjb at informatimago.com
Mon Oct 21 13:46:48 UTC 2013
Matthew Mondor <mm_lists at pulsar-zone.net> writes:
> On Mon, 21 Oct 2013 12:24:50 +0200
> "Pascal J. Bourguignon" <pjb at informatimago.com> wrote:
>> When reading utf-8 or other unicode streams, invalid byte sequences can
>> signal errors, be substituted by a given character, or be encoded into
>> application-reserved code points to be able to transparently transmit the
>> invalid byte sequence. Cf. clisp :INPUT-ERROR-ACTION parameter of
>> ext:make-encoding (clisp encodings are external-format values).
> I agree with the above, and it's currently possible in ECL to handle
> UTF-8 decoding errors (ext:stream-decoding-error) with access to the
> octets of the invalid sequence (ext:character-decoding-error-octets),
> with an available restart (invoke-restart 'use-value ...). Thus an
> application is free to also recode the invalid octets to LATIN or to
> implement "UTF8-B" at its discretion, if it implements its own input
> and output.
> The advantage of native modes such as UTF8-B or UTF8-LATIN-1 etc would
> be performance and simplicity in cases where this is wanted, but the
> default UTF-8 streams would continue to explicitly signal decoding
> errors, definitely.
> If you mean that CLisp can also optionally do such conversions
> transparently on request (or that its interface allows user code to do
> this more efficiently), that's a good thing to know and I should look
> at its implementation for ideas on the way it presents that interface.
> I've added to my notes the link above, thanks a lot for your answer.
That's not the case, but it could be an additional option to the
:input-error-action parameter of make-encoding. In clisp the current
documentation says:
    The :INPUT-ERROR-ACTION argument specifies what happens when an
    invalid byte sequence is encountered while converting bytes to
    characters. Its value can be :ERROR, :IGNORE or a character to be
    used instead. The UNICODE character #\uFFFD is typically used to
    indicate an error in the input sequence.
There are several unicode private use areas:
    U+E000   .. U+F8FF
    U+F0000  .. U+FFFFD
    U+100000 .. U+10FFFD
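So we could add `(:invalid-map ,first-character-or-code) as a possible
value for :input-error-action, with the constraint that
    (<= 0 (if (characterp first-character-or-code)
              (char-code first-character-or-code)
              first-character-or-code)
        (- char-code-limit 256))
so that all 256 mapped codes stay below char-code-limit.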
There could be some additional constraints on the range if characters
were missing, as in e.g. ccl:
    clall -r '(length (loop for i below char-code-limit
                            unless (ignore-errors (code-char i)) collect i))'
    Armed Bear Common Lisp --> 0
    Clozure Common Lisp    --> 2050
    CLISP                  --> 0
    CMU Common Lisp        --> 0
    ECL                    --> 0
    SBCL                   --> 0
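
As a minimal sketch of what the proposed (:invalid-map base) value would
do, assuming the decoder hands over each invalid octet separately: octet N
becomes the character with code base + N, and the reverse mapping recovers
the octet on output. The names invalid-octet-to-char and
char-to-invalid-octet are hypothetical; nothing like them exists in clisp
or ECL.

```lisp
;; Hypothetical helpers illustrating the proposed (:invalid-map BASE) mapping.
;; Neither function is part of clisp or ECL.

(defun invalid-octet-to-char (base octet)
  "Map an invalid OCTET (0..255) to the character with code BASE + OCTET."
  (check-type octet (integer 0 255))
  (assert (<= 0 base (- char-code-limit 256)) (base)
          "BASE must leave room for 256 codes below CHAR-CODE-LIMIT.")
  ;; CODE-CHAR may return NIL for codes that have no character (see ccl).
  (or (code-char (+ base octet))
      (error "No character with code ~D in this implementation."
             (+ base octet))))

(defun char-to-invalid-octet (base char)
  "Return the octet that CHAR stands for under BASE, or NIL when CHAR lies
outside the 256-code window starting at BASE."
  (let ((offset (- (char-code char) base)))
    (when (<= 0 offset 255)
      offset)))
```

With base 0 the first function is just code-char on the octet, which is
the iso-8859-1 reading of the invalid bytes.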
To use private areas, the user could restrict first-character-or-code to be
    `(or (integer #xe000   ,(- #xf8ff -1 256))
         (integer #xf0000  ,(- #xffffd -1 256))
         (integer #x100000 ,(- #x10fffd -1 256)))
but we can allow any character range. 0 would be useful to consider
invalid byte sequences as encoded in iso-8859-1.
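
The private-area restriction can be written down as a type, so that user
code can check a candidate base with typep. This is only a sketch; the
type name private-area-map-base is made up for the illustration.

```lisp
;; Hypothetical type: a BASE such that BASE .. BASE+255 stays inside one
;; Unicode private use area. For instance (- #xF8FF -1 256) is #xF800,
;; and #xF800 + 255 = #xF8FF, the last code of the first area.
(deftype private-area-map-base ()
  `(or (integer #xE000   ,(- #xF8FF -1 256))
       (integer #xF0000  ,(- #xFFFFD -1 256))
       (integer #x100000 ,(- #x10FFFD -1 256))))

(defun private-area-map-base-p (base)
  "True when BASE is an acceptable start of a 256-code private-area window."
  (typep base 'private-area-map-base))
```

A base of 0 is deliberately rejected by this type, since the iso-8859-1
reading does not live in a private area.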
An additional :output-invalid-map parameter would specify the same for
the reverse on output.
Finally, another option to :input-error-action could be to take a
    (function (input-stream octet-vector)
              (values &optional (or character string)))
octet-vector being a vector of invalid bytes read so far from the
stream. The function could further read bytes, and either build and
return a character or string, return no value (ignoring the read bytes),
or signal an error.
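
To make the three possible outcomes concrete, here is a sketch of two
handlers with the proposed signature, plus a helper showing how a decoder
might normalize what they return. All three names are hypothetical, and
the stream argument is unused here (a real handler could read more bytes
from it).

```lisp
;; Hypothetical handlers for the proposed function form of
;; :input-error-action. None of these names exist in clisp or ECL.

(defun latin-1-handler (stream octets)
  "Reinterpret each invalid octet as the iso-8859-1 character of that code."
  (declare (ignore stream))
  (map 'string #'code-char octets))

(defun ignoring-handler (stream octets)
  "Drop the invalid octets by returning no value."
  (declare (ignore stream octets))
  (values))

(defun handler-result-string (handler stream octets)
  "Call HANDLER and turn its result into a string, the way a decoder might:
no value means the bytes are ignored, a character is a one-element string."
  (let ((results (multiple-value-list (funcall handler stream octets))))
    (cond ((null results) "")
          ((characterp (first results)) (string (first results)))
          (t (first results)))))
```

Signalling an error from the handler needs no special support: the
condition simply propagates out of the read operation.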
:output-invalid-map could take a
(function (output-stream unencodable-character))
that could do whatever it wants to encode, replace or ignore the
unencodable-character on the output-stream.
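
For illustration, two output-side functions of that shape, written against
an ordinary character stream so they can be tried with
with-output-to-string. Both names are hypothetical, and how the real
encoding layer would call them is an assumption.

```lisp
;; Hypothetical handlers for the proposed function form of
;; :output-invalid-map.

(defun replacing-output-handler (stream char)
  "Replace the unencodable CHAR with a question mark."
  (declare (ignore char))
  (write-char #\? stream))

(defun naming-output-handler (stream char)
  "Write a readable escape such as U+20AC in place of the unencodable CHAR."
  (format stream "U+~4,'0X" (char-code char)))
```

A handler that ignores the character would simply write nothing.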
So I guess :utf-8b could be defined as designating:
    (make-encoding :charset :utf-8 :input-error-action '(:invalid-map 0))
or, with the function variant:
    (make-encoding :charset :utf-8
                   :input-error-action (lambda (stream octet-vector)
                                         (declare (ignore stream))
                                         (map 'string (function code-char)
                                              octet-vector)))