[Ecls-list] mapping invalid octet sequence to unicode private areas or others.
Matthew Mondor
mm_lists at pulsar-zone.net
Fri Oct 25 08:50:29 UTC 2013
On Mon, 21 Oct 2013 15:46:48 +0200
"Pascal J. Bourguignon" <pjb at informatimago.com> wrote:
> Matthew Mondor <mm_lists at pulsar-zone.net>
> writes:
> > If you also mean that CLisp can also optionally do such conversions
> > transparently on request (or that its interface allows user code to do
> > this more efficiently), that's a good thing to know and I should look
> > at its implementation for ideas on the way it presents that interface.
> > I've added to my notes the link above, thanks a lot for your answer.
>
> That's not the case, but it could be an additional option to the
> :input-error-action parameter of make-encoding. In clisp the current
> options are:
>
> The :INPUT-ERROR-ACTION argument specifies what happens when an
> invalid byte sequence is encountered while converting bytes to
> characters. Its value can be :ERROR, :IGNORE or a character to be
> used instead. The UNICODE character #\uFFFD is typically used to
> indicate an error in the input sequence.
Sorry for not replying earlier, I wanted to make sure I had taken the
time to read the message properly.
In a way, the CLISP options still seem like good ideas, because they
allow code to ignore invalid sequences without the overhead of the
condition system when it doesn't need to care...
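
To make the three documented options concrete for my notes, here is a
minimal sketch in portable Common Lisp. decode-ascii is a hypothetical
helper of mine, not CLISP's or ECL's code, and it decodes 7-bit ASCII
rather than UTF-8 only to keep the example short: any octet >= 128
plays the role of an "invalid byte sequence".

```lisp
;; Hypothetical helper (not part of CLISP or ECL).  Octets below 128
;; decode as ASCII; anything else is treated as invalid input and
;; handled according to INPUT-ERROR-ACTION.
(defun decode-ascii (octets &key (input-error-action :error))
  (with-output-to-string (out)
    (loop for octet across octets
          do (cond ((< octet 128)
                    (write-char (code-char octet) out))
                   ((eq input-error-action :error)
                    (error "Invalid octet #x~2,'0X" octet))
                   ((eq input-error-action :ignore)) ; drop it silently
                   ((characterp input-error-action)
                    (write-char input-error-action out))
                   (t
                    (error "Unsupported :input-error-action ~S"
                           input-error-action))))))
```

With :ignore or a replacement character no condition is ever signalled,
which is where the saving over a handler-based approach would come from.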
> There are several unicode private use areas:
> http://en.wikipedia.org/wiki/Private_Use_%28Unicode%29#Private_Use_Areas
>
> So we could add `(:invalid-map ,first-character-or-code) as possible
> value for :input-error-action, with the constraint that
>
>     (<= 0
>         (typecase first-character-or-code
>           (character (char-code first-character-or-code))
>           (unsigned-byte first-character-or-code)
>           (t -1))
>         (- char-code-limit 256))
>
>
> (There could be some additional constraints on the range, if there were
> missing characters like in e.g. ccl:
>
>     clall -r '(length (loop for i below char-code-limit
>                             unless (ignore-errors (code-char i)) collect i))'
>
> Armed Bear Common Lisp --> 0
> Clozure Common Lisp --> 2050
> CLISP --> 0
> CMU Common Lisp --> 0
> ECL --> 0
> SBCL --> 0
> ).
>
> To use private areas, the user could restrict first-character-or-code to
> be of type:
>
>     `(or (integer #xe000   ,(- #xf8ff -1 256))
>          (integer #xf0000  ,(- #xffffd -1 256))
>          (integer #x100000 ,(- #x10fffd -1 256)))
>
> but we can allow any character range. 0 would be useful to consider
> invalid byte sequences as encoded in iso-8859-1.
>
>
> An additional :output-invalid-map parameter would specify the same for
> the reverse on output.
I will probably have to consult the above again; I'm keeping it in
my notes. If I remember correctly, "UTF-8b" has usually used a UTF-16
surrogate range (#xDCxx) to store the individual octets of invalid
UTF-8 sequences.
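
A rough sketch of the base-plus-octet mapping, for my own reference.
All three function names are hypothetical. The range check is the
constraint quoted above, and I work with integer codes rather than
characters because code-char is not guaranteed to return a character
for every code (see the Clozure CL count quoted above).

```lisp
;; Hypothetical helpers illustrating the proposed (:invalid-map BASE)
;; idea.  Codes are plain integers, not character objects.
(defun valid-invalid-map-base-p (base)
  "True if BASE leaves room for 256 consecutive codes below CHAR-CODE-LIMIT."
  (and (integerp base)
       (<= 0 base (- char-code-limit 256))))

(defun map-invalid-octet (base octet)
  "Return the code standing in for an invalid OCTET."
  (check-type octet (integer 0 255))
  (+ base octet))

(defun unmap-invalid-code (base code)
  "Return the original octet if CODE lies in the mapped range, else NIL."
  (let ((octet (- code base)))
    (if (<= 0 octet 255)
        octet
        nil)))
```

With a base of #xDC00, the invalid octet #xE9 would be stored as
#xDCE9, and with a base of 0 it would simply stay #xE9, which is the
iso-8859-1 interpretation mentioned above.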
> Finally, another option to :input-error-action could be to take a
>
>     (function (input-stream octet-vector)
>               (values &optional (or character string)))
>
> octet-vector being a vector of invalid bytes read so far from
> the stream. The function could further read bytes, and either build and
> return a character or string, return no value (ignoring the read bytes),
> or signal an error.
>
> :output-invalid-map could take a
> (function (output-stream unencodable-character))
> that could do whatever it wants to encode, replace or ignore the
> unencodable-character on the output-stream.
>
>
>
> So I guess :utf-8b could be defined as designating:
>
> (make-encoding :charset :utf-8 :input-error-action '(:invalid-map 0))
>
> or equivalently:
>
> (make-encoding :charset :utf-8
>                :input-error-action (lambda (stream octet-vector)
>                                      (map 'string (function code-char) octet-vector)))
This is interesting, as it offers an alternative to conditions for
custom handling, while still allowing conditions for legacy code (or
for code that is better structured around them).
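
Here is how I picture the function variant, as a simplified sketch
only. decode-ascii-with is a hypothetical name, ASCII again stands in
for UTF-8, and the handler receives just the vector of invalid octets:
I left out the stream argument of the proposal because standard Common
Lisp has no in-memory octet streams to demonstrate it with.

```lisp
;; Hypothetical decoder: HANDLER is called with each run of invalid
;; octets and may return a character, a string, or nothing (skip).
;; A handler that wants the condition system can simply call ERROR.
(defun decode-ascii-with (handler octets)
  (with-output-to-string (out)
    (let ((i 0)
          (n (length octets)))
      (loop while (< i n)
            do (if (< (aref octets i) 128)
                   (progn
                     (write-char (code-char (aref octets i)) out)
                     (incf i))
                   ;; Collect the whole run of invalid octets.
                   (let* ((end (or (position-if (lambda (o) (< o 128))
                                                octets :start i)
                                   n))
                          (result (funcall handler (subseq octets i end))))
                     (etypecase result
                       (null)            ; no value returned: skip the run
                       (character (write-char result out))
                       (string (write-string result out)))
                     (setf i end)))))))

;; Handler equivalent to the lambda quoted above: reinterpret the
;; invalid octets as iso-8859-1 (assumes CHAR-CODE-LIMIT >= 256).
(defun latin-1-handler (octets)
  (map 'string #'code-char octets))
```

Passing #'latin-1-handler gives the lossless behaviour, a handler that
returns (values) gives the :ignore behaviour, and a handler that calls
error hands control back to the condition system.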
--
Matt