[Ecls-list] Unicode: uncomfortable situation

Mon Jan 24 15:20:02 UTC 2011

On 1/24/11 9:01 AM, Pascal J. Bourguignon wrote:
> Matthew Mondor <mm_lists at pulsar-zone.net>
> writes:
>
>> On Sun, 23 Jan 2011 23:15:42 -0500
>> Matthew Mondor <mm_lists at pulsar-zone.net> wrote:
>>
>>> On Sun, 23 Jan 2011 22:52:36 -0500
>>> Matthew Mondor <mm_lists at pulsar-zone.net> wrote:
>>>
>>>> With ECL, the invalid sequence is already consumed when a more generic
>>>> error occurs (I forgot which, but could check my CVS logs on request),
>>>> which only allowed me to either ignore that invalid sequence or to
>>>> replace it with a placeholder Unicode character (U+241A or U+FFFD).  If
>>>> the output must remain unmodified, there is no other way than to use
>>>> bytes or 8-bit clean characters at the moment.  At least a year ago or
>>>> more, we discussed this situation a bit on this list, yet I've not
>>>> looked into it again since.  Perhaps this would be a good time to
>>>> resume this work.
>>> Attached is an example CUSTOM-READ-LINE showing the difference.
>> I guess that another possibility (which could offer a less complex and
>> more efficient interface than the SBCL way) would be for ECL to
>> automatically and transparently return the octets of invalid UTF-8
>> sequences remapped into an unassigned range, and on UTF-8 output
>> transparently convert characters of that range back to the literal octets,
>> while still letting the application potentially deal with that range of
>> "characters" as it wants, as long as it's documented...
>>
>> Is anyone aware of the steps taken by other implementations than ECL or
>> SBCL when dealing with invalid UTF-8 sequences?  Perhaps another
>> implementation already does this, in which case ECL wouldn't be totally
>> unique if it chose that solution.
> Clisp has a slot in its encoding structure to deal with input errors
> (and also another for output errors).  You can specify to ignore the
> wrong codes, to substitute them by a (unique) character, or to signal an
> error.  The condition contains all the information, but in private
> slots, and there's only a restart to continue reading the input, not to
> give a substitute.
>
>
> So the lesson would be:
>
> 1- give a public API to get access to the condition attributes.
> 2- provide featureful restarts.
>
> http://clisp.sourceforge.net/impnotes/encoding.html#make-encoding
>
> CL-USER> (with-open-file (text "/tmp/test.data" 
>                            :external-format (ext:make-encoding
>                                                :charset charset:utf-8 
>                                                :input-error-action :error))
>            (read-line text) (read-line text))

FWIW, CMUCL has something similar:

(with-open-file (s "/tmp/test.data" :external-format :utf8
                   :decoding-error t)
  (read-line s))

Error in function "DEFUN MAKE-FD-STREAM":
   Invalid utf8 octet #x74 at offset 1
   [Condition of type SIMPLE-ERROR]

Restarts:
 0: [CONTINUE] Use Unicode replacement character instead
 1: [RETRY] Retry SLIME REPL evaluation request.
 2: [*ABORT] Return to SLIME's top level.
 3: [ABORT] Return to Top-Level.
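
To make Pascal's two points concrete (public readers on the condition,
plus a restart that accepts a substitute), here is a minimal portable
sketch. It is not CMUCL's or CLISP's actual implementation: the
condition OCTET-DECODING-ERROR, its readers and the toy DECODE-ASCII
function are all hypothetical names made up for illustration, and the
"decoder" only accepts 7-bit octets to keep the example short.

```lisp
;; Hypothetical condition exposing its attributes through public readers.
(define-condition octet-decoding-error (error)
  ((octet  :initarg :octet  :reader decoding-error-octet)
   (offset :initarg :offset :reader decoding-error-offset))
  (:report (lambda (condition stream)
             (format stream "Invalid octet #x~2,'0X at offset ~D"
                     (decoding-error-octet condition)
                     (decoding-error-offset condition)))))

(defun decode-ascii (octets)
  "Toy decoder: accepts 7-bit octets only and signals on anything else."
  (with-output-to-string (out)
    (loop for octet across octets
          for offset from 0
          do (write-char
              (if (< octet #x80)
                  (code-char octet)
                  ;; The value returned by the chosen restart becomes
                  ;; the character written in place of the bad octet.
                  (restart-case
                      (error 'octet-decoding-error
                             :octet octet :offset offset)
                    (continue ()
                      :report "Use a question mark instead."
                      #\?)
                    (use-value (char)
                      :report "Supply a replacement character."
                      char)))
              out))))

;; A handler can inspect the condition and pick the substitute itself.
(handler-bind ((octet-decoding-error
                 (lambda (c)
                   (use-value (if (= (decoding-error-octet c) #xFF) #\! #\?)
                              c))))
  (decode-ascii (vector 104 105 255)))   ; => "hi!"
```

With both restarts available, the caller decides per error whether to
take the default replacement (CONTINUE) or to supply its own
(USE-VALUE), without the decoder having to know about either policy.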

For :decoding-error, you can also specify a function to handle the
errors in the way that you want.
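
As for the remapping idea quoted earlier (undecodable octets mapped into
a reserved range on input and turned back into the same octets on
output), a rough portable sketch might look like the code below. The
assumptions are mine: the range U+E080..U+E0FF in the Private Use Area
is an arbitrary choice, the function names are hypothetical, and the
Lisp is assumed to have Unicode characters. Nothing here is an existing
ECL or CMUCL interface.

```lisp
(defparameter *escape-base* #xE000
  "Hypothetical base: undecodable octet B becomes code *ESCAPE-BASE* + B.")

(defun escape-code-p (code)
  "True if CODE lies in the range reserved for escaped octets (#x80..#xFF)."
  (<= (+ *escape-base* #x80) code (+ *escape-base* #xFF)))

(defun sequence-length (octet)
  "Expected UTF-8 sequence length for a lead OCTET, or NIL if it cannot lead."
  (cond ((< octet #x80) 1)
        ((<= #xC2 octet #xDF) 2)
        ((<= #xE0 octet #xEF) 3)
        ((<= #xF0 octet #xF4) 4)
        (t nil)))

(defun decode-at (octets start)
  "Return code point and length of the valid sequence at START, or NIL."
  (let* ((lead (aref octets start))
         (len (sequence-length lead)))
    (when (and len (<= (+ start len) (length octets)))
      (let ((code (if (= len 1)
                      lead
                      (logand lead (ash #xFF (- (1+ len)))))))
        (loop for i from (1+ start) below (+ start len)
              for octet = (aref octets i)
              do (if (= (logand octet #xC0) #x80)
                     (setf code (logior (ash code 6) (logand octet #x3F)))
                     (return-from decode-at nil)))
        ;; Reject overlong forms, surrogates, out-of-range values, and the
        ;; escape range itself, so that the round trip stays lossless.
        (when (and (>= code (aref #(0 0 #x80 #x800 #x10000) len))
                   (not (<= #xD800 code #xDFFF))
                   (<= code #x10FFFF)
                   (not (escape-code-p code)))
          (values code len))))))

(defun octets-to-string (octets)
  "Decode UTF-8 OCTETS, mapping each undecodable octet into the escape range."
  (with-output-to-string (out)
    (loop with i = 0
          while (< i (length octets))
          do (multiple-value-bind (code len) (decode-at octets i)
               (cond (code
                      (write-char (code-char code) out)
                      (incf i len))
                     (t
                      (write-char (code-char (+ *escape-base*
                                                (aref octets i)))
                                  out)
                      (incf i)))))))

(defun string-to-octets (string)
  "Encode STRING as UTF-8, turning escape-range characters back into octets."
  (let ((out (make-array (length string) :element-type '(unsigned-byte 8)
                                         :adjustable t :fill-pointer 0)))
    (flet ((emit (octet) (vector-push-extend octet out)))
      (loop for char across string
            for code = (char-code char)
            do (cond ((escape-code-p code)
                      (emit (- code *escape-base*)))
                     ((< code #x80) (emit code))
                     ((< code #x800)
                      (emit (logior #xC0 (ash code -6)))
                      (emit (logior #x80 (logand code #x3F))))
                     ((< code #x10000)
                      (emit (logior #xE0 (ash code -12)))
                      (emit (logior #x80 (logand (ash code -6) #x3F)))
                      (emit (logior #x80 (logand code #x3F))))
                     (t
                      (emit (logior #xF0 (ash code -18)))
                      (emit (logior #x80 (logand (ash code -12) #x3F)))
                      (emit (logior #x80 (logand (ash code -6) #x3F)))
                      (emit (logior #x80 (logand code #x3F)))))))
    out))

;; The stray #xFF survives a decode/encode round trip unchanged.
(let ((input (coerce '(#x61 #xFF #x62) '(vector (unsigned-byte 8)))))
  (equalp input (string-to-octets (octets-to-string input))))   ; => T
```

The price is that the application sees characters in that range which
are really raw octets, so the range would have to be documented, and
input that legitimately encodes those code points is treated as escaped
octets too (which keeps the output byte-identical).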

Ray