[Ecls-list] Unicode: uncomfortable situation
Raymond Toy
toy.raymond at gmail.com
Mon Jan 24 15:20:02 UTC 2011
On 1/24/11 9:01 AM, Pascal J. Bourguignon wrote:
> Matthew Mondor <mm_lists at pulsar-zone.net>
> writes:
>
>> On Sun, 23 Jan 2011 23:15:42 -0500
>> Matthew Mondor <mm_lists at pulsar-zone.net> wrote:
>>
>>> On Sun, 23 Jan 2011 22:52:36 -0500
>>> Matthew Mondor <mm_lists at pulsar-zone.net> wrote:
>>>
>>>> With ECL, the invalid sequence is already consumed when a more generic
>>>> error occurs (I forgot which, but could check my CVS logs on request),
>>>> which only allowed me to either ignore that invalid sequence or to
>>>> substitute it with a replacement Unicode character (0x241a or 0xfffd). If
>>>> the output must remain unmodified, there is no other way than to use
>>>> bytes or 8-bit clean characters at the moment. At least a year ago or
>>>> more, we discussed this situation a bit on this list, yet I've not
>>>> looked into it again since. Perhaps this would be a good time to
>>>> resume this work.
>>> Attached is an example CUSTOM-READ-LINE showing the difference.
>> I guess that another possibility (which could offer a less complex and
>> more efficient interface than the SBCL way) would be for ECL to
>> automatically and transparently return invalid UTF-8 sequence octets
>> remapped into an unassigned range, and on UTF-8 output transparently
>> convert characters of that range back to the literal octets,
>> while still letting the application potentially deal with that range of
>> "characters" as it wants, as long as it's documented...
>>
>> Is anyone aware of the steps taken by other implementations than ECL or
>> SBCL when dealing with invalid UTF-8 sequences? Perhaps another
>> implementation already does this, in which case ECL wouldn't be totally
>> unique if it chose that solution.
> Clisp has a slot in its encoding structure to deal with input errors
> (and also another for output errors). You can specify to ignore the
> wrong codes, to substitute them by a (unique) character, or to signal an
> error. The condition contains all the information, but in private
> slots, and there's only a restart to continue reading the input, not to
> give a substitute.
>
>
> So the lesson would be:
>
> 1- give a public API to get access to the condition attributes.
> 2- provide featureful restarts.
>
> http://clisp.sourceforge.net/impnotes/encoding.html#make-encoding
>
> CL-USER> (with-open-file (text "/tmp/test.data"
>                                :external-format (ext:make-encoding
>                                                  :charset charset:utf-8
>                                                  :input-error-action :error))
>            (read-line text) (read-line text))
FWIW, CMUCL has something similar:

(with-open-file (s "/tmp/test.data" :external-format :utf8
                                    :decoding-error t)
  (read-line s))

Error in function "DEFUN MAKE-FD-STREAM":
   Invalid utf8 octet #x74 at offset 1
   [Condition of type SIMPLE-ERROR]

Restarts:
 0: [CONTINUE] Use Unicode replacement character instead
 1: [RETRY] Retry SLIME REPL evaluation request.
 2: [*ABORT] Return to SLIME's top level.
 3: [ABORT] Return to Top-Level.

For :decoding-error, you can also specify a function to handle the
errors in the way that you want.
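
To make Pascal's two points concrete (public readers on the condition, and restarts that accept a substitute), here is a rough sketch in portable Common Lisp. It is not how ECL, CLISP, CMUCL, or SBCL implement their decoders. Every name in it (INVALID-UTF8-SEQUENCE, INVALID-UTF8-OCTETS, INVALID-UTF8-OFFSET, DECODE-ONE, DECODE-UTF8) is made up for illustration. It works on an octet vector rather than a stream, assumes a Lisp with Unicode characters, and does not reject overlong encodings.

```lisp
;; Illustrative sketch only; all names here are hypothetical and do not
;; belong to any implementation's API.

(define-condition invalid-utf8-sequence (error)
  ;; Public readers, so a handler can inspect what went wrong.
  ((octets :initarg :octets :reader invalid-utf8-octets)
   (offset :initarg :offset :reader invalid-utf8-offset))
  (:report (lambda (c stream)
             (format stream "Invalid UTF-8 octets ~S at offset ~D"
                     (invalid-utf8-octets c) (invalid-utf8-offset c)))))

(defun decode-one (octets start)
  "Return the code point and length of the sequence at START, or NIL if malformed."
  (let* ((lead (aref octets start))
         (len (cond ((< lead #x80) 1)
                    ((<= #xC2 lead #xDF) 2)
                    ((<= #xE0 lead #xEF) 3)
                    ((<= #xF0 lead #xF4) 4)))
         (end (and len (+ start len))))
    (when (and end
               (<= end (length octets))
               ;; Every trailing octet must look like 10xxxxxx.
               (loop for j from (1+ start) below end
                     always (= (logand (aref octets j) #xC0) #x80)))
      (let ((code (if (= len 1)
                      lead
                      (logand lead (ash #xFF (- (1+ len)))))))
        (loop for j from (1+ start) below end
              do (setf code (logior (ash code 6)
                                    (logand (aref octets j) #x3F))))
        ;; Reject surrogates and anything this Lisp cannot represent.
        (when (and (< code char-code-limit)
                   (not (<= #xD800 code #xDFFF)))
          (values code len))))))

(defun decode-utf8 (octets)
  "Decode the vector OCTETS to a string, signalling INVALID-UTF8-SEQUENCE
once per offending octet, with CONTINUE and USE-VALUE restarts available."
  (with-output-to-string (out)
    (let ((i 0))
      (loop while (< i (length octets))
            do (multiple-value-bind (code len) (decode-one octets i)
                 (cond (code
                        (write-char (code-char code) out)
                        (incf i len))
                       (t
                        (restart-case
                            (error 'invalid-utf8-sequence
                                   :octets (subseq octets i (1+ i))
                                   :offset i)
                          (continue ()
                            :report "Substitute U+FFFD."
                            (write-char (code-char #xFFFD) out))
                          (use-value (replacement)
                            :report "Substitute a caller-supplied string."
                            (write-string replacement out)))
                        ;; Only the offending octet is consumed.
                        (incf i))))))))

;; A handler can read the offending octet and choose its own substitute:
(handler-bind ((invalid-utf8-sequence
                 (lambda (c)
                   (use-value (format nil "<~2,'0X>"
                                      (aref (invalid-utf8-octets c) 0))
                              c))))
  (decode-utf8 #(72 105 255 33)))
;; => "Hi<FF>!"
```

Because the handler gets both the octets and a restart that takes a value, it could just as well remap the octet into some reserved range, which is roughly what Matthew was proposing, without the decoder having to hard-wire that policy.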
Ray