[Ecls-list] Unicode: uncomfortable situation

Mon Jan 24 14:01:30 UTC 2011

Matthew Mondor <mm_lists at pulsar-zone.net>
writes:

> On Sun, 23 Jan 2011 23:15:42 -0500
> Matthew Mondor <mm_lists at pulsar-zone.net> wrote:
>
>> On Sun, 23 Jan 2011 22:52:36 -0500
>> Matthew Mondor <mm_lists at pulsar-zone.net> wrote:
>> 
>> > With ECL, the invalid sequence is already consumed when a more generic
>> > error occurs (I forgot which, but could check my CVS logs on request),
>> > which only allowed me to either ignore that invalid sequence or to
>> > replace it with a substitute Unicode character (0x241a or 0xfffd).  If
>> > the output must remain unmodified, there is no other way than to use
>> > bytes or 8-bit clean characters at the moment.  At least a year ago or
>> > more, we discussed this situation a bit on this list, yet I've not
>> > looked into it again since.  Perhaps this would be a good time to
>> > resume this work.
>> 
>> Attached is an example CUSTOM-READ-LINE showing the difference.
>
> I guess that another possibility (which could offer a less complex and
> more efficient interface than the SBCL way) would be for ECL to
> automatically and transparently return invalid UTF-8 sequence octets as
> remapped into an unassigned range, and also at UTF-8 output transparently
> convert characters of that range back to the literal octets,
> while still letting the application potentially deal with that range of
> "characters" as it wants, as long as it's documented...
>
> Is anyone aware of the steps taken by implementations other than ECL or
> SBCL when dealing with invalid UTF-8 sequences?  Perhaps another
> implementation already does this, in which case ECL wouldn't be totally
> unique if it chose that solution.

CLISP has a slot in its encoding structure to deal with input errors
(and another for output errors).  You can choose to ignore the invalid
codes, to replace them with a (single) character, or to signal an
error.  The condition contains all the information, but in private
slots, and there's only a restart to continue reading the input, not
one to supply a substitute.

So the lesson would be:

1- give a public API to get access to the condition attributes.
2- provide featureful restarts.
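
As a rough illustration of those two points, here is a minimal sketch in
portable Common Lisp.  Every name in it (DECODING-ERROR, its readers,
DECODE-INVALID-OCTETS, DEMO) is hypothetical and exists in neither ECL nor
CLISP; it only shows the shape such an interface could take: exported
readers on the condition, plus CONTINUE and USE-VALUE restarts.

```lisp
;;; Hypothetical interface sketch; not the API of ECL, CLISP or SBCL.

(define-condition decoding-error (error)
  ;; Public readers give handlers access to the offending octets
  ;; and to the external format that rejected them.
  ((octets :initarg :octets :reader decoding-error-octets)
   (external-format :initarg :external-format
                    :reader decoding-error-external-format))
  (:report (lambda (c stream)
             (format stream
                     "Invalid byte sequence~{ #x~2,'0X~} in ~S conversion"
                     (coerce (decoding-error-octets c) 'list)
                     (decoding-error-external-format c)))))

(defun decode-invalid-octets (octets external-format)
  "Signal DECODING-ERROR for OCTETS and return the list of characters
chosen by whichever restart the handler invokes."
  (restart-case (error 'decoding-error
                       :octets octets
                       :external-format external-format)
    (continue ()
      :report "Skip the invalid octets and continue reading."
      '())
    (use-value (string)
      :report "Supply replacement characters."
      :interactive (lambda ()
                     (format *query-io* "Replacement string: ")
                     (list (read-line *query-io*)))
      (coerce string 'list))))

(defun demo ()
  "Handle the error by reinterpreting each invalid octet as the
character with the same code (i.e. as ISO-8859-1)."
  (handler-bind ((decoding-error
                   (lambda (c)
                     (use-value (map 'string #'code-char
                                     (decoding-error-octets c))))))
    (decode-invalid-octets #(#xE9 #x74 #xE9) :utf-8)))
```

A real decoder would call something like DECODE-INVALID-OCTETS from inside
its read loop, so that the handler decides per invalid sequence whether to
drop it, replace it, or let the error propagate.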

http://clisp.sourceforge.net/impnotes/encoding.html#make-encoding

CL-USER> (with-open-file (text "/tmp/test.data" 
                           :external-format (ext:make-encoding
                                               :charset charset:utf-8 
                                               :input-error-action :error))
           (read-line text) (read-line text))

*** - READ-LINE: Invalid byte sequence #xE9 #x74 #xE9 in CHARSET:UTF-8 conversion
The following restarts are available:
RETRY          :R1      Retry SLIME REPL evaluation request.
PROCESS-INPUT  :R2      Continue reading input.
ABORT          :R3      Return to SLIME's top level.
CLOSE-CONNECTION :R4    Close SLIME connection.
ABORT          :R5      Abort main loop
C/Break 1 USER[8]> :i

#<EXT:SIMPLE-CHARSET-TYPE-ERROR #x000334353628>:  standard object
 type: EXT:SIMPLE-CHARSET-TYPE-ERROR
0 [$DATUM]:  #<ARRAY (UNSIGNED-BYTE 8) (3) #x0003343535C8>
1 [$EXPECTED-TYPE]:  #<ENCODING CHARSET:UTF-8 :UNIX>
2 [$FORMAT-CONTROL]:  
"~S: Invalid byte sequence #x~A~A #x~A~A #x~A~A in ~S conversion
"
3 [$FORMAT-ARGUMENTS]:  (READ-LINE #\E #\9 #\7 #\4 #\E #\9 CHARSET:UTF-8)
INSPECT-- type :h for help; :q to return to the REPL ---> 
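
For comparison, the remapping idea quoted above (invalid octets mapped into a
reserved range on input and written back verbatim on output) can be sketched
as follows.  This is only an illustration under stated assumptions: the base
value #x10FE00 (in a private use area) and all function names are made up
here, and the code works on integer code points rather than characters so
that it does not depend on any implementation's CHAR-CODE-LIMIT.

```lisp
;;; Hypothetical sketch of round-tripping invalid UTF-8 octets.

(defparameter *escape-base* #x10FE00
  "Assumed base: an undecodable octet #xNN becomes *ESCAPE-BASE* + #xNN.")

(defun escaped-p (code)
  (<= (+ *escape-base* #x80) code (+ *escape-base* #xFF)))

(defun sequence-length (lead)
  "Expected UTF-8 sequence length for LEAD, or NIL if it cannot start one."
  (cond ((< lead #x80) 1)
        ((<= #xC2 lead #xDF) 2)
        ((<= #xE0 lead #xEF) 3)
        ((<= #xF0 lead #xF4) 4)))

(defun decode-one (octets i)
  "Return code point and length of the well-formed sequence at I, or NIL."
  (let* ((lead (aref octets i))
         (len (sequence-length lead)))
    (when (and len (<= (+ i len) (length octets)))
      (let ((code (if (= len 1) lead (logand lead (ash #xFF (- (1+ len)))))))
        (loop for j from (1+ i) below (+ i len)
              for octet = (aref octets j)
              do (if (= (logand octet #xC0) #x80)
                     (setf code (logior (ash code 6) (logand octet #x3F)))
                     (return-from decode-one nil)))
        ;; Reject overlong forms, surrogates, out-of-range values and the
        ;; escape range itself, so that encoding back stays unambiguous.
        (when (and (>= code (aref #(0 #x80 #x800 #x10000) (1- len)))
                   (not (<= #xD800 code #xDFFF))
                   (<= code #x10FFFF)
                   (not (escaped-p code)))
          (values code len))))))

(defun decode-utf-8 (octets)
  "Decode the vector OCTETS into a list of code points; each octet that is
not part of a well-formed sequence is remapped into the escape range."
  (loop with i = 0
        while (< i (length octets))
        collect (multiple-value-bind (code len) (decode-one octets i)
                  (cond (code (incf i len) code)
                        (t (prog1 (+ *escape-base* (aref octets i))
                             (incf i)))))))

(defun encode-utf-8 (codes)
  "Encode the list CODES; escaped code points are written as the raw octet."
  (let ((out '()))
    (dolist (code codes (coerce (nreverse out) '(vector (unsigned-byte 8))))
      (cond ((escaped-p code) (push (- code *escape-base*) out))
            ((< code #x80) (push code out))
            (t (let ((len (cond ((< code #x800) 2) ((< code #x10000) 3) (t 4))))
                 ;; Lead octet, then LEN - 1 continuation octets.
                 (push (logior (aref #(#xC0 #xE0 #xF0) (- len 2))
                               (ash code (* -6 (1- len))))
                       out)
                 (loop for shift from (* -6 (- len 2)) to 0 by 6
                       do (push (logior #x80 (logand (ash code shift) #x3F))
                                out))))))))
```

With this sketch, (encode-utf-8 (decode-utf-8 octets)) returns the original
octets even for input such as #xE9 #x74 #xE9, while the application still
sees the remapped values and can treat them as it wishes.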

-- 
__Pascal Bourguignon__                     http://www.informatimago.com/
A bad day in () is better than a good day in {}.