[Ecls-list] Unicode: uncomfortable situation
Pascal J. Bourguignon
pjb at informatimago.com
Mon Jan 24 14:01:30 UTC 2011
Matthew Mondor <mm_lists at pulsar-zone.net>
writes:
> On Sun, 23 Jan 2011 23:15:42 -0500
> Matthew Mondor <mm_lists at pulsar-zone.net> wrote:
>
>> On Sun, 23 Jan 2011 22:52:36 -0500
>> Matthew Mondor <mm_lists at pulsar-zone.net> wrote:
>>
>> > With ECL, the invalid sequence is already consumed when a more generic
>> > error occurs (I forgot which, but could check my CVS logs on request),
>> > which only allowed me to either ignore that invalid sequence or to
>> > replace it with a substitution character (0x241a or 0xfffd). If
>> > the output must remain unmodified, there is no other way than to use
>> > bytes or 8-bit clean characters at the moment. At least a year ago or
>> > more, we discussed this situation a bit on this list, yet I've not
>> > looked into it again since. Perhaps this would be a good time to
>> > resume this work.
>>
>> Attached is an example CUSTOM-READ-LINE showing the difference.
>
> I guess that another possibility (which could offer a less complex and
> more efficient interface than the SBCL way) would be for ECL to
> automatically and transparently return invalid UTF-8 sequence octets as
> remapped into an unassigned range, and on UTF-8 output to transparently
> write characters of that range back out as the literal octets,
> while still letting the application potentially deal with that range of
> "characters" as it wants, as long as it's documented...
>
> Is anyone aware of the steps taken by other implementations than ECL or
> SBCL when dealing with invalid UTF-8 sequences? Perhaps another
> implementation already does this, in which case ECL wouldn't be totally
> unique if it chose that solution.
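To make the remapping idea above concrete, here is a minimal sketch in
portable Common Lisp. It is not what ECL, SBCL or CLISP do. It works on
integer code points rather than characters, so it does not depend on what
an implementation accepts as a character. All names are hypothetical, and
the remapping base #xDC00 is an assumption borrowed from Python's
"surrogateescape" handler (octet B becomes code point #xDC00 + B), not
something proposed in this thread.

```lisp
(defparameter *escape-base* #xDC00
  "Assumed remapping base: an invalid octet B is returned as (+ *escape-base* B).")

(defun utf-8-sequence-length (octet)
  "Length of a UTF-8 sequence starting with OCTET, or NIL if OCTET cannot start one."
  (cond ((< octet #x80) 1)
        ((<= #xC2 octet #xDF) 2)
        ((<= #xE0 octet #xEF) 3)
        ((<= #xF0 octet #xF4) 4)
        (t nil)))

(defun decode-at (octets start)
  "Return code point and length of the valid sequence at START, or NIL."
  (let* ((lead (aref octets start))
         (len (utf-8-sequence-length lead)))
    (when (and len (<= (+ start len) (length octets)))
      (if (= len 1)
          (values lead 1)
          ;; Keep only the payload bits of the lead octet.
          (let ((code (logand lead (ash #xFF (- (1+ len))))))
            (loop for i from (1+ start) below (+ start len)
                  for octet = (aref octets i)
                  do (unless (= (logand octet #xC0) #x80)
                       (return-from decode-at nil))
                     (setf code (logior (ash code 6) (logand octet #x3F))))
            ;; Reject overlong forms, surrogates and values above U+10FFFF.
            (when (and (>= code (aref #(0 0 #x80 #x800 #x10000) len))
                       (<= code #x10FFFF)
                       (not (<= #xD800 code #xDFFF)))
              (values code len)))))))

(defun decode-lenient (octets)
  "Decode OCTETS to a list of code points, remapping each invalid octet."
  (loop with i = 0
        while (< i (length octets))
        collect (multiple-value-bind (code len) (decode-at octets i)
                  (cond (code (incf i len) code)
                        (t (prog1 (+ *escape-base* (aref octets i))
                             (incf i)))))))

(defun encode-lenient (codes)
  "Encode CODES as UTF-8, writing remapped code points back as raw octets."
  (let ((out '()))
    (flet ((emit (&rest octets) (dolist (o octets) (push o out)))
           (tail (code shift)
             (logior #x80 (logand (ash code (- shift)) #x3F))))
      (dolist (code codes)
        (cond ((<= (+ *escape-base* #x80) code (+ *escape-base* #xFF))
               (emit (- code *escape-base*)))
              ((< code #x80) (emit code))
              ((< code #x800)
               (emit (logior #xC0 (ash code -6)) (tail code 0)))
              ((< code #x10000)
               (emit (logior #xE0 (ash code -12)) (tail code 6) (tail code 0)))
              (t
               (emit (logior #xF0 (ash code -18))
                     (tail code 12) (tail code 6) (tail code 0))))))
    (coerce (nreverse out) '(vector (unsigned-byte 8)))))
```

With the octets #x61 #xE9 #x74 #xE9 #x0A (Latin-1 text read as UTF-8),
DECODE-LENIENT returns the two #xE9 octets as #xDCE9, and ENCODE-LENIENT
on that result gives back the original octets. Because the decoder
rejects encoded surrogates, a remapped code point can only come from an
invalid octet, which is what keeps the round trip unambiguous in this
sketch.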
CLISP has a slot in its encoding structure that controls how input errors
are handled (and another one for output errors). You can choose to ignore
the invalid codes, to replace them with a (single) character, or to signal
an error. The condition contains all the information, but in private
slots, and the only restart offered continues reading the input; there is
none to supply a substitute.
So the lesson would be:
1- give a public API to get access to the condition attributes.
2- provide richer restarts (for example, one that supplies a substitute).
http://clisp.sourceforge.net/impnotes/encoding.html#make-encoding
CL-USER> (with-open-file (text "/tmp/test.data"
                               :external-format (ext:make-encoding
                                                 :charset charset:utf-8
                                                 :input-error-action :error))
           (read-line text) (read-line text))
*** - READ-LINE: Invalid byte sequence #xE9 #x74 #xE9 in CHARSET:UTF-8 conversion
The following restarts are available:
RETRY :R1 Retry SLIME REPL evaluation request.
PROCESS-INPUT :R2 Continue reading input.
ABORT :R3 Return to SLIME's top level.
CLOSE-CONNECTION :R4 Close SLIME connection.
ABORT :R5 Abort main loop
C/Break 1 USER[8]> :i
#<EXT:SIMPLE-CHARSET-TYPE-ERROR #x000334353628>: standard object
type: EXT:SIMPLE-CHARSET-TYPE-ERROR
0 [$DATUM]: #<ARRAY (UNSIGNED-BYTE 8) (3) #x0003343535C8>
1 [$EXPECTED-TYPE]: #<ENCODING CHARSET:UTF-8 :UNIX>
2 [$FORMAT-CONTROL]:
"~S: Invalid byte sequence #x~A~A #x~A~A #x~A~A in ~S conversion
"
3 [$FORMAT-ARGUMENTS]: (READ-LINE #\E #\9 #\7 #\4 #\E #\9 CHARSET:UTF-8)
INSPECT-- type :h for help; :q to return to the REPL --->
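The two points of the lesson above could look roughly like this in
portable Common Lisp. This is only a sketch: DECODING-ERROR, its readers
and SIGNAL-DECODING-ERROR are hypothetical names, not part of CLISP or
ECL, and a real decoder would call something like SIGNAL-DECODING-ERROR
from inside its input routine.

```lisp
;; Point 1: the condition exposes its attributes through public readers.
(define-condition decoding-error (error)
  ((octets :initarg :octets :reader decoding-error-octets)
   (encoding :initarg :encoding :reader decoding-error-encoding))
  (:report (lambda (condition stream)
             (format stream
                     "Invalid byte sequence ~{#x~2,'0X~^ ~} in ~A conversion"
                     (coerce (decoding-error-octets condition) 'list)
                     (decoding-error-encoding condition)))))

;; Point 2: besides "continue", offer a restart that takes a substitute.
(defun signal-decoding-error (octets encoding)
  "Signal DECODING-ERROR; return the string chosen through a restart."
  (restart-case (error 'decoding-error :octets octets :encoding encoding)
    (use-value (replacement)
      :report "Supply a string to use in place of the invalid octets."
      :interactive (lambda ()
                     (format *query-io* "Replacement: ")
                     (list (read-line *query-io*)))
      replacement)
    (continue ()
      :report "Drop the invalid octets and continue reading."
      "")))

;; A handler can now inspect the offending octets and decide what to
;; substitute; here it reinterprets them as ISO-8859-1.
(defun decode-with-latin-1-fallback (octets)
  (handler-bind ((decoding-error
                   (lambda (condition)
                     (use-value (map 'string #'code-char
                                     (decoding-error-octets condition))
                                condition))))
    (signal-decoding-error octets :utf-8)))
```

Called on the three octets from the transcript, #(#xE9 #x74 #xE9),
DECODE-WITH-LATIN-1-FALLBACK returns a three-character string whose
character codes are 233, 116 and 233. The handler needs no access to
private slots, and the choice of substitute stays with the application.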
--
__Pascal Bourguignon__ http://www.informatimago.com/
A bad day in () is better than a good day in {}.