[Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]

Wed Feb 16 08:10:59 UTC 2011

On Sun, 13 Feb 2011 04:41:27 -0500
Matthew Mondor <mm_lists at pulsar-zone.net> wrote:

> Yes I think that supporting that encoding would be very easy too.  The
> only possibly tricky part is that users of that encoding may need to
> output a more conventional UTF-8 stream to some streams, such as for
> display, possibly with bad sequences converted to LATIN-1.  But a
> program could read data from a UTF-8B external format stream and write
> it back to another UTF-8B stream and be sure that the original data was
> transparently copied as-is, and not be bothered with decoding/encoding
> errors on streams with that external format.
> 
> I'm not sure if ECL should itself treat those invalid octets
> transparently as LATIN-1 if doing the output on a UTF-8
> external-format stream, however.  It's possible that without this some
> problems occur in the debugger, slime, etc, which would be presented
> with invalid UTF-8 characters in the UTF-16 surrogate range.

So I had some time tonight and wanted to write a test implementation.

However, there is indeed a problem at decoding time: several octets in a
row might be invalid, in which case several UTF-16 surrogates must be
used to represent the individual literal octets.
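
To make the problem concrete, here is a rough sketch of the mapping on a
plain in-memory buffer.  It is not ECL code: utf8b_escape() and
utf8b_decode() are hypothetical names, and the choice of U+DC80..U+DCFF
(0xDC00 plus the octet) for invalid octets is the usual UTF-8B
convention, assumed here.  Each invalid octet yields its own surrogate,
so two bad octets in a row yield two code points.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of ECL): map one invalid octet to a
 * lone low surrogate, assuming the UTF-8B convention
 * 0x80..0xFF -> U+DC80..U+DCFF. */
static uint32_t utf8b_escape(uint8_t octet)
{
    return 0xDC00u + octet;
}

/* Decode len octets from in[] into out[] and return the number of code
 * points written.  out[] must have room for at least len entries. */
static size_t utf8b_decode(const uint8_t *in, size_t len, uint32_t *out)
{
    size_t i = 0, n = 0, k, need;
    uint32_t cp, min;
    int ok;

    assert(in != NULL && out != NULL);
    while (i < len) {
        uint8_t b = in[i];
        if (b < 0x80) { out[n++] = b; i++; continue; }
        if ((b & 0xE0) == 0xC0)      { need = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
        else { out[n++] = utf8b_escape(b); i++; continue; }

        /* All continuation octets must be present and well formed. */
        ok = (i + need < len);
        for (k = 1; ok && k <= need; k++) {
            if ((in[i + k] & 0xC0) != 0x80) ok = 0;
            else cp = (cp << 6) | (uint32_t)(in[i + k] & 0x3F);
        }
        /* Reject overlong forms, surrogates and values past U+10FFFF. */
        if (ok && cp >= min && cp <= 0x10FFFF &&
            !(cp >= 0xD800 && cp <= 0xDFFF)) {
            out[n++] = cp;
            i += need + 1;
        } else {
            /* Escape only the lead octet; the octets after it are
             * examined again on the next iterations. */
            out[n++] = utf8b_escape(b);
            i++;
        }
    }
    return n;
}
```

This is easy on a buffer because the decoder can simply step back to
the octet after the bad one; a stream reader cannot do that without
some form of push-back.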

You said that the streams lacked push/pop buffers, and it seems that at
least a minimal one would be necessary to implement this (i.e. instead
of signaling an error with the octets, the decoding_error() function
would insert them into the push-buffer if the stream has a UTF-8B
external format).  The stream reading routine would then have to first
issue the contents of that buffer before processing more characters...
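
A minimal sketch of what such a buffer could look like, on a toy
in-memory stream rather than ECL's real stream structure.  Everything
here (struct sketch_stream, next_octet(), unread_octet(),
sketch_read_char()) is a hypothetical name for illustration.  One
assumption differs slightly from the description above: the octets
consumed after a bad lead octet are pushed back raw and decoded again,
rather than emitted directly, because one of them may be plain ASCII or
the start of a valid sequence.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_EOF 0xFFFFFFFFu

/* Toy stream for illustration only; ECL's stream struct differs. */
struct sketch_stream {
    const uint8_t *data;
    size_t len, pos;
    uint8_t pushback[3];   /* at most 3 octets follow a lead octet */
    size_t npush;          /* used as a LIFO stack */
};

/* Return the next octet, draining the push-back buffer first. */
static int next_octet(struct sketch_stream *s)
{
    if (s->npush > 0) return s->pushback[--s->npush];
    if (s->pos < s->len) return s->data[s->pos++];
    return -1;
}

static void unread_octet(struct sketch_stream *s, uint8_t octet)
{
    assert(s->npush < sizeof s->pushback);
    s->pushback[s->npush++] = octet;
}

/* Read one character; invalid octets come back as U+DC80..U+DCFF
 * (assumed UTF-8B convention: 0xDC00 plus the octet). */
static uint32_t sketch_read_char(struct sketch_stream *s)
{
    uint8_t tail[3];
    size_t need, got = 0;
    uint32_t cp, min;
    int ok = 1, b = next_octet(s);

    if (b < 0) return SKETCH_EOF;
    if (b < 0x80) return (uint32_t)b;
    if ((b & 0xE0) == 0xC0)      { need = 1; cp = b & 0x1F; min = 0x80; }
    else if ((b & 0xF0) == 0xE0) { need = 2; cp = b & 0x0F; min = 0x800; }
    else if ((b & 0xF8) == 0xF0) { need = 3; cp = b & 0x07; min = 0x10000; }
    else return 0xDC00u + (uint32_t)b;

    while (ok && got < need) {
        int c = next_octet(s);
        if (c < 0) { ok = 0; break; }          /* truncated sequence */
        tail[got++] = (uint8_t)c;
        if ((c & 0xC0) != 0x80) ok = 0;        /* not a continuation */
        else cp = (cp << 6) | (uint32_t)(c & 0x3F);
    }
    if (ok && cp >= min && cp <= 0x10FFFF &&
        !(cp >= 0xD800 && cp <= 0xDFFF))
        return cp;

    /* Decoding error: push the consumed octets back in reverse order so
     * that the first one consumed is the first one read again, then
     * return the escaped lead octet. */
    while (got > 0) unread_octet(s, tail[--got]);
    return 0xDC00u + (uint32_t)b;
}
```

With the input E2 82 41 this returns U+DCE2, U+DC82 and then 'A', so no
octet is lost even though the error is only detected at the third one.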

My first idea was to replace the ecl_read_byte8(stream, buffer+1,
nbytes) < nbytes call with one-byte reads in the following loop,
but that would still lose bytes if the second, third, etc. octet was
invalid, and could only really return a character for that last one.

Attached is the attempt, but it's by no means complete.
Thanks and good night,
-- 
Matt
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: utf8b-partial-diff.txt
URL: <https://mailman.common-lisp.net/pipermail/ecl-devel/attachments/20110216/51868c8b/attachment.txt>