[Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]

Sun Feb 13 01:14:26 UTC 2011

On Sat, 12 Feb 2011 20:01:17 -0500
Matthew Mondor <mm_lists at pulsar-zone.net> wrote:

> On Sat, 12 Feb 2011 23:49:14 +0100
> Juan Jose Garcia-Ripoll <juanjose.garciaripoll at googlemail.com> wrote:
> 
> > I did not realize that from your previous email. This is fixed now (trivial
> > typo in utf_8_decoder)
> 
> I tested it, and the conversion of invalid UTF-8 bytes to LATIN-1
> works fine.
> 
> I also tested my earlier suggestion about a way to preserve invalid
> input intact on output, which Andy Hefner later referred to as
> "UTF-8B", and it seems possible.
> 
> The remaining problems with UTF-8B are that it requires support in the
> UTF-8 encoder, because those bytes should be output as-is.  Moreover,
> this may break things if the UTF-8 decoder does not transparently
> support the same conversion on input, because the implementation
> would then not be able to read what it can write.  I got bitten by a
> stuck slime several times when decoding/encoding errors occurred and I
> was not careful enough.  When outputting a stream with invalid
> characters in UTF-8 mode (i.e. printing #xDCxx range characters without
> passing them through the latin-1 conversion function), it seems that
> slime could not catch the decoding error :)
> 
> But UTF-8B could in fact be considered another encoding, and if I
> really need it I might eventually send patches to have it optionally
> available.  The advantages of this mode were mentioned previously
> on this list in the earlier "Unicode: uncomfortable situation" thread.
> Attached is the attempt nevertheless.  It works fine except for
> literal output.

Oh, it seems that UTF-8B is at least planned for SBCL too (not sure if
it already has it), according to
http://sbcl10.sbcl.org/materials/crhodes/unicode-lt.pdf
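
For comparison, here is a minimal sketch of the round trip described
above, written in Python only because its "surrogateescape" error
handler (PEP 383) maps each undecodable byte #xNN to the code point
#xDCNN, which is the same idea as UTF-8B.  The function names are made
up for illustration and this is not the ECL patch; it only shows why
the encoder and the decoder both need to know about the mapping, and
what the LATIN-1 fallback amounts to.

```python
def decode_utf8b(data: bytes) -> str:
    """Decode UTF-8; each undecodable byte 0xNN becomes the code point U+DCNN."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_utf8b(text: str) -> bytes:
    """Encode UTF-8; each U+DC80..U+DCFF code point is written back as one raw byte."""
    return text.encode("utf-8", errors="surrogateescape")


def escaped_to_latin1(text: str) -> str:
    """Reinterpret escaped bytes (U+DC80..U+DCFF) as LATIN-1 characters."""
    return "".join(
        chr(ord(c) - 0xDC00) if 0xDC80 <= ord(c) <= 0xDCFF else c
        for c in text
    )


# A stray LATIN-1 byte (0xE9) followed later by a valid UTF-8 sequence.
raw = b"caf\xe9 ok \xc3\xa9"
text = decode_utf8b(raw)

# The invalid byte is preserved as a #xDCxx code point ...
assert text[3] == "\udce9"
# ... so the original bytes can be written back unchanged ...
assert encode_utf8b(text) == raw
# ... or converted to LATIN-1 characters before normal output.
assert escaped_to_latin1(text) == "caf\xe9 ok \xe9"

# An encoder that does not know about the mapping rejects the string,
# which is the "cannot write what it read" failure mode.
try:
    text.encode("utf-8")
    strict_encoder_failed = False
except UnicodeEncodeError:
    strict_encoder_failed = True
assert strict_encoder_failed
```

The last step is the interesting one: with a strict encoder on the
output side, the #xDCxx characters have to be converted first (for
instance to LATIN-1) or the write fails.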
-- 
Matt