[Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]

Matthew Mondor mm_lists at pulsar-zone.net
Sun Feb 13 01:01:17 UTC 2011


On Sat, 12 Feb 2011 23:49:14 +0100
Juan Jose Garcia-Ripoll <juanjose.garciaripoll at googlemail.com> wrote:

> I did not realize that from your previous email. This is fixed now (trivial
> typo in utf_8_decoder)

I tested and it works fine for invalid UTF-8 bytes to LATIN-1
conversion.
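
For illustration only, here is a minimal sketch of that kind of fallback, written in Python rather than Lisp. It is not ECL's code or the attached file; the handler name `latin1-fallback` and the function `decode_lenient` are hypothetical. Each byte that cannot be decoded as UTF-8 is replaced by the LATIN-1 character with the same code point.

```python
import codecs


def latin1_fallback(err):
    """Decode error handler: reinterpret the offending bytes as LATIN-1."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    # Return the replacement text and the position where decoding resumes.
    return bad.decode("latin-1"), err.end


# "latin1-fallback" is a made-up name used only in this sketch.
codecs.register_error("latin1-fallback", latin1_fallback)


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, mapping invalid bytes to LATIN-1 characters."""
    return data.decode("utf-8", errors="latin1-fallback")
```

Note that this conversion is lossy: once decoded, a converted byte cannot be told apart from a genuine LATIN-1 character that was validly encoded in the input.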

I also tested my earlier suggestion for preserving invalid input intact
on output, which Andy Hefner later referred to as "UTF-8B", and it seems
possible.

The remaining problem with UTF-8B is that it requires support in the
UTF-8 encoder, because those bytes should be output as-is.  Moreover,
this may break things if the UTF-8 decoder does not transparently
support the same input conversion, because the implementation would
then be unable to read what it can write.  I got bitten by a stuck
slime several times when decoding/encoding errors occurred: if I was
not careful and output a stream containing invalid characters in UTF-8
mode (i.e. printed #xDCxx range characters without passing them through
the latin-1 conversion function), it seems that slime could not catch
the decoding error :)
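
The round-trip requirement can be sketched outside of ECL. Python's standard `surrogateescape` error handler follows the same idea as UTF-8B: an undecodable byte #xNN becomes the code point #xDCNN on input, and the encoder turns it back into the original byte on output. The function names below are hypothetical, and this is an illustration under those assumptions, not the attached patch.

```python
def decode_utf8b(data: bytes) -> str:
    """Decode UTF-8, keeping each invalid byte as a #xDCxx code point."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_utf8b(text: str) -> bytes:
    """Encode to UTF-8, writing #xDCxx code points back as raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


def strict_encoder_accepts(text: str) -> bool:
    """Report whether a plain UTF-8 encoder can write this string."""
    try:
        text.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


raw = b"ok \xff\xfe caf\xc3\xa9"   # two invalid bytes, then valid UTF-8
text = decode_utf8b(raw)
restored = encode_utf8b(text)      # identical to raw when both sides cooperate
```

With these definitions, `strict_encoder_accepts(text)` is false: an encoder that does not know about the #xDCxx convention refuses the string, which is the kind of failure described above.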

But UTF-8B could in fact be considered another encoding, and if I
really need it I might eventually send patches to make it optionally
available.  The advantages of this mode were mentioned on this list in
the earlier "Unicode: uncomfortable situation" thread.  The attempt is
attached nevertheless.  It works fine except for literal output.

Thanks a lot, to me it seems that ECL is on par with SBCL for UTF-8
input handling now.
-- 
Matt
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: custom-read-line.lisp
URL: <https://mailman.common-lisp.net/pipermail/ecl-devel/attachments/20110212/f157a86e/attachment.ksh>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: InvalidUTF8.txt
URL: <https://mailman.common-lisp.net/pipermail/ecl-devel/attachments/20110212/f157a86e/attachment.txt>