[Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]

Sat Feb 12 06:34:47 UTC 2011

On Sun, 30 Jan 2011 17:27:05 -0500
Matthew Mondor <mm_lists at pulsar-zone.net> wrote:

> On Sun, 30 Jan 2011 23:25:21 +0100
> Juan Jose Garcia-Ripoll <juanjose.garciaripoll at googlemail.com> wrote:
> 
> > stream-{en,de}coding-error-stream
> > stream-{en,de}coding-error-external-format
> > stream-encoding-error-stream-code
> > stream-decoding-error-stream-octets
> 
> Awesome, I'm looking forward to testing the new code soon.

Today I tried the new interface.  I had a few problems:

- The error signaled appears to be si::stream-decoding-error (an
  internal symbol);
- I get an error when I try to use the
  stream-decoding-error-stream-octets accessor.  I tried both
  si:stream-decoding-error-stream-octets and
  si::stream-decoding-error-stream-octets, but both give an unknown
  function error.  However, it appears that the parent
  si:character-decoding-error-octets can be invoked;
- The octets provided by si:character-decoding-error-octets appear to
  have been tampered with.  I obtain two octets per LATIN-1 character
  found in the stream, and these octets are much lower than the actual
  values (for the character 0xe9, the two octets returned were (9 99));
- Although I did see a USE-VALUE restart in the debugger, I could not
  find the corresponding restart function to call from a HANDLER-BIND
  handler;
- A sync restart appears to be missing, one that would allow reading to
  continue without having to supply a use-value.  For instance, the
  reader function might itself want to add every illegal octet > 127
  to its input, treating it as LATIN-1 or mapping it to a special
  Unicode range.  There could be more than one such octet in the
  sequence, in which case use-value cannot be used (unless it can
  accept multiple characters).  SBCL provides the
  sb-int:attempt-resync restart for that.

I attach a test/example case:
- InvalidUTF8.txt contains both UTF-8 and ISO-8859-15 text.
- custom-read-line.lisp contains a custom function to read a line
  which, as an example, attempts to convert invalid UTF-8 sequences to
  LATIN-1.  It also shows how this is done with SBCL.
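
For readers without the attachments, here is a minimal sketch of the
fallback idea only: decode as UTF-8 and, when an octet sequence is
invalid, reinterpret the offending octets as LATIN-1 and carry on.  It
is written in Python purely as a language-neutral illustration and is
not the code in custom-read-line.lisp, nor does it use the ECL or SBCL
API.  The handler name "latin1-fallback" and the function decode_line
are made up for this example.

```python
import codecs


def latin1_fallback(err):
    """Codec error handler: reinterpret undecodable octets as LATIN-1.

    Returns the replacement text and the position at which decoding
    should resume, which is what a resync-style restart would do.
    """
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad_octets = err.object[err.start:err.end]
    # Every octet value 0-255 maps to a code point in LATIN-1,
    # so this decode cannot fail.
    return bad_octets.decode("latin-1"), err.end


# Hypothetical handler name, registered for this example only.
codecs.register_error("latin1-fallback", latin1_fallback)


def decode_line(octets: bytes) -> str:
    """Decode one line as UTF-8, falling back to LATIN-1 per bad octet."""
    return octets.decode("utf-8", errors="latin1-fallback")


if __name__ == "__main__":
    # One valid UTF-8 "é" (0xc3 0xa9), then a lone LATIN-1 "é" (0xe9).
    mixed = b"\xc3\xa9 \xe9"
    print(decode_line(mixed))
```

Note that the handler may replace several octets with several
characters in one go, which is the case a single-character use-value
cannot cover.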

Perhaps I can eventually take more time to suggest diffs, but I wanted
to follow up first in case I missed something obvious.

Thanks,
-- 
Matt
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: InvalidUTF8.txt
URL: <https://mailman.common-lisp.net/pipermail/ecl-devel/attachments/20110212/e7e17e2f/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: custom-read-line.lisp
URL: <https://mailman.common-lisp.net/pipermail/ecl-devel/attachments/20110212/e7e17e2f/attachment.ksh>