[Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]
Matthew Mondor
mm_lists at pulsar-zone.net
Sat Feb 12 06:34:47 UTC 2011
On Sun, 30 Jan 2011 17:27:05 -0500
Matthew Mondor <mm_lists at pulsar-zone.net> wrote:
> On Sun, 30 Jan 2011 23:25:21 +0100
> Juan Jose Garcia-Ripoll <juanjose.garciaripoll at googlemail.com> wrote:
>
> > stream-{en,de}coding-error-stream
> > stream-{en,de}coding-error-external-format
> > stream-encoding-error-stream-code
> > stream-decoding-error-stream-octets
>
> Awesome, I'm looking forward to test the new code soon.
Today I tried the new interface. I had a few problems:
- The error signaled appears to be si::stream-decoding-error (internal
symbol)
- I get an error when trying to use the
stream-decoding-error-stream-octets accessor. I tried both
si:stream-decoding-error-stream-octets and
si::stream-decoding-error-stream-octets but get an unknown function
error in both cases. However, it appears that the parent
si:character-decoding-error-octets can be invoked;
- The octets provided by si:character-decoding-error-octets appear to
have been tampered with. I obtain two octets per LATIN-1 character
found in the stream, and these octets are much lower than the actual
values (for the character 0xe9, the two octets returned were (9 99));
- Although I did see a USE-VALUE restart in the debugger, I could not
find the corresponding restart function to invoke from a HANDLER-BIND
handler;
- There appears to be no resync restart that would allow reading to
continue without having to supply a value through use-value. For
instance, the reader function might want to add every illegal octet
> 127 to its input itself, treating such octets as LATIN-1 or mapping
them to a special Unicode range. There could also be more than one
octet in the sequence, in which case use-value cannot be used (unless
it can accept multiple characters). SBCL provides the
sb-int:attempt-resync restart for that.
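To make the last two points concrete, here is a minimal sketch in
portable Common Lisp of the restart protocol I have in mind. It is not
ECL's or SBCL's implementation: TOY-DECODING-ERROR, ATTEMPT-RESYNC (as
defined here), DECODE-UTF-8 and the other TOY- names are hypothetical
and exist only for this illustration. It works on an octet vector
rather than a stream, does not reject overlong forms or surrogates,
and assumes CODE-CHAR maps codes 128-255 to the LATIN-1 characters:

```lisp
;; Hypothetical condition carrying the offending octets as a list.
(define-condition toy-decoding-error (error)
  ((octets :initarg :octets :reader toy-decoding-error-octets))
  (:report (lambda (c s)
             (format s "Invalid UTF-8 octets: ~S"
                     (toy-decoding-error-octets c)))))

(defun toy-continuation-p (octet)
  (<= #x80 octet #xBF))

(defun toy-sequence-length (lead)
  "Expected sequence length for a lead octet, or NIL if it cannot lead."
  (cond ((< lead #x80) 1)
        ((<= #xC2 lead #xDF) 2)
        ((<= #xE0 lead #xEF) 3)
        ((<= #xF0 lead #xF4) 4)
        (t nil)))

(defun toy-assemble (octets start len)
  "Combine LEN octets starting at START into a code point."
  (if (= len 1)
      (aref octets start)
      (let ((code (logand (aref octets start) (ash #xFF (- (1+ len))))))
        (loop for k from (1+ start) below (+ start len)
              do (setf code (logior (ash code 6)
                                    (logand (aref octets k) #x3F))))
        code)))

(defun decode-utf-8 (octets emit)
  "Call EMIT on each decoded character. On an invalid sequence, signal
TOY-DECODING-ERROR with USE-VALUE and ATTEMPT-RESYNC restarts available."
  (let ((i 0) (n (length octets)))
    (loop while (< i n)
          do (let* ((lead (aref octets i))
                    (len (toy-sequence-length lead))
                    (j (1+ i)))
               ;; Consume the continuation octets that actually follow.
               (loop while (and (< j n)
                                (< j (+ i (or len 1)))
                                (toy-continuation-p (aref octets j)))
                     do (incf j))
               (if (and len (= j (+ i len)))
                   (funcall emit (code-char (toy-assemble octets i len)))
                   (restart-case
                       (error 'toy-decoding-error
                              :octets (coerce (subseq octets i j) 'list))
                     (use-value (char)
                       :report "Supply one replacement character."
                       (funcall emit char))
                     (attempt-resync ()
                       :report "Skip the invalid octets and continue."
                       nil)))
               ;; Either way, resume after the octets just examined.
               (setf i j)))))

(defun decode-with-latin-1-fallback (octets)
  "Decode OCTETS as UTF-8, treating invalid octets as LATIN-1."
  (with-output-to-string (out)
    (flet ((emit (char) (write-char char out)))
      (handler-bind
          ((toy-decoding-error
             (lambda (c)
               ;; The handler adds the characters itself (possibly
               ;; several), then asks the decoder to carry on.
               (dolist (o (toy-decoding-error-octets c))
                 (emit (code-char o)))
               (invoke-restart 'attempt-resync))))
        (decode-utf-8 octets #'emit)))))

;; "caf", a LATIN-1 e-acute (#xE9), a space, then a UTF-8 e-acute.
(decode-with-latin-1-fallback
 (coerce '(99 97 102 #xE9 32 #xC3 #xA9) '(vector (unsigned-byte 8))))
```

With only USE-VALUE available, the handler above could substitute a
single character (for example via the standard USE-VALUE function),
but it could not emit one character per invalid octet, which is why a
separate resync restart seems useful.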
I attach a test/example case:
- InvalidUTF8.txt contains both UTF-8 and ISO-8859-15 text.
- custom-read-line.lisp contains a custom function to read a line
which, as an example, attempts to convert invalid UTF-8 sequences to
LATIN-1. It also shows how the same is done with SBCL.
Perhaps I can eventually take more time to suggest diffs, but I wanted
to follow up first in case I missed something obvious.
Thanks,
--
Matt
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: InvalidUTF8.txt
URL: <https://mailman.common-lisp.net/pipermail/ecl-devel/attachments/20110212/e7e17e2f/attachment.txt>
-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: custom-read-line.lisp
URL: <https://mailman.common-lisp.net/pipermail/ecl-devel/attachments/20110212/e7e17e2f/attachment.ksh>