[Ecls-list] Unicode: uncomfortable situation

Mon Jan 24 04:52:41 UTC 2011

On Sun, 23 Jan 2011 23:15:42 -0500
Matthew Mondor <mm_lists at pulsar-zone.net> wrote:

> On Sun, 23 Jan 2011 22:52:36 -0500
> Matthew Mondor <mm_lists at pulsar-zone.net> wrote:
> 
> > With ECL, the invalid sequence is already consumed when a more generic
> > error occurs (I forgot which, but could check my CVS logs on request),
> > which only allowed me to either ignore that invalid sequence or
> > replace it with a placeholder Unicode character (0x241a or 0xfffd).  If
> > the output must remain unmodified, there is no other way than to use
> > bytes or 8-bit clean characters at the moment.  At least a year ago or
> > more, we discussed this situation a bit on this list, yet I've not
> > looked into it again since.  Perhaps this would be a good time to
> > resume this work.
> 
> Attached is an example CUSTOM-READ-LINE showing the difference.

I guess that another possibility (which could offer a simpler and
more efficient interface than the SBCL way) would be for ECL to
transparently decode the octets of an invalid UTF-8 sequence as
characters remapped into an unassigned range and, on UTF-8 output,
transparently write characters of that range back out as the literal
octets.  The application would still be free to deal with that range of
"characters" as it wants, as long as the behavior is documented...

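To make the round trip concrete, here is a minimal sketch of the remapping
idea.  It is written in Python purely for illustration and is not ECL code
or a proposed ECL API.  It relies on Python's "surrogateescape" error
handler (PEP 383), which maps each undecodable octet 0xNN to the lone
surrogate U+DCNN rather than to an unassigned range, so the exact range
differs from the one suggested above.  The helper names decode_lossless
and encode_lossless are hypothetical.

```python
# Illustration only: Python's "surrogateescape" handler, not an ECL API.
# Helper names are hypothetical.

def decode_lossless(octets: bytes) -> str:
    """Decode UTF-8; each undecodable octet 0xNN becomes U+DCNN."""
    return octets.decode("utf-8", errors="surrogateescape")


def encode_lossless(text: str) -> bytes:
    """Encode UTF-8; U+DC80..U+DCFF are written back as the raw octets."""
    return text.encode("utf-8", errors="surrogateescape")


# A valid two-octet sequence (e-acute) followed by two invalid octets.
raw = b"caf\xc3\xa9 \xff\xfe end"

text = decode_lossless(raw)

# The application can still see, and handle, the remapped "characters".
remapped = [hex(ord(c)) for c in text if 0xDC80 <= ord(c) <= 0xDCFF]

# Writing the string back out reproduces the input octets unmodified.
round_trip = encode_lossless(text)
```

In this sketch the valid part of the input decodes normally, the two bad
octets show up in the string as characters in the reserved range, and
encoding the string again yields the original octets.
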
Is anyone aware of the steps taken by implementations other than ECL or
SBCL when dealing with invalid UTF-8 sequences?  Perhaps another
implementation already does this, in which case ECL wouldn't be alone
if it chose that solution.

Thanks,
-- 
Matt