[Ecls-list] Sequence streams

Matthew Mondor mm_lists at pulsar-zone.net
Mon Aug 29 06:13:26 UTC 2011


On Sun, 28 Aug 2011 13:25:58 +0200
Juan Jose Garcia-Ripoll <juanjose.garciaripoll at googlemail.com> wrote:

> I think we have two different models in mind. This is how I see it

That's possible, or some minor misunderstanding; I'll read the
following carefully.

> * READ/WRITE-BYTE do not have external formats, period. They are binary I/O
> functions and the only customizable thing they have is the word size and the
> endianness, but they do not know about characters.

This makes sense to me as well.

> * Binary sequence streams are streams built on arrays of integers. They are
> interpreted as a collection of octets. The way these octets are handled is
> determined by two things:
>    - The array element type influences the byte size of READ/WRITE-BYTE.
>    - The external format determines how to read characters *from the
> octets*, independently of the byte size.
> This means that if you construct a binary sequence stream you will have to
> pay attention to the endianness of your data!

If I understand the above, READ-CHAR on a binary/bytes sequence stream
created on top of an array of UTF-8 octets would then work if the
EXTERNAL-FORMAT was UTF-8 (and signal a decoding error, with a restart,
on invalid octets).  If that's what it means, this is what I need: to
leverage ECL's UTF-8 decoder to convert arbitrary binary octets into
Unicode strings (without needing to worry about the internal codepoint
representation of ECL strings).
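
Concretely, something like the following is what I have in mind on the
decoding side (a minimal sketch: I am assuming the constructor is
EXT:MAKE-SEQUENCE-INPUT-STREAM and that it takes an :EXTERNAL-FORMAT
keyword; the exact names may differ in your tree):

;; Sketch: decode UTF-8 octets into a Lisp string, assuming
;; EXT:MAKE-SEQUENCE-INPUT-STREAM with an :EXTERNAL-FORMAT keyword.
(let* ((octets (make-array 5 :element-type '(unsigned-byte 8)
                             :initial-contents '(#xC3 #xA9 #x74 #xC3 #xA9)))
       (in (ext:make-sequence-input-stream octets :external-format :utf-8)))
  (with-output-to-string (s)
    (loop for ch = (read-char in nil nil)
          while ch do (write-char ch s))))
;; Expected result: "été"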

I also need to be able to encode ECL Unicode characters (and strings,
again without worrying about the internal representation) into UTF-8
binary octets, and this part worked fine in your initial sequence
streams implementation: WRITE-CHAR to a binary byte-vector stream with
a UTF-8 EXTERNAL-FORMAT would produce UTF-8 octets.

In this sense, the external format applies only to the binary sequence
side, with the character and string representation remaining internal
and transparent.
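
For the encoding direction, a corresponding sketch (again assuming the
constructor names, and that the output variant accepts an adjustable
octet vector with a fill pointer as its backing sequence):

;; Sketch: encode a Lisp string into UTF-8 octets, assuming
;; EXT:MAKE-SEQUENCE-OUTPUT-STREAM with an :EXTERNAL-FORMAT keyword.
(let* ((octets (make-array 16 :element-type '(unsigned-byte 8)
                              :adjustable t :fill-pointer 0))
       (out (ext:make-sequence-output-stream octets :external-format :utf-8)))
  (write-string "été" out)
  (finish-output out)
  octets)
;; Expected result: #(195 169 116 195 169), the UTF-8 octets of "été"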

> * Common Lisp strings have a fixed external format: latin-1 and unicode in
> ECL. This can not be changed and I do not want to change it in the
> foreseeable future. In consequence with the previous statement, my code did
> not contemplate reinterpretation of strings with different external formats.
> I still feel uneasy about this idea, because it is only a sign that
> you got your data wrong. Nevertheless, I have made the following changes.

I understand that ECL uses whatever internal representation it wishes
(currently UCS-4 for Unicode, or LATIN-1), that string streams will
also use it, and that I shouldn't have to worry about it.

I don't think I need reinterpretation of strings; there must have been
some misunderstanding there, possibly relating to a bug in the previous
example/test code I sent.  Sorry about that if so.

> - If no external format is supplied, a sequence stream based on a string
> works just like a string stream. Stop reading here then.
> 
> - Otherwise, if the string is a base-char string, it works like a binary
> stream with a byte size of 8 bits. This means it can be used for converting
> to and from different external formats by reinterpreting the same characters
> as octets.
> 
> - If the string contains extended characters, this fact is ignored and the
> string is interpreted as if it contained just 8-bit characters for external
> encodings. This means that now you can recode strings and ignore whether the
> string was stored in a Unicode or a base-char string.

I guess that I could use this functionality to "truncate" characters,
but that was already possible without too much trouble in user code,
and I don't really need it.
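
For clarity, this is how I read that behaviour (a sketch only; it
assumes a base-char string is accepted as the backing sequence and its
characters reinterpreted as octets):

;; Sketch: reinterpret the octets of a base-char (latin-1) string under
;; a different external format.  Assumes strings are accepted as the
;; backing sequence of a sequence input stream.
(let* ((raw (map 'base-string #'code-char '(#xC3 #xA9)))
       (in (ext:make-sequence-input-stream raw :external-format :utf-8)))
  (read-char in))
;; Expected result: the single character U+00E9 (é), i.e. the latin-1
;; octets #xC3 #xA9 decoded as UTF-8.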


On a tangent, I remember a previous thread where the internal ECL
Unicode representation was discussed, that it could perhaps be changed
eventually, at least on Windows, and that I replied that I already had
code relying on the internal representation being host-endian UCS-4.

Well, with the new sequence streams I will be able to stop using my own
UTF-8 encoder/decoder too, and it won't be necessary anymore to worry
about the internal representation, as ECL's native encoders/decoders
will no longer be a black box for user code.

Previously, the only way to leverage ECL's character encoding/decoding
routines from user code was to go through files or sockets.  For
anything more efficient or more complex, I had to use custom
encoding/decoding code.


Unfortunately, the mmap changes broke the ECL build, so I couldn't
immediately test your latest changes:

;;; Emitting code for INSTALL-BYTECODES-COMPILER.
;;; Note:
;;;   Invoking external command:
;;;   gcc -I. -I/home/mmondor/work/ecl-git/ecl/build/ -DECL_API -I/home/mmondor/work/ecl-git/ecl/build/c -I/usr/pkg/include -march=i686 -O2 -g -fPIC -Dnetbsd -I/home/mmondor/work/ecl-git/ecl/src/c -O2 -w -c clos/bytecmp.c -o clos/bytecmp.o 
;;; Finished compiling EXT:BYTECMP;BYTECMP.LSP.
;;;

Condition of type: SIMPLE-ERROR
EXT::MMAP failed.
Explanation: Invalid argument.
No restarts available.

Top level in: #<process TOP-LEVEL>.
> 

It's possible that MAP_FILE | MAP_SHARED is needed rather than only
MAP_SHARED.


Thanks again,
-- 
Matt



