[Ecls-list] Sequence streams
Juan Jose Garcia-Ripoll
juanjose.garciaripoll at googlemail.com
Sun Aug 28 11:25:58 UTC 2011
On Sun, Aug 28, 2011 at 11:53 AM, Matthew Mondor
<mm_lists at pulsar-zone.net> wrote:
> On Sun, 28 Aug 2011 11:40:29 +0200
> Juan Jose Garcia-Ripoll <juanjose.garciaripoll at googlemail.com> wrote:
>
> > What is the use of writing bytes to a string? In many cases you may end up
> > with corrupt sequences, since the bytes produced by an external format do
> > not need to correspond to valid strings in the latin-1 and ucs4 formats
> > which are used internally by ECL.
>
> What I would have expected is for bytes to be decoded to characters as
> per the specified external-format, just like when reading from a file
> or network stream. This way, the ECL unicode encoders and decoders are
> no longer a black box and no external resources are needed when custom
> encoders/decoders are used in user code...
>
> For instance in this case, URLs are encoded as UTF-8 octets with % HEX
> HEX for every needed and non-ASCII byte. I then could perform the
> needed decoding to bytes in a custom function and read the bytes as
> UTF-8 characters.
>
> Of course, if that worked, any decoding error would be expected to also
> signal an error just like when reading bytes from a file or socket.
I think we have two different models in mind. This is how I see it:
* READ/WRITE-BYTE do not have external formats, period. They are binary I/O
functions and the only customizable things they have are the word size and the
endianness, but they do not know about characters.
* Binary sequence streams are streams built on arrays of integers. They are
interpreted as a collection of octets. The way these octets are handled is
determined by two things:
  - The array element type influences the byte size of READ/WRITE-BYTE.
  - The external format determines how to read characters *from the
    octets*, independently of the byte size.
This means that if you construct a binary sequence stream you will have to
pay attention to the endianness of your data!
* Common Lisp strings have a fixed external format: latin-1 and unicode in
ECL. This cannot be changed and I do not want to change it in the
foreseeable future. Consequently, my code did not contemplate reinterpreting
strings with different external formats. I still feel uneasy about this idea,
because it is only a sign that you got your data wrong. Nevertheless I have
made the following changes:
  - If no external format is supplied, a sequence stream based on a string
    works just like a string stream. Stop reading here then.
  - Otherwise, if the string is a base-char string, it works like a binary
    stream with a byte size of 8 bits. This means it can be used for
    converting to and from different external formats by reinterpreting the
    same characters as octets.
  - If the string contains extended characters, this fact is ignored and the
    string is interpreted as if it contained just 8-bit characters for
    external encodings. This means that now you can recode strings and ignore
    whether the string was stored in a Unicode or a base-char string.
Your examples modified:
;;
;; Encode a string into a UTF-8 binary stream (this is the safest alternative)
;;
(defparameter *bytes*
  (make-array 16
              :element-type '(unsigned-byte 8)
              :fill-pointer 0))
(with-open-stream
    (s (ext:make-sequence-output-stream *bytes*
                                        :external-format :utf-8))
  ;; FORMAT expands ~% into a newline, which READ-LINE consumes below
  (format s "Héhéhéhé~%")
  (print *bytes*))
(with-open-stream
    (s (ext:make-sequence-input-stream *bytes*
                                       :start 0
                                       :external-format :utf-8))
  (print (read-line s)))
;;
;; Convert the binary representation into a string
;;
(defparameter *string*
  (make-array 16
              :element-type 'character
              :fill-pointer 0
              :adjustable t))
(with-open-stream (s (ext:make-sequence-output-stream *string*))
  (loop for b across *bytes*
        do (write-byte b s)
        finally (print *string*)))
;;
;; Encode directly into a string
;;
(setf (fill-pointer *string*) 0)
(with-open-stream
    (s (ext:make-sequence-output-stream *string*
                                        :external-format :utf-8))
  ;; FORMAT expands ~% into a newline, which READ-LINE consumes below
  (format s "Héhéhéhé~%")
  (print *string*))
;;
;; Decode the UTF-8 string
;;
(with-open-stream
    (s (ext:make-sequence-input-stream *string*
                                       :external-format :utf-8))
  (print (read-line s)))
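For the URL case you describe, something along these lines might work. This
is only a rough sketch: PERCENT-DECODE-TO-OCTETS is a hypothetical helper
written for illustration, not something ECL provides, and it assumes that
every unescaped character in the URL is plain ASCII and that every % is
followed by two hex digits. The resulting octets are then read back as UTF-8
through a binary sequence stream, as in the first example.

```lisp
;; Hypothetical helper (not part of ECL): turn %HEXHEX escapes into octets.
;; Assumes unescaped characters are ASCII and escapes are well formed.
(defun percent-decode-to-octets (url)
  (let ((octets (make-array (length url)
                            :element-type '(unsigned-byte 8)
                            :fill-pointer 0)))
    (loop with i = 0
          while (< i (length url))
          do (let ((c (char url i)))
               (cond ((char= c #\%)
                      ;; Two hex digits after % form one octet
                      (vector-push (parse-integer url
                                                  :start (+ i 1)
                                                  :end (+ i 3)
                                                  :radix 16)
                                   octets)
                      (incf i 3))
                     (t
                      (vector-push (char-code c) octets)
                      (incf i)))))
    octets))

;; Read the decoded octets as UTF-8 characters
(with-open-stream
    (s (ext:make-sequence-input-stream
        (percent-decode-to-octets "H%C3%A9h%C3%A9")
        :external-format :utf-8))
  (print (read-line s)))
```

A malformed escape makes PARSE-INTEGER signal an error in the helper, and an
invalid UTF-8 sequence should be reported by the stream when reading, just as
with a file or socket.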
--
Instituto de Física Fundamental, CSIC
c/ Serrano, 113b, Madrid 28006 (Spain)
http://juanjose.garciaripoll.googlepages.com