[Ecls-list] Sequence streams

Juan Jose Garcia-Ripoll juanjose.garciaripoll at googlemail.com
Sun Aug 28 11:25:58 UTC 2011


On Sun, Aug 28, 2011 at 11:53 AM, Matthew Mondor
<mm_lists at pulsar-zone.net> wrote:

> On Sun, 28 Aug 2011 11:40:29 +0200
> Juan Jose Garcia-Ripoll <juanjose.garciaripoll at googlemail.com> wrote:
>
> > What is the use of writing bytes to a string? In many cases you may end up
> > with corrupt sequences, since the bytes produced by an external format do
> > not need to correspond to valid strings in the latin-1 and ucs4 formats
> > which are used internally by ECL.
>
> What I would have expected is for bytes to be decoded to characters as
> per the specified external-format, just like when reading from a file
> or network stream.  This way, the ECL unicode encoders and decoders are
> no longer a black box and no external resources are needed when custom
> encoders/decoders are used in user code...
>
> For instance in this case, URLs are encoded as UTF-8 octets with % HEX
> HEX for every needed and non-ASCII byte.  I then could perform the
> needed decoding to bytes in a custom function and read the bytes as
> UTF-8 characters.
>
> Of course, if that worked, any decoding error would be expected to also
> signal an error just like when reading bytes from a file or socket.


I think we have two different models in mind. This is how I see it:

* READ/WRITE-BYTE do not have external formats. Period. They are binary I/O
functions and the only customizable things they have are the word size and
the endianness, but they do not know about characters.

* Binary sequence streams are streams built on arrays of integers. They are
interpreted as a collection of octets. The way these octets are handled is
determined by two things:
   - The array element type influences the byte size of READ/WRITE-BYTE.
   - The external format determines how to read characters *from the
octets*, independently of the byte size.
This means that if you construct a binary sequence stream you will have to
pay attention to the endianness of your data!

* Common Lisp strings have a fixed external format: latin-1 and unicode in
ECL. This cannot be changed and I do not want to change it in the
foreseeable future. Consequently, my code did not contemplate
reinterpreting strings with different external formats. I still feel uneasy
about this idea, because needing it is usually a sign that you got your
data wrong. Nevertheless, I have made the following changes.

- If no external format is supplied, a sequence stream based on a string
works just like a string stream. Stop reading here then.

- Otherwise, base-char strings work like binary streams with a byte size of
8 bits. This means they can be used for converting to and from different
external formats by reinterpreting the same characters as octets.

- If the string contains extended characters, this fact is ignored and the
string is interpreted as if it contained just 8-bit characters for external
encodings. This means that now you can recode strings and ignore whether the
string was stored in a Unicode or a base-char string.
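
As a rough illustration of what "reinterpreting the same characters as
octets" means, here is a minimal sketch in portable Common Lisp. The two
function names are hypothetical and are not part of ECL. The sketch assumes
that character codes below 256 coincide with Latin-1, which holds when
character codes are Unicode code points.

```lisp
;; Hypothetical helpers, for illustration only; not part of ECL.

(defun octets-to-string-verbatim (octets)
  "Store each octet as the character with the same code, without decoding."
  (map 'string #'code-char octets))

(defun string-to-octets-verbatim (string)
  "Read back the character codes of STRING as octets.
Every character code is assumed to fit in 8 bits."
  (map '(vector (unsigned-byte 8)) #'char-code string))

;; The UTF-8 encoding of "Hé" is the three octets 72, 195, 169.
;; Stored verbatim, they become a three-character string that is
;; not "Hé" but still carries the same octets.
(defparameter *raw* (octets-to-string-verbatim #(72 195 169)))
```

Such a string only becomes readable text again once its octets are decoded
with the right external format, which is what the string-based sequence
streams with an :external-format argument are meant to do.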

Your examples modified:
;;
;; Encode a string in UTF-8 binary stream (this is the safest alternative)
;;
(defparameter *bytes*
  (make-array 16
              :element-type '(unsigned-byte 8)
              :fill-pointer 0))

(with-open-stream
   (s (ext:make-sequence-output-stream *bytes*
                                       :external-format :utf-8))
 ;; WRITE-STRING does not interpret "~%", so WRITE-LINE adds the newline.
 (write-line "Héhéhéhé" s)
 (print *bytes*))

(with-open-stream
   (s (ext:make-sequence-input-stream *bytes*
       :start 0
       :external-format :utf-8))
 (print (read-line s)))

;;
;; Convert the binary representation into a string
;;
(defparameter *string* (make-array 16
                                   :element-type 'character
                                   :fill-pointer 0
                                   :adjustable t))
(with-open-stream (s (ext:make-sequence-output-stream *string*))
 (loop
    for b across *bytes*
    do (write-byte b s)
  finally
  (print *string*)))

;;
;; Encode directly into a string
;;

(setf (fill-pointer *string*) 0)
(with-open-stream
   (s (ext:make-sequence-output-stream *string*
                                       :external-format :utf-8))
 ;; WRITE-STRING does not interpret "~%", so WRITE-LINE adds the newline.
 (write-line "Héhéhéhé" s)
 (print *string*))

;;
;; Decode the UTF-8 string
;;
(with-open-stream (s (ext:make-sequence-input-stream *string*
                         :external-format :utf-8))
  (print (read-line s)))
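
The URL use case from the quoted message could be sketched along the same
lines. This is only an illustration under stated assumptions: both function
names are hypothetical, the first function is portable Common Lisp that
turns %XX escapes into octets with minimal validation, and the second
assumes that ext:make-sequence-input-stream accepts an octet vector and an
:external-format argument as in the examples above.

```lisp
;; Hypothetical helper, for illustration only; not part of ECL.
(defun percent-decode-to-octets (text)
  "Return a vector of octets decoded from the percent-encoded string TEXT.
Characters outside a %XX escape are assumed to be ASCII."
  (let ((octets (make-array (length text)
                            :element-type '(unsigned-byte 8)
                            :fill-pointer 0)))
    (loop with i = 0
          while (< i (length text))
          do (let ((ch (char text i)))
               (cond ((char= ch #\%)
                      ;; Minimal validation only: PARSE-INTEGER rejects
                      ;; truncated or non-numeric escapes.
                      (vector-push (parse-integer text
                                                  :start (+ i 1)
                                                  :end (+ i 3)
                                                  :radix 16)
                                   octets)
                      (incf i 3))
                     (t
                      (vector-push (char-code ch) octets)
                      (incf i)))))
    octets))

;; ECL only: decode the octets through a sequence input stream.
;; READ-LINE stops at the first newline, which is enough for a URL.
#+ecl
(defun decode-utf-8-octets (octets)
  "Decode OCTETS as UTF-8 and return the first line as a string."
  (with-open-stream (s (ext:make-sequence-input-stream octets
                                                       :external-format :utf-8))
    (read-line s nil "")))
```

Under ECL, (decode-utf-8-octets (percent-decode-to-octets "H%C3%A9h%C3%A9"))
would then read the decoded octets back as UTF-8 characters, and a decoding
error would surface from the stream rather than from user code.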

-- 
Instituto de Física Fundamental, CSIC
c/ Serrano, 113b, Madrid 28006 (Spain)
http://juanjose.garciaripoll.googlepages.com