On Sun, Aug 28, 2011 at 11:53 AM, Matthew Mondor <span dir="ltr"><<a href="mailto:mm_lists@pulsar-zone.net">mm_lists@pulsar-zone.net</a>></span> wrote:<br><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex;">
On Sun, 28 Aug 2011 11:40:29 +0200<br>
<div class="im">Juan Jose Garcia-Ripoll <<a href="mailto:juanjose.garciaripoll@googlemail.com">juanjose.garciaripoll@googlemail.com</a>> wrote:<br>
<br>
</div><div class="im">> What is the use of writing bytes to a string? In many cases you may end up<br>
> with corrupt sequences, since the bytes produced by a external format do not<br>
> need to correspond to valid strings in the latin-1 and ucs4 formats which<br>
> are used internally by ECL.<br>
<br>
</div>What I would have expected is for bytes to be decoded to characters as<br>
per the specified external-format, just like when reading from a file<br>
or network stream. This way, the ECL unicode encoders and decoders are<br>
no longer a black box and no external resources are needed when custom<br>
encoders/decoders are used in user code...<br>
<br>
For instance in this case, URLs are encoded as UTF-8 octets with % HEX<br>
HEX for every needed and non-ASCII byte. I then could perform the<br>
needed decoding to bytes in a custom function and read the bytes as<br>
UTF-8 characters.<br>
<br>
Of course, if that worked, any decoding error would be expected to also<br>
signal an error just like when reading bytes from a file or socket.</blockquote></div><br>I think we have two different models in mind. This is how I see it:<br><br>* READ/WRITE-BYTE do not have external formats. Period. They are binary I/O functions, and the only customizable things they have are the word size and the endianness; they do not know about characters.<br>
<br>* Binary sequence streams are streams built on arrays of integers. They are interpreted as a collection of octets. The way these octets are handled is determined by two things:<br> - The array element type influences the byte size of READ/WRITE-BYTE.<br>
- The external format determines how to read characters *from the octets*, independently of the byte size.<br>This means that if you construct a binary sequence stream you will have to pay attention to the endianness of your data!<br>
<br>* Common Lisp strings have a fixed external format: latin-1 and Unicode in ECL. This cannot be changed, and I do not want to change it in the foreseeable future. As a consequence, my code did not contemplate reinterpreting strings with different external formats. I still feel uneasy about this idea, because it is only a sign that you got your data wrong. Nevertheless, I have made the following changes.<br>
<br>- If no external format is supplied, a sequence stream based on a string works just like a string stream. In that case you can stop reading here.<br><br>- Otherwise, if the string is a base-char string, it works like a binary stream with a byte size of 8 bits. This means it can be used for converting to and from different external formats by reinterpreting the same characters as octets.<br>
<br>- If the string contains extended characters, this fact is ignored and the string is interpreted as if it contained just 8-bit characters for external encodings. This means that now you can recode strings and ignore whether the string was stored in a Unicode or a base-char string.<br>
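<br>For the URL case you describe, the percent-decoding step that produces the octets could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: PERCENT-DECODE-TO-OCTETS is a hypothetical helper (it is not part of ECL), it assumes the input string contains only ASCII characters, and it does no validation beyond what PARSE-INTEGER does. The resulting vector can then be read through a sequence input stream with :external-format :utf-8, in the same way as the first example below.<br>

```lisp
;; Hypothetical helper, NOT part of ECL: turn "%HH" escapes into octets.
;; Assumption: URL contains only ASCII characters outside the escapes.
(defun percent-decode-to-octets (url)
  (let ((octets (make-array (length url)
                            :element-type '(unsigned-byte 8)
                            :fill-pointer 0))
        (i 0))
    (loop while (< i (length url))
          do (let ((c (char url i)))
               (cond ((char= c #\%)
                      ;; Two hex digits follow the percent sign.
                      ;; PARSE-INTEGER signals an error on malformed input.
                      (vector-push (parse-integer url
                                                  :start (+ i 1)
                                                  :end (+ i 3)
                                                  :radix 16)
                                   octets)
                      (incf i 3))
                     (t
                      ;; Plain ASCII character: its code is the octet.
                      (vector-push (char-code c) octets)
                      (incf i)))))
    octets))

;; Usage under ECL only: decode the octets as UTF-8 characters.
#+ecl
(with-open-stream
    (s (ext:make-sequence-input-stream
        (percent-decode-to-octets "H%C3%A9h%C3%A9")
        :start 0
        :external-format :utf-8))
  (print (read-line s)))
```

The decoded vector is never longer than the input string, so a non-adjustable vector sized to the input is sufficient here.<br>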
<br>Your examples modified:<br>;;<br>;; Encode a string in UTF-8 binary stream (this is the safest alternative)<br>;;<br>(setf *bytes*<br> (make-array 16<br> :element-type '(unsigned-byte 8)<br> :fill-pointer 0))<br>
<br>(with-open-stream<br> (s (ext:make-sequence-output-stream *bytes*<br> :external-format :utf-8))<br> (format s "Héhéhéhé~%")<br> (print *bytes*))<br><br>(with-open-stream<br>
(s (ext:make-sequence-input-stream *bytes*<br> :start 0<br> :external-format :utf-8))<br> (print (read-line s)))<br><br>;;<br>;; Convert the binary representation into a string<br>;;<br>(setf *string* (make-array 16<br>
:element-type 'character<br> :fill-pointer 0<br> :adjustable t))<br>(with-open-stream (s (ext:make-sequence-output-stream *string*))<br> (loop<br>
for b across *bytes*<br> do (write-byte b s)<br> finally<br> (print *string*)))<br><br>;;<br>;; Encode directly into a string<br>;;<br><br>(setf (fill-pointer *string*) 0)<br>(with-open-stream<br> (s (ext:make-sequence-output-stream *string*<br>
:external-format :utf-8))<br> (format s "Héhéhéhé~%")<br> (print *string*))<br><br>;;<br>;; Decode the UTF-8 string<br>;;<br>(with-open-stream (s (ext:make-sequence-input-stream *string*<br>
:external-format :utf-8))<br> (print (read-line s)))<br><br>-- <br>Instituto de Física Fundamental, CSIC<br>c/ Serrano, 113b, Madrid 28006 (Spain) <br><a href="http://juanjose.garciaripoll.googlepages.com" target="_blank">http://juanjose.garciaripoll.googlepages.com</a><br>