[Ecls-list] UTF-8 sequence decoding errors [Was: Upcoming changes]

Juan Jose Garcia-Ripoll juanjose.garciaripoll at googlemail.com
Sat Feb 12 18:07:43 UTC 2011


Thanks for the detailed report. I made some changes.

* The relevant symbols are exported from the EXT package. They are:

character-coding-error
character-coding-error-external-format
character-decoding-error
character-decoding-error-octets
character-encoding-error
character-encoding-error-code
stream-encoding-error
stream-decoding-error

* Two restarts are provided: USE-VALUE and CONTINUE. They can be invoked via
the ANSI functions of the same name (I think you missed that point regarding
USE-VALUE).

* Encoding errors are now signaled as well. Previously, the function had not
been plugged into the engine.

* I am not likely to provide multi-character restarts for a simple reason:
ECL's streams are too simple and do not provide arbitrary push-back buffers
for bytes. Having a USE-VALUE restart that returns more than one character may
lead to unexpected problems with unread-char and other functions. I do not
mean it is impossible, but it complicates the interface and right now I have
no clear idea how to do that.
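
As an illustration of the restart mechanics described above, here is a minimal, portable sketch. It does not use ECL's decoder: DEMO-DECODING-ERROR, DEMO-DECODE-OCTET and DEMO-DECODE are hypothetical names invented for this example, standing in for EXT:CHARACTER-DECODING-ERROR and the stream machinery. It only shows how a handler can pick between the two restarts through the ANSI functions USE-VALUE and CONTINUE.

```lisp
;; Hypothetical condition standing in for EXT:CHARACTER-DECODING-ERROR,
;; so that the sketch runs on any conforming implementation.
(define-condition demo-decoding-error (error)
  ((octets :initarg :octets :reader demo-decoding-error-octets)))

;; Hypothetical decoder: accepts only ASCII octets and, for anything
;; else, signals an error with USE-VALUE and CONTINUE restarts.
(defun demo-decode-octet (octet)
  (if (< octet 128)
      (code-char octet)
      (restart-case (error 'demo-decoding-error :octets (list octet))
        (use-value (c)
          :report "Supply a replacement character."
          c)
        (continue ()
          :report "Skip the invalid octet."
          nil))))

(defun demo-decode (octets &key replace-char)
  (handler-bind ((demo-decoding-error
                   (lambda (e)
                     ;; USE-VALUE and CONTINUE are the ANSI functions
                     ;; that invoke the restarts of the same name.
                     (if replace-char
                         (use-value replace-char e)
                         (continue e)))))
    ;; Skipped octets come back as NIL and are dropped here.
    (coerce (remove nil (mapcar #'demo-decode-octet octets)) 'string)))
```

With these definitions, (demo-decode '(72 105 233 33)) should drop the invalid octet, while (demo-decode '(72 105 233 33) :replace-char #\?) should substitute a question mark for it.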

I attached a modified version of your code.

Best,

Juanjo

-- 
Instituto de Física Fundamental, CSIC
c/ Serrano, 113b, Madrid 28006 (Spain)
http://juanjose.garciaripoll.googlepages.com
-------------- next part --------------

(defun custom-read-line (stream &key (max 512) (replace-char))
  (let ((line (make-array max
			  :element-type 'character
			  :adjustable t
			  :fill-pointer 0)))
    (flet ((add-char (c)
	     (declare (type character c))
	     ;; VECTOR-PUSH would silently drop characters once MAX is reached
	     (vector-push-extend c line))
	   (finalize-line ()
	     (let ((len (length line)))
	       (when (and (> len 0)
			  (char= #\Return (aref line (1- len))))
		 (vector-pop line)))
	     line))
      (loop
	  do
	   (let (
		 ;; No way to determine invalid octet values with old ECL,
		 ;; Return an unknown character code
		 (c #+old-ecl(handler-case
			     (read-char stream)
			   (simple-error ()
			     #\UFFFD))
		    ;; SBCL provides invalid octets which we can import and
		    ;; then issue an ATTEMPT-RESYNC restart to resume
		    #+sbcl(handler-bind
			      ((sb-int:stream-decoding-error
				#'(lambda (e)
				    ;; Treat invalid UTF-8 octets as
				    ;; ISO-8859 characters.
				    (mapc #'(lambda (c)
					      (when (> c 127)
						(add-char (code-char c))))
					  (sb-int:character-decoding-error-octets e))
				    (invoke-restart 'sb-int:attempt-resync))))
			    (read-char stream))
		    ;; Test with new ECL
		    #+ecl(handler-bind
			     ((ext:character-decoding-error ; exported from EXT
			       #'(lambda (e)
				   ;; MAPC, since the loop runs for side effects only
				   (mapc #'(lambda (c)
					     (format t "~%Code: ~A" c)
					     (when (> c 127)
					       ;; Never happens
					       (add-char (code-char c))))
					 (ext:character-decoding-error-octets e))
				   ;; Either replace the character or ignore it
				   (if replace-char
				       (use-value replace-char)
				       (continue)))))
			   (read-char stream)))
		 )
	     (when (char= #\Newline c)
	       (return (values (finalize-line) t)))
	     (add-char c))))))

(defun test (&rest args)
  (with-open-file (stream "InvalidUTF8.txt")
    (loop
       do
	 (let ((line (handler-case
			 (apply #'custom-read-line stream args)
		       (end-of-file ()
			 (loop-finish)))))
	   (format t "~A~%" line)))))

(test)
#+ecl
(test :replace-char #\?)
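
The test above reads InvalidUTF8.txt, which is not part of this message. A minimal sketch for producing such a file follows, assuming any short byte sequence with stray high octets will do. WRITE-INVALID-UTF8-SAMPLE and the sample octets are hypothetical, not the original test data, and the function would have to be called before the TEST forms are evaluated.

```lisp
;; Hypothetical helper: writes two lines, each containing one octet
;; that cannot be decoded as UTF-8 in that position.
(defun write-invalid-utf8-sample (&optional (path "InvalidUTF8.txt"))
  (with-open-file (out path
                       :direction :output
                       :element-type '(unsigned-byte 8)
                       :if-exists :supersede)
    ;; "caf" + 233 (a lead byte with no continuation byte), newline,
    ;; "ok" + 255 (never valid in UTF-8), newline.
    (write-sequence (coerce '(99 97 102 233 10 111 107 255 10)
                            '(vector (unsigned-byte 8)))
                    out))
  path)
```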

