[slime-devel] CMUCL unicode strings breaks slime

Raymond Toy toy.raymond at gmail.com
Fri Oct 1 15:18:59 UTC 2010


On 10/1/10 6:35 AM, Helmut Eller wrote:
> * Raymond Toy [2010-10-01 10:20] writes:
> 
>>> What is the length of *s* or (prin1-to-string *s*) now?
>>> Should it be 3 not 4?
>>
>> Good question.  The answer now is 4, not 3.  There are 4 code units in
>> the string, so that is the length.  Length would be really slow if it
>> had to scan the whole string looking for surrogate pairs and counting
>> them as one instead of two.
>>
>> Is that the reason for the problem?  Confusion between emacs and lisp on
>> the length of the string?  It does appear that the string only has 3
>> characters, as displayed by emacs.
> 
> Very likely, Emacs uses something like utf-8 internally and counts code points
> not code units (except for line endings which is probably a different
> issue).
> 
>> Doesn't acl have this problem too?  It also uses 16-bit strings like
>> cmucl.
> 
> Allegro has no lisp:codepoint function and (code-char #x10000) 
> returns nil.  

The lisp:codepoint function was just a convenience for creating the
necessary surrogate pair.
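
For illustration, the same pair can be computed by hand from the code
point.  This is only a sketch in portable Common Lisp; the name
surrogate-pair is made up for this example and is not part of CMUCL.

```lisp
;; Hypothetical helper, not a CMUCL function: split a code point
;; outside the BMP into its UTF-16 high and low surrogate code units.
(defun surrogate-pair (code)
  "Return the high and low surrogate code units for CODE as two values.
  CODE must be in the range #x10000 to #x10FFFF."
  (check-type code (integer #x10000 #x10FFFF))
  (let ((offset (- code #x10000)))
    ;; The top 10 bits go into the high surrogate, the low 10 bits
    ;; into the low surrogate.
    (values (+ #xD800 (ash offset -10))
            (+ #xDC00 (logand offset #x3FF)))))

;; (surrogate-pair #x10000) returns the two values #xD800 and #xDC00.
```

So a string holding one such character contains two code units, which
is why length reports one more than the number of characters shown.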
> 
> In Java, strings have a length method which returns code units and a
> codePointCount method for the other use.  Maybe CMUCL has something like
> that and we should use it in SWANK.

CMUCL doesn't currently have a codePointCount function, but that's easy
enough to add if slime wants it.  Here's one:

(defun codepoint-count (string)
  "Return the number of code points in the string.  The string MUST be
  a valid UTF-16 string."
  (do ((len (length string))
       (index 0 (1+ index))
       (count 0 (1+ count)))
      ((>= index len)
       count)
    (multiple-value-bind (codepoint wide)
        (lisp:codepoint string index)
      (declare (ignore codepoint))
      (when wide (incf index)))))
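
The same count can be sketched without lisp:codepoint, which may help
when checking the idea on another implementation.  This is a different
technique from the function above: it works on a vector of code units
given as integers and simply skips low surrogates.  The name
count-code-points and the integer-vector input are assumptions made
for the example, and like the function above it assumes valid UTF-16.

```lisp
;; Hypothetical portable variant: UNITS is a vector of UTF-16 code
;; units as integers, not a CMUCL string.
(defun count-code-points (units)
  "Return the number of code points encoded by UNITS, a vector of
  UTF-16 code units.  UNITS MUST be valid UTF-16."
  ;; In valid UTF-16 every low surrogate (#xDC00 to #xDFFF) is the
  ;; second half of a pair, so each code point is counted exactly once
  ;; by counting every unit that is not a low surrogate.
  (count-if-not (lambda (u) (<= #xDC00 u #xDFFF)) units))

;; Four code units, one of them a surrogate pair, give three code points:
;; (count-code-points (vector #x61 #xD800 #xDC00 #x62)) returns 3.
```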

Ray




