[slime-devel] CMUCL unicode strings breaks slime
Raymond Toy
toy.raymond at gmail.com
Fri Oct 1 15:18:59 UTC 2010
On 10/1/10 6:35 AM, Helmut Eller wrote:
> * Raymond Toy [2010-10-01 10:20] writes:
>
>>> What is the length of *s* or (prin1-to-string *s*) now?
>>> Should it be 3 not 4?
>>
>> Good question. The answer now is 4, not 3. There are 4 code units in
>> the string, so that is the length. Length would be really slow if it
>> had to scan the whole string looking for surrogate pairs and counting
>> them as one instead of two.
>>
>> Is that the reason for the problem? Confusion between emacs and lisp on
>> the length of the string? It does appear that the string only has 3
>> characters, as displayed by emacs.
>
> Very likely, Emacs uses something like utf-8 internally and counts code points
> not code units (except for line endings which is probably a different
> issue).
>
>> Doesn't acl have this problem too? It also uses 16-bit strings like
>> cmucl.
>
> Allegro has no lisp:codepoint function and (code-char #x10000)
> returns nil.
The lisp:codepoint function was just a convenience for creating the
necessary surrogate pair.
>
> In Java, strings have a length method which returns code units and a
> codePointCount method for the other use. Maybe CMUCL has something like
> that and we should use it in SWANK.
CMUCL doesn't currently have a codePointCount function, but that's easy
enough to add if slime wants it. Here's one:
(defun codepoint-count (string)
  "Return the number of code points in the string.  The string MUST be
  a valid UTF-16 string."
  (do ((len (length string))
       (index 0 (1+ index))
       (count 0 (1+ count)))
      ((>= index len)
       count)
    (multiple-value-bind (codepoint wide)
        (lisp:codepoint string index)
      (declare (ignore codepoint))
      (when wide (incf index)))))
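
For comparison, here is a rough sketch of the same count that does not
depend on lisp:codepoint and only inspects char-code values. The names
high-surrogate-p, low-surrogate-p and portable-codepoint-count are made
up for illustration and are not part of CMUCL or SWANK. It assumes the
string holds UTF-16 code units and counts an unpaired surrogate as one
code point.

```lisp
;; Hypothetical helpers, not part of CMUCL or SWANK.
(defun high-surrogate-p (code)
  "True if CODE is a UTF-16 leading (high) surrogate."
  (<= #xD800 code #xDBFF))

(defun low-surrogate-p (code)
  "True if CODE is a UTF-16 trailing (low) surrogate."
  (<= #xDC00 code #xDFFF))

(defun portable-codepoint-count (string)
  "Count code points in STRING, treating its characters as UTF-16 code
units.  A high surrogate followed by a low surrogate counts once; an
unpaired surrogate also counts once."
  (let ((len (length string))
        (index 0)
        (count 0))
    (loop while (< index len)
          do (let ((code (char-code (char string index))))
               (incf count)
               ;; Step over both code units of a surrogate pair.
               (incf index
                     (if (and (high-surrogate-p code)
                              (< (1+ index) len)
                              (low-surrogate-p
                               (char-code (char string (1+ index)))))
                         2
                         1))))
    count))
```

With a string built from the four code units a, #xD800, #xDC00, b (this
assumes code-char returns a character for surrogate values, as it does
on a 16-bit-string Lisp), length should report 4 while the sketch
reports 3:

```lisp
(let ((s (coerce (list #\a (code-char #xD800) (code-char #xDC00) #\b)
                 'string)))
  (list (length s) (portable-codepoint-count s)))
```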
Ray