[cffi-devel] a thought on string encodings

Hoehle, Joerg-Cyril Joerg-Cyril.Hoehle at t-systems.com
Mon Jan 2 13:18:46 UTC 2006


James Bielman wrote:
>I'm working on (to begin with), a UTF8-STRING type
>which converts Lisp strings to/from UTF-8 on Unicode Lisps.

>(Does the CLISP FFI provide something like memcpy?)
No. Some voice in me says I should rather implement the vector<->memory functions for CLISP instead of playing around with cffi, Iterate, reading interesting papers etc. when I have a little time.

>So, I think I need that block interface we've talked about. 
Reviewing the proposal has been on my TODO list for a long time as well, I'm sorry.

>  (ffi:with-foreign-string (ptr chars bytes s :encoding charset:utf-8)
>    (let ((buf (foreign-alloc :unsigned-char :count bytes)))
>      (memcpy buf ptr bytes)
I think I'd rather use ext:convert-string-to-bytes, then use my non-existent vector->memory function.  Sometimes I feel uneasy about stack-allocating possibly huge strings (e.g. 1MB or more!).
In the meantime,
(let* ((bytes (ext:convert ...))
       (len (length bytes))
       (buf (foreign-alloc :unsigned-char :count len)))
  (setf (memory-as buf (parse-c-type `(c-array uint8 ,len)))
        bytes))
should be among the fastest approaches (CLISP can copy the (c-array uint8 N) type fast). And people keep saying the generational GC should be able to free the garbage vector easily.


>However, I haven't been able to find an inverse for
>FFI:WITH-FOREIGN-STRING.
Indeed, I should ... (see above) and implement string-from-foreign and (foreign-string-length :encoding) as a means to interface to mblen().

>I'd like to be able to convert a pointer back
>to a Lisp string without looping in bytecode to create a vector of
>octets from the pointer.
I suggest the converse of the above, via an array of (unsigned-byte 8).

> I tried a
>whole bunch of combinations of FFI:MEMORY-AS with FFI:C-ARRAY-PTR types
>and got nothing but segfaults.
Please report a bug, but possibly you just did not re-read recent impnotes closely enough.  For instance, while the FFI now accepts non 1:1 encodings, its use with c-array-ptr and c-array[-max] is mostly broken and some parts revert to a 1:1 encoding (see *foreign-8bit-encoding*).  I would have left the 1:1 restriction in place and taken more time to think about the problems. :-(

The operators that explicitly take an :encoding are safe, like with-foreign-string.

Also, custom:*foreign-encoding* is a symbol macro. Thus (let ((custom:*foreign-encoding* charset:foo))) won't work as expected: the let only creates a lexical binding that the FFI never sees. You need setq and unwind-protect. :-(
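
The setq/unwind-protect dance could be wrapped in a small macro. This is only a sketch for CLISP; the name with-foreign-encoding is hypothetical and is not part of CLISP or cffi.

```lisp
;; Hypothetical helper (CLISP only): temporarily change the FFI's
;; default encoding.  LET cannot rebind a symbol macro dynamically,
;; so the old value is saved by hand and restored in the cleanup form.
(defmacro with-foreign-encoding ((encoding) &body body)
  "Run BODY with CUSTOM:*FOREIGN-ENCODING* set to ENCODING, then restore it."
  (let ((old (gensym "OLD-ENCODING")))
    `(let ((,old custom:*foreign-encoding*))
       (unwind-protect
            (progn (setq custom:*foreign-encoding* ,encoding)
                   ,@body)
         ;; Runs on normal return and on non-local exit alike.
         (setq custom:*foreign-encoding* ,old)))))
```

Usage would look like (with-foreign-encoding (charset:utf-8) ...). Note that the change is still global while the body runs, so this does not replace the operators that take an explicit :encoding.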

> Is there something I can use to convert
>the pointer to either a vector of octets (which I can pass to
>EXT:CONVERT-STRING-FROM-BYTES), or to a Lisp string directly?
(memory-as pointer (parse-c-type `(c-array uint8 ,len)))
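
Spelled out with package prefixes, the octet route might look like the sketch below. The function name foreign-utf8-to-string is made up for illustration, and the byte length is assumed to be known by the caller.

```lisp
;; Hypothetical helper (CLISP only): read LEN bytes starting at POINTER
;; and decode them as UTF-8.  MEMORY-AS copies the foreign bytes into a
;; Lisp octet vector in one step, so there is no loop in bytecode.
(defun foreign-utf8-to-string (pointer len)
  "Return the Lisp string encoded as UTF-8 in the LEN bytes at POINTER."
  (let ((octets (ffi:memory-as
                 pointer
                 (ffi:parse-c-type `(ffi:c-array ffi:uint8 ,len)))))
    (ext:convert-string-from-bytes octets charset:utf-8)))
```

The intermediate octet vector becomes garbage right away, which is the price for never touching custom:*foreign-encoding*.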

Looking at the clisp sources, (c-array character N) -> Lisp string seems ok as well.  But e.g. don't use (c-array-max character N) with UTF-16!
As you can see, there's room for improvement.

Summary:
ptr -> Lisp string:
either ext:convert + memory-as uint8
or (let ((old custom:*foreign-encoding*))
     (unwind-protect
          (progn (setq custom:*foreign-encoding* charset:utf-8)
                 (memory-as pointer (parse-c-type `(c-array character ,known-length))))
       (setq custom:*foreign-encoding* old))) ; looks scary :-(
Do you need unknown-length as well?
Lisp string -> ptr:
ext:convert + memory-as uint8 as mentioned above.

Regards,
	Jörg Höhle.


