[cffi-devel] a thought on string encodings

Tue Jan 3 13:24:55 UTC 2006

Hi,
James Bielman wrote:
>I think CLISP has enough to implement these fairly efficiently,
The devil is in the details, namely the many keyword arguments...
For example, accepting :offset everywhere means that the CLISP code must allow for a pointer addition instead of using the pointer directly. But it's ok to leave it in.

For example, I somewhat dislike LispWorks' convert-from-foreign-string because it takes both :length and :null-terminated-p keys, which are not orthogonal.
:length nil :null-terminated-p nil is an error.
(Perhaps I should also dislike the ANSI-CL functions with :test and :test-not etc.?)
Plenty of keyword arguments tend to make implementations clumsy.
Also, they tend to bury the basic low-level functionality under the nice-to-have extras (cf. cffi:foreign-alloc).

Most of your proposal seems in line with ideas I had for APIs for CLISP, however the following struck me:

>Function: FOREIGN-STRING-TO-LISP pointer encoding &key start end
>Convert octets from START to END (octet indices)
I think
a) you should leave :start and :end to Lisp vectors
   and stick to :offset with foreign pointers
b) octet indices will be unnatural and thus error-prone for people working mostly on MS-Windows, who constantly manipulate UTF-16 strings.

I suggest you use
:offset (in bytes, as usual) and
:count (also in bytes/octets) instead.
Unlike the keys of LispWorks' function, these are orthogonal.
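
To illustrate what I mean by orthogonal, here is a minimal sketch. It is not CFFI or CLISP code: MEMORY is a plain Lisp octet vector standing in for foreign memory, and DECODE-LATIN1 is a name I made up for the example. Every combination of the two keys is meaningful, so there is no error case like :length nil :null-terminated-p nil.

```lisp
;; Hypothetical sketch: MEMORY is a Lisp octet vector standing in for
;; foreign memory. DECODE-LATIN1 is not part of CFFI or CLISP.
(defun decode-latin1 (memory &key (offset 0) count)
  "Decode ISO-8859-1 octets of MEMORY starting at OFFSET (in octets).
COUNT is the number of octets to decode; when NIL, decoding stops at
the first zero octet at or after OFFSET."
  (let* ((end (if count
                  (+ offset count)
                  (or (position 0 memory :start offset)
                      (error "No null terminator found after offset ~D."
                             offset))))
         (result (make-string (- end offset))))
    ;; ISO-8859-1 maps each octet to the character with the same code.
    (loop for i from offset below end
          for j from 0
          do (setf (char result j) (code-char (aref memory i))))
    result))
```

With the octets of "Hi", a zero, "Bye" and a zero in MEMORY, (decode-latin1 memory :offset 3) gives "Bye" and (decode-latin1 memory :offset 3 :count 2) gives "By".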

>Function: LISP-STRING-TO-FOREIGN string encoding &key start end
>Function: LISP-STRING-OCTET-LENGTH string encoding &key start end
:start + :end consistently provided for Lisp strings is good.

>Function: FOREIGN-STRING-LENGTH pointer encoding
>This should be smart enough to look for 8-bit vs 16-bit null
>terminators, as appropriate for the encoding.
TRT
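
For the record, a rough sketch of what such a search could look like. Again, MEMORY is a Lisp octet vector standing in for foreign memory, both function names are hypothetical, and the list of encodings is only an assumption for the example. The scan proceeds in steps of the terminator width, so a zero octet that is half of a 16-bit character is not mistaken for the terminator.

```lisp
;; Hypothetical sketch, not CFFI code. The encoding names and widths
;; listed here are assumptions chosen for the example.
(defun terminator-width (encoding)
  "Return the size in octets of the null terminator used by ENCODING."
  (ecase encoding
    ((:ascii :iso-8859-1 :utf-8) 1)
    ((:utf-16 :ucs-2) 2)
    ((:utf-32 :ucs-4) 4)))

(defun octet-length-to-terminator (memory encoding &key (offset 0))
  "Return the number of octets between OFFSET and the null terminator.
MEMORY is scanned in units of the terminator width of ENCODING."
  (let ((width (terminator-width encoding)))
    (loop for i from offset by width
          while (<= (+ i width) (length memory))
          ;; A terminator is WIDTH consecutive zero octets on a unit boundary.
          when (loop for k from i below (+ i width)
                     always (zerop (aref memory k)))
            return (- i offset)
          finally (error "No ~D-octet null terminator found." width))))
```

On the little-endian UTF-16 octets of "Hi" (72 0 105 0 0 0) this returns 4 for :utf-16, whereas an 8-bit search would already stop after 1 octet.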

>I don't think it will support much beyond :ascii or :iso-8859-1 in
>non-Unicode Lisps---I don't want to encumber CFFI with a bunch of
>character code tables.
I agree, but
You don't have users in Asia yet, do you?
In other words, you should provide for extensibility. One way would be to pass unrecognized encoding designators through to the underlying Lisp (which would make it difficult to find out the terminator width, i.e. the number of zero octets, for those functions of yours that need it, and there are possibly other quirks lurking around the corner).
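
Another conceivable way, sketched below purely as an illustration, is a small registry in which users declare the terminator width of encodings that CFFI itself knows nothing about, while the actual conversion is left to the underlying Lisp. All names here are hypothetical; nothing like this exists in the proposal.

```lisp
;; Hypothetical sketch of an extensible registry. None of these names
;; are part of CFFI or of the proposal under discussion.
(defvar *encoding-terminator-widths* (make-hash-table :test #'eq)
  "Maps encoding designators (keywords) to terminator widths in octets.")

(defun register-encoding (designator terminator-width)
  "Declare that DESIGNATOR uses a null terminator of TERMINATOR-WIDTH octets."
  (setf (gethash designator *encoding-terminator-widths*) terminator-width))

(defun encoding-terminator-width (designator)
  "Return the registered terminator width of DESIGNATOR, or signal an error."
  (or (gethash designator *encoding-terminator-widths*)
      (error "Unknown encoding ~S: its terminator width is not registered."
             designator)))

;; The library would pre-register the few encodings it handles itself.
(register-encoding :ascii 1)
(register-encoding :iso-8859-1 1)
```

A user in need of, say, EUC-JP would then call (register-encoding :euc-jp 1) once and rely on the Lisp implementation for the conversion itself.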

>Function: LISP-STRING-TO-FOREIGN string encoding &key buffer
>BUFFER must be large enough to accommodate the foreign
>string---this can be queried with LISP-STRING-OCTET-LENGTH.
This will lead to double work being done by mbslen() et al. on behalf of people calling first *-length, then *-to-foreign.
While I see the need to be able to write directly into a buffer area, there ought to be better APIs than this one.
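
One possible shape, offered only as a sketch and not as a worked-out proposal, follows the convention of C's snprintf(): write as much as fits and return the size that the full result needs. The common case then costs a single pass, and only callers whose buffer was too small pay for a second one. BUFFER is a Lisp octet vector standing in for foreign memory, ENCODE-UTF-8-INTO is a made-up name, and the code assumes that CHAR-CODE yields Unicode code points (no surrogate handling). Note that it may leave a truncated multi-octet sequence at the end of a buffer that is too small.

```lisp
;; Hypothetical sketch, not CFFI code. Assumes CHAR-CODE returns Unicode
;; code points; Lisps with 16-bit characters would need surrogate handling.
(defun encode-utf-8-into (string buffer)
  "Write the UTF-8 encoding of STRING into the octet vector BUFFER, as far
as it fits. Return the number of octets the complete encoding needs, so
the caller must retry only when the result exceeds (LENGTH BUFFER)."
  (let ((needed 0)
        (size (length buffer)))
    (flet ((emit (octet)
             ;; Store the octet if there is room, but always count it.
             (when (< needed size)
               (setf (aref buffer needed) octet))
             (incf needed)))
      (loop for char across string
            for code = (char-code char)
            do (cond ((< code #x80)
                      (emit code))
                     ((< code #x800)
                      (emit (logior #xC0 (ash code -6)))
                      (emit (logior #x80 (logand code #x3F))))
                     ((< code #x10000)
                      (emit (logior #xE0 (ash code -12)))
                      (emit (logior #x80 (logand (ash code -6) #x3F)))
                      (emit (logior #x80 (logand code #x3F))))
                     (t
                      (emit (logior #xF0 (ash code -18)))
                      (emit (logior #x80 (logand (ash code -12) #x3F)))
                      (emit (logior #x80 (logand (ash code -6) #x3F)))
                      (emit (logior #x80 (logand code #x3F)))))))
    needed))
```

Called with "abc" and a one-octet buffer, it stores the octet 97 and returns 3, which tells the caller how large a buffer to bring for the retry.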

Regards,
	Jorg Hohle.