[elephant-devel] new elephant and unicode troubles

Sun Feb 25 19:28:53 UTC 2007

Character encodings and unicode give me headaches!

Thanks, Henrik, for pointing out that bug.  I should rename the UTF-8
function - to me it means UTF-8/16le/32le read-compatible, assuming
that the underlying lisp is using ASCII or Unicode character codes.

Ties, there was a legacy issue caused by memutil's commitment to
char *'s, which caused SBCL's alien package to barf on writing positive
#xFF values to buffer-streams (Marco wrote about this in October or
November).  I recently rewrote buffer-streams and memutil so that
everything is an unsigned byte, but forgot to propagate that to
unicode2.lisp so that we could encode 8-bit quantities in the 'utf-8'
encoding function.
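
To make the unsigned-byte point concrete, here is a minimal sketch of
what the 8-bit path amounts to.  It is not the code in unicode2.lisp:
the function name encode-8bit-string is made up for illustration, and a
plain (unsigned-byte 8) vector stands in for a buffer-stream.

```lisp
;; Hypothetical helper, not part of Elephant.  Encodes STRING as one
;; octet per character, or returns NIL if any char-code needs more
;; than 8 bits (the caller would then fall back to a wider encoding).
(defun encode-8bit-string (string)
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8))))
    (loop for i from 0 below (length string)
          do (let ((code (char-code (char string i))))
               ;; #xFF is the largest value an unsigned byte can hold.
               (when (> code #xff)
                 (return-from encode-8bit-string nil))
               (setf (aref octets i) code)))
    octets))
```

With a signed char buffer the codes from #x80 to #xFF are out of range,
which is the failure described above; with unsigned bytes they fit.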

I didn't think through the different code-char/char-code maps across  
lisps.  Does someone with more familiarity want to explain that  
concisely to me?  I'm very open to a better solution and there's no  
time like the present!

In fact, I just put all post-0.6.1 open issues into Trac
(http://trac.common-lisp.net/elephant).  Check out our new web pages too
(http://www.common-lisp.net/project/elephant/).

Would someone like to volunteer to write a ticket proposing what the
right string encoding/decoding solution should look like and/or put
up a Wiki page so we can have a discussion there documenting all the
good ideas?

There are four constraints driving an encoding/decoding solution:

1) performance - every slot access, every symbol and object
serialization and most cursor ops end up doing string encoding and
decoding.  The current approach should allow us to optimize the read
path.  While I haven't implemented this logic yet, for certain pairs
of encodings and lisps, fixed-length 8, 16 or 32 bit strings can be
cheaply copied into a known fixed-length internal format (bypassing
code-char), which should be very helpful for the decoding-dominated
cursor traversals (searching) that will be employed during querying.
(That said, serialization is a small cost now compared to doing
non-transactional reads/writes, where the sync to disk takes all the
time.)

2) Lisp and C - the encoded format must be parsable by the C sorting  
function used to compare on-disk BDB keys and values for BTree  
searching & insertion.  Currently we use existing C library functions  
and some custom functions in libmemutil.c to do character-by-character
encoding.  This will get more complicated if we use
variable-length encodings (proper UTF-8) or other encodings that are
incompatible with the C functions.  I'm happy to do or support this  
work if it leads to a more elegant overall solution.

3) clarity - all the conditionals and complex dependencies
determining a particular binary format were giving me headaches, and
were the primary reason I did the rewrite.

4) compatibility - It would be useful if lisps could read other
lisps' databases, but not critical.  Perhaps tagging strings with the
lisp version that created them and the external format would be a
reasonable compromise, so you'd get an incompatible-encoding condition
instead of a corrupted string?
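
As a rough sketch of the tagging idea in (4), assuming an in-memory
struct rather than a real on-disk layout: every name below
(tagged-string, tag-string, untag-string, the :char-code-8 keyword and
the condition's slots) is hypothetical, and the encoder assumes all
char-codes fit in 8 bits.

```lisp
;; Hypothetical sketch; none of these names exist in Elephant.
(define-condition incompatible-encoding (error)
  ((expected :initarg :expected :reader expected-tag)
   (found    :initarg :found    :reader found-tag))
  (:report (lambda (c stream)
             (format stream "String was written as ~S but is being read as ~S."
                     (found-tag c) (expected-tag c)))))

(defstruct tagged-string
  writer     ; which lisp wrote the string
  encoding   ; keyword naming the external format
  octets)    ; the encoded bytes

(defun tag-string (string)
  "Encode STRING as 8-bit char codes and record who wrote it."
  (make-tagged-string
   :writer (lisp-implementation-type)
   :encoding :char-code-8
   :octets (map '(vector (unsigned-byte 8)) #'char-code string)))

(defun untag-string (tagged &key (reader (lisp-implementation-type))
                                 (encoding :char-code-8))
  "Decode TAGGED, signalling INCOMPATIBLE-ENCODING on a tag mismatch."
  (unless (and (equal (tagged-string-writer tagged) reader)
               (eq (tagged-string-encoding tagged) encoding))
    (error 'incompatible-encoding
           :expected (list reader encoding)
           :found (list (tagged-string-writer tagged)
                        (tagged-string-encoding tagged))))
  (map 'string #'code-char (tagged-string-octets tagged)))
```

The point is only that a mismatch surfaces as a condition the caller
can handle, rather than as a silently garbled string; how compactly
the tag could be stored next to each string is a separate question.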

Ties, I'll look into your circular dependency problem a little later  
today.

Ian

On Feb 25, 2007, at 7:58 AM, Ties Stuij wrote:

>> Hi Ties! Nothing beats a Sunday morning bughunt!
> Ain't that the truth!
>
> to clarify:
> as i understand it, the code just wants to circumvent the whole
> implementation-dependent unicode troubles by relying on char codes.
> This is fine, but it would mean that elephant DBs encoded by one
> implementation in one character set cannot be properly decoded by
> other implementations, or by the same implementation set to another
> encoding type.  Also, this forces a variable-width format to a fixed
> byte length, doubling strings in length if just one char is above
> the mean, so to speak.
>
> I for one would rather reintroduce the unicode troubles by dispatching
> on string-type and then using something similar to, if not exactly the
> same as, arnesi's string-to-octets and octets-to-string to encode and
> decode the string to bytes instead of using char-code and code-char.
> This shouldn't be too hard to implement and would have the added bonus
> of making the code a lot cleaner.
>
> But then of course i just don't have the time atm to implement it, and
> it doesn't have so much priority for me, but if you wait half a year
> i'll send a patch.
>
> In the meantime i would, in addition to Henrik's suggestion, suggest
> raising the upper limit for char-codes in serialize-to-utf8 from x7f
> to xff.  E.g. this part:
>
>           (etypecase string
>              (simple-string
>               (loop for i fixnum from 0 below characters do
>                    (let ((code (char-code (schar string i))))
>                      (declare (type fixnum code))
>                      (when (> code #xff) (fail))
>                      (setf (uffi:deref-array buffer
>                                              'array-or-pointer-char
>                                              (+ i size))
>                            code))))
>              (string
>               (loop for i fixnum from 0 below characters do
>                    (let ((code (char-code (char string i))))
>                      (declare (type fixnum code))
>                      (when (> code #xff) (fail))
>                      (setf (uffi:deref-array buffer
>                                              'array-or-pointer-char
>                                              (+ i size))
>                            code)))))
>
> this will reduce the times you would need the two-byte encoding to
> almost never from where i'm sitting, and i don't see the point in
> having an upper limit of x7f. If your lisp doesn't support char-codes
> higher than x7f it won't try to encode them, and why not use all the
> room the byte has to offer. Or am i missing something? That's usually
> the case.
>
> greets,
> Ties