[elephant-devel] new elephant and unicode troubles

Ian Eslick eslick at csail.mit.edu
Sun Feb 25 19:53:12 UTC 2007


I should clarify - Trac has a built-in wiki.  If you have a
common-lisp.net account I can add you to our Trac system so you can edit
the wiki pages.  (Anyone can submit and comment on tickets, however.)

Ian

On Feb 25, 2007, at 2:28 PM, Ian Eslick wrote:

> Character encodings and unicode give me headaches!
>
> Thanks Henrik for pointing out that bug.  I should rename the UTF-8  
> function - to me it means UTF-8/16le/32le read-compatible, assuming  
> that the underlying lisp is using ASCII or Unicode character codings.
>
> Ties, there was a legacy issue caused by a memutil commitment to  
> char *'s which caused SBCL's alien package to barf on writing  
> positive #xFF values to buffer-streams (Marco wrote about this in  
> October or November).  I recently rewrote buffer-streams and  
> memutil so that everything was unsigned byte but forgot to  
> propagate that to unicode2.lisp so we could encode 8-bit quantities  
> in the 'utf-8' encoding function.
>
> I didn't think through the different code-char/char-code maps  
> across lisps.  Does someone with more familiarity want to explain  
> that concisely to me?  I'm very open to a better solution and  
> there's no time like the present!
>
> In fact, I just put all post-0.6.1 open issues into Trac
> (http://trac.common-lisp.net/elephant).  Check out our new web pages too
> (http://www.common-lisp.net/project/elephant/).
>
> Would someone like to volunteer to write a ticket proposing what  
> the right string encoding/decoding solution should look like and/or  
> putting up a Wiki page so we can have a discussion there  
> documenting all the good ideas?
>
> There are four constraints driving an encoding/decoding solution:
>
> 1) performance - every slot access, every symbol and object  
> serialization and most cursor ops end up doing string encoding and  
> decoding.  The current approach should allow us to optimize the  
> read path. While I haven't implemented this logic yet, for certain
> pairs of encodings and lisps, fixed-length 8, 16 or 32 bit strings
> can be cheaply copied into a known fixed-length internal format
> (bypassing code-char), which should be very helpful for the
> decoding-dominated cursor traversals (searching) that will be employed
> during querying.  (That said, serialization is a small cost now
> compared to doing non-transactional read/writes where the sync to  
> disk takes all the time).
>
> 2) Lisp and C - the encoded format must be parsable by the C  
> sorting function used to compare on-disk BDB keys and values for  
> BTree searching & insertion.  Currently we use existing C library  
> functions and some custom functions in libmemutil.c to do character  
> by character encoding.  This will get more complicated if we use  
> variable-length encodings (proper UTF8) or other encodings that are  
> incompatible with the C functions.  I'm happy to do or support this  
> work if it leads to a more elegant overall solution.
>
> 3) clarity - all the conditionals and complex dependencies
> determining a particular binary format were giving me headaches,
> and were the primary reason I did the rewrite.
>
> 4) compatibility - It would be useful if lisps could read other
> lisps' databases, but not critical.  Perhaps tagging each string with
> the lisp version that created it and the external format would be a
> reasonable compromise so you'd get an incompatible-encoding
> condition instead of a corrupted string?
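
As a rough illustration of the tagging idea in constraint 4, here is a minimal sketch in portable Common Lisp. It is not Elephant code: the condition accessors, *encoding-tag*, tag-octets and untag-octets are hypothetical names, and a single leading byte stands in for the "lisp version plus external format" tag mentioned above.

```lisp
;; Hypothetical sketch, not part of Elephant: prefix encoded strings with a
;; one-byte tag and refuse to decode data written under a different tag.
(define-condition incompatible-encoding (error)
  ((expected :initarg :expected :reader expected-tag)
   (found :initarg :found :reader found-tag))
  (:report (lambda (condition stream)
             (format stream "String was written with encoding tag ~S, ~
                             but this image expects ~S."
                     (found-tag condition) (expected-tag condition)))))

;; Assumed value: a real tag would have to identify both the lisp
;; implementation and the external format it used.
(defparameter *encoding-tag* 1)

(defun tag-octets (octets &optional (tag *encoding-tag*))
  "Return a fresh octet vector holding TAG followed by OCTETS."
  (concatenate '(vector (unsigned-byte 8)) (vector tag) octets))

(defun untag-octets (tagged &optional (expected *encoding-tag*))
  "Return the payload of TAGGED, or signal INCOMPATIBLE-ENCODING."
  (let ((found (aref tagged 0)))
    (unless (= found expected)
      (error 'incompatible-encoding :expected expected :found found))
    (subseq tagged 1)))
```

With this in place a reader with a mismatched tag gets a condition it can handle instead of a silently corrupted string. The cost is one extra byte per string, and the C comparison function would have to skip or account for that byte.
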
>
> Ties, I'll look into your circular dependency problem a little  
> later today.
>
> Ian
>
> On Feb 25, 2007, at 7:58 AM, Ties Stuij wrote:
>
>>> Hi Ties! Nothing beats a sunday morning bughunt!
>> Ain't that the truth!
>>
>> to clarify:
>> as i understand it, the code just wants to circumvent the whole
>> implementation-dependent unicode troubles by relying on char codes.
>> This is fine, but it means that elephant DBs written by one
>> implementation in one character set cannot be properly decoded by
>> other implementations, or by the same implementation set to a
>> different encoding. It also forces a variable-width format to a
>> fixed byte length, doubling the length of a string if just one char
>> is above the single-byte limit, so to speak.
>>
>> I for one would rather reintroduce the unicode troubles by dispatching
>> on string-type and then using something similar to, if not exactly the
>> same as, arnesi's string-to-octets and octets-to-string to encode and
>> decode the string to bytes instead of using char-code and code-char.
>> This shouldn't be too hard to implement and would have the added bonus
>> of making the code a lot cleaner.
>>
>> But then of course i just don't have the time atm to implement it, and
>> it doesn't have much priority for me, but if you wait half a year
>> i'll send a patch.
>>
>> In the meantime i would, in addition to Henrik's suggestion, suggest
>> raising the upper limit for char-codes in serialize-to-utf8 from #x7f
>> to #xff. E.g. this part:
>>
>>           (etypecase string
>>              (simple-string
>>               (loop for i fixnum from 0 below characters do
>>                    (let ((code (char-code (schar string i))))
>>                      (declare (type fixnum code))
>>                      (when (> code #xff) (fail))
>>                      (setf (uffi:deref-array buffer
>>                                              'array-or-pointer-char
>>                                              (+ i size))
>>                            code))))
>>              (string
>>               (loop for i fixnum from 0 below characters do
>>                    (let ((code (char-code (char string i))))
>>                      (declare (type fixnum code))
>>                      (when (> code #xff) (fail))
>>                      (setf (uffi:deref-array buffer
>>                                              'array-or-pointer-char
>>                                              (+ i size))
>>                            code)))))
>>
>> this will reduce the times you would need the two-byte encoding to
>> almost never from where i'm sitting, and i don't see the point in
>> having an upper limit of #x7f. If your lisp doesn't support char-codes
>> higher than #x7f it won't try to encode them, so why not use all the
>> room the byte has to offer? Or am i missing something? That's usually
>> the case.
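
To make the #xff suggestion concrete, here is a minimal sketch in portable Common Lisp rather than the actual Elephant serializer. The names encode-string-8bit and decode-string-8bit are hypothetical, a plain octet vector stands in for the UFFI buffer, and returning NIL stands in for the (fail) call that would fall through to a wider encoding.

```lisp
;; Hypothetical helpers, not part of Elephant: one octet per character,
;; accepting every char-code that fits in an unsigned byte (0 to #xff).
(defun encode-string-8bit (string)
  "Return an octet vector for STRING, or NIL if any char-code exceeds #xff."
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8))))
    (loop for i from 0 below (length string)
          for code = (char-code (char string i))
          do (when (> code #xff)
               ;; The caller would retry with a 16- or 32-bit format here.
               (return-from encode-string-8bit nil))
             (setf (aref octets i) code))
    octets))

(defun decode-string-8bit (octets)
  "Rebuild a string from OCTETS by mapping each octet through code-char."
  (let ((string (make-string (length octets))))
    (loop for i from 0 below (length octets)
          do (setf (char string i) (code-char (aref octets i))))
    string))
```

Decoding through code-char only round-trips when the reading lisp maps codes 0 to 255 to the same characters as the writing lisp, which is exactly the cross-lisp question raised earlier in this thread.
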
>>
>> greets,
>> Ties
>> _______________________________________________
>> elephant-devel site list
>> elephant-devel at common-lisp.net
>> http://common-lisp.net/mailman/listinfo/elephant-devel
>



