[elephant-devel] new elephant and unicode troubles
Ian Eslick
eslick at csail.mit.edu
Sun Feb 25 19:53:12 UTC 2007
I should clarify - Trac has a built-in wiki. If you have a
common-lisp.net account I can add you to our Trac system so you can edit
the wiki pages. (Anyone can submit and comment on tickets, however.)
Ian
On Feb 25, 2007, at 2:28 PM, Ian Eslick wrote:
> Character encodings and unicode give me headaches!
>
> Thanks Henrik for pointing out that bug. I should rename the UTF-8
> function - to me it means UTF-8/16le/32le read-compatible, assuming
> that the underlying lisp is using ASCII or Unicode character codings.
>
> Ties, there was a legacy issue caused by a memutil commitment to
> char *'s which caused SBCL's alien package to barf on writing
> positive #xFF values to buffer-streams (Marco wrote about this in
> October or November). I recently rewrote buffer-streams and
> memutil so that everything was unsigned byte but forgot to
> propagate that to unicode2.lisp so we could encode 8-bit quantities
> in the 'utf-8' encoding function.
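
A small illustration of the signed/unsigned mismatch described above. It is a sketch, not the actual memutil or buffer-streams code: #xFF fits an unsigned byte but not a signed one, which is presumably what a char-typed foreign buffer objected to. All three function names are hypothetical.

```lisp
;; #xFF is a valid (unsigned-byte 8) but lies outside (signed-byte 8),
;; whose range is -128..127.
(defun fits-signed-byte-p (n)
  (typep n '(signed-byte 8)))

(defun fits-unsigned-byte-p (n)
  (typep n '(unsigned-byte 8)))

;; Hypothetical helper: reinterpret an octet as a signed 8-bit value,
;; the kind of conversion a signed char buffer would otherwise need.
(defun octet-as-signed (octet)
  (if (>= octet 128)
      (- octet 256)
      octet))
```

Declaring the buffer as unsigned bytes throughout avoids the conversion entirely.
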
>
> I didn't think through the different code-char/char-code maps
> across lisps. Does someone with more familiarity want to explain
> that concisely to me? I'm very open to a better solution and
> there's no time like the present!
>
> In fact, I just put all post-0.6.1 open issues into Trac
> (http://trac.common-lisp.net/elephant). Check out our new web pages too
> (http://www.common-lisp.net/project/elephant/).
>
> Would someone like to volunteer to write a ticket proposing what
> the right string encoding/decoding solution should look like and/or
> putting up a Wiki page so we can have a discussion there
> documenting all the good ideas?
>
> There are four constraints driving an encoding/decoding solution:
>
> 1) performance - every slot access, every symbol and object
> serialization and most cursor ops end up doing string encoding and
> decoding. The current approach should allow us to optimize the
> read path. While I haven't implemented this logic yet, for certain
> pairs of encodings and lisps, fixed-length 8-, 16- or 32-bit strings
> can be cheaply copied into a known fixed-length internal format,
> (bypassing code-char) which should be very helpful for decoding-
> dominated cursor traversals (searching) that will be employed
> during querying. (That said, serialization is a small cost now
> compared to doing non-transactional read/writes where the sync to
> disk takes all the time).
>
> 2) Lisp and C - the encoded format must be parsable by the C
> sorting function used to compare on-disk BDB keys and values for
> BTree searching & insertion. Currently we use existing C library
> functions and some custom functions in libmemutil.c to do character
> by character encoding. This will get more complicated if we use
> variable-length encodings (proper UTF8) or other encodings that are
> incompatible with the C functions. I'm happy to do or support this
> work if it leads to a more elegant overall solution.
>
> 3) clarity - all the conditionals and complex dependencies
> determining a particular binary format were giving me headaches,
> and were the primary reason I did the rewrite.
>
> 4) compatibility - It would be useful if lisps could read other
> lisps' databases, but not critical. Perhaps tagging strings with
> the lisp version that created them and the external format would be a
> reasonable compromise so you'd get an incompatible-encoding
> condition instead of a corrupted string?
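
The tagging idea in point 4 could look roughly like the following. This is a minimal in-memory sketch under stated assumptions, not Elephant code: the condition name follows the wording above, but tag-octets, untag-octets, the tag layout and the accessor names are all hypothetical, and a real serializer would write the tag as bytes in front of the string data.

```lisp
;; All names below are hypothetical; none of them come from Elephant.
(define-condition incompatible-encoding (error)
  ((expected :initarg :expected :reader expected-tag)
   (found :initarg :found :reader found-tag))
  (:report (lambda (condition stream)
             (format stream "String was written as ~S but this image reads ~S."
                     (found-tag condition) (expected-tag condition)))))

(defun tag-octets (tag octets)
  "Pair the encoded OCTETS with TAG, e.g. (:sbcl :latin-1)."
  (cons tag octets))

(defun untag-octets (expected-tag tagged)
  "Return the octets in TAGGED, or signal INCOMPATIBLE-ENCODING if its
tag differs from EXPECTED-TAG."
  (let ((found (car tagged))
        (octets (cdr tagged)))
    (unless (equal found expected-tag)
      (error 'incompatible-encoding :expected expected-tag :found found))
    octets))
```

A reader that hits a foreign tag then gets a condition it can handle or report, rather than silently decoding garbage.
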
>
> Ties, I'll look into your circular dependency problem a little
> later today.
>
> Ian
>
> On Feb 25, 2007, at 7:58 AM, Ties Stuij wrote:
>
>>> Hi Ties! Nothing beats a sunday morning bughunt!
>> Ain't that the truth!
>>
>> to clarify:
>> as i understand it, the code just wants to circumvent the whole
>> implementation-dependent unicode troubles by relying on char codes.
>> This is fine, but it means that elephant databases encoded by one
>> implementation in one character set cannot be properly decoded by
>> other implementations, or by the same implementation set to another
>> encoding. It also forces a variable-width format into a fixed byte
>> length, doubling the length of a string if just one char is above
>> the one-byte limit, so to speak.
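
To make the size cost of a fixed-width format concrete, here is a rough sketch. The function name and the width thresholds (#xFF for one byte, #xFFFF for two, otherwise four) are assumptions for illustration and are not taken from Elephant's serializer.

```lisp
;; Hypothetical helper: the number of octets a fixed-width encoding needs
;; when the widest character in STRING decides the width of every character.
(defun fixed-width-octet-count (string)
  (let ((max-code (reduce #'max string :key #'char-code :initial-value 0)))
    (* (length string)
       (cond ((<= max-code #xff) 1)
             ((<= max-code #xffff) 2)
             (t 4)))))
```

Under these assumptions a single character above #xFF doubles the size of an otherwise one-byte string, which is the effect described above.
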
>>
>> I for one would rather reintroduce the unicode troubles by dispatching
>> on string-type and then using something similar to, if not exactly the
>> same as, arnesi's string-to-octets and octets-to-string to encode and
>> decode the string to bytes instead of using char-code and code-char.
>> This shouldn't be too hard to implement and would have the added bonus
>> of making the code a lot cleaner.
>>
>> But then of course i just don't have the time atm to implement it, and
>> it doesn't have much priority for me, but if you wait half a year
>> i'll send a patch.
>>
>> In the meantime i would, in addition to Henrik's suggestion, suggest
>> raising the upper limit for char-codes in serialize-to-utf8 from x7f
>> to xff. E.g. this part:
>>
>> (etypecase string
>>   (simple-string
>>    (loop for i fixnum from 0 below characters do
>>      (let ((code (char-code (schar string i))))
>>        (declare (type fixnum code))
>>        (when (> code #xff) (fail))
>>        (setf (uffi:deref-array buffer
>>                                'array-or-pointer-char (+ i size)) code))))
>>   (string
>>    (loop for i fixnum from 0 below characters do
>>      (let ((code (char-code (char string i))))
>>        (declare (type fixnum code))
>>        (when (> code #xff) (fail))
>>        (setf (uffi:deref-array buffer
>>                                'array-or-pointer-char (+ i size)) code)))))
>>
>> this will reduce the times you would need the two-byte encoding to
>> almost never from where i'm sitting, and i don't see the point in
>> having an upper limit of x7f. If your lisp doesn't support char-codes
>> higher than x7f it won't try to encode them, and why not use all the
>> room the byte has to offer. Or am i missing something? That's usually
>> the case.
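
The one-byte path with the raised limit can be tried without a foreign buffer. The following is a portable sketch, not the serializer itself: encode-string-as-octets is a hypothetical name, it writes into a plain Lisp octet vector instead of a UFFI array, and it returns NIL where the original calls (fail).

```lisp
;; Hypothetical helper, not Elephant's actual serializer: it writes into a
;; plain octet vector instead of a UFFI foreign buffer.
(defun encode-string-as-octets (string)
  "Return a vector holding one octet per character of STRING, or NIL if
any character code exceeds #xFF and a wider encoding is needed."
  (let ((octets (make-array (length string)
                            :element-type '(unsigned-byte 8))))
    (dotimes (i (length string) octets)
      (let ((code (char-code (char string i))))
        (when (> code #xff)
          (return nil))
        (setf (aref octets i) code)))))
```

With the limit at #xFF, any string whose char codes all fit in a byte stays in the one-byte format, and the caller only falls back to a wider encoding when NIL comes back.
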
>>
>> greets,
>> Ties
>> _______________________________________________
>> elephant-devel site list
>> elephant-devel at common-lisp.net
>> http://common-lisp.net/mailman/listinfo/elephant-devel
>