[elephant-devel] new elephant and unicode troubles
Ian Eslick
eslick at csail.mit.edu
Sun Feb 25 19:28:53 UTC 2007
Character encodings and Unicode give me headaches!
Thanks, Henrik, for pointing out that bug. I should rename the UTF-8
function - to me it means UTF-8/16le/32le read-compatible, assuming
that the underlying lisp uses ASCII or Unicode character codes.
Ties, there was a legacy issue caused by memutil's commitment to
char*'s, which caused SBCL's alien package to barf on writing positive
#xFF values to buffer-streams (Marco wrote about this in October or
November). I recently rewrote buffer-streams and memutil so that
everything is unsigned-byte, but forgot to propagate that to
unicode2.lisp so that we could encode 8-bit quantities in the 'utf-8'
encoding function.
I didn't think through the different code-char/char-code maps across
lisps. Does someone with more familiarity want to explain that
concisely to me? I'm very open to a better solution and there's no
time like the present!
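To make the question concrete, the kind of divergence I mean looks
something like this (a quick sketch, not checked on every lisp;
results are implementation-dependent):

  (char-code #\a)    ; => 97 everywhere, since ASCII is common ground
  (code-char #x3BB)  ; => #\GREEK_SMALL_LETTER_LAMDA on a Unicode SBCL
  (code-char #x3BB)  ; => NIL (or a type error) on an 8-bit lisp
                     ;    where char-code-limit is 256

So the same stored code points can round-trip cleanly on one lisp and
fail or decode to garbage on another.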
In fact, I just put all post-0.6.1 open issues into Trac
(http://trac.common-lisp.net/elephant). Check out our new web pages
too (http://www.common-lisp.net/project/elephant/).
Would someone like to volunteer to write a ticket proposing what the
right string encoding/decoding solution should look like, and/or to
put up a Wiki page so we can have a discussion there documenting all
the good ideas?
There are four constraints driving an encoding/decoding solution:
1) performance - every slot access, every symbol and object
serialization, and most cursor ops end up doing string encoding and
decoding. The current approach should allow us to optimize the read
path. While I haven't implemented this logic yet, for certain pairs
of encodings and lisps, fixed-length 8-, 16-, or 32-bit strings can
be cheaply copied into a known fixed-length internal format
(bypassing code-char), which should be very helpful for the
decoding-dominated cursor traversals (searching) that will be
employed during querying; there's a sketch of this fast path after
the list. (That said, serialization is a small cost now compared to
doing non-transactional reads/writes, where the sync to disk takes
all the time.)
2) Lisp and C - the encoded format must be parsable by the C sorting
function used to compare on-disk BDB keys and values for BTree
searching & insertion. Currently we use existing C library functions
and some custom functions in libmemutil.c to do character-by-character
encoding. This will get more complicated if we use variable-length
encodings (proper UTF-8) or other encodings that are incompatible
with those C functions. I'm happy to do or support this work if it
leads to a more elegant overall solution.
3) clarity - all the conditionals and complex dependencies
determining a particular binary format were giving me headaches;
that was the primary reason I did the rewrite.
4) compatibility - it would be useful, though not critical, if lisps
could read other lisps' databases. Perhaps tagging strings with the
lisp version that created them and the external format would be a
reasonable compromise, so you'd get an incompatible-encoding
condition instead of a corrupted string? (There's a rough sketch of
that idea below as well.)
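Here's a rough sketch of the read fast path mentioned in 1)
(hypothetical names throughout; it assumes the stored format is one
byte per character and that those bytes match the lisp's own
character codes, as Latin-1 does on most Unicode lisps):

  (defun fast-decode-8bit (buffer start length)
    ;; Copy stored bytes straight into a lisp string. For codes
    ;; below #x100 on a Unicode lisp, code-char is effectively the
    ;; identity map, so a real implementation could replace this
    ;; loop with a block copy and skip per-character work entirely.
    (declare (type (simple-array (unsigned-byte 8) (*)) buffer)
             (type fixnum start length))
    (let ((out (make-string length)))
      (dotimes (i length out)
        (setf (schar out i)
              (code-char (aref buffer (+ start i)))))))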
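And the tagging compromise from 4) might look something like this
(purely a sketch; none of these names exist in the code today):

  (define-condition incompatible-encoding (error)
    ((stored-format :initarg :stored-format :reader stored-format)
     (native-format :initarg :native-format :reader native-format))
    (:report (lambda (c s)
               (format s "String was stored as ~A but this lisp expects ~A."
                       (stored-format c) (native-format c)))))

  (defun check-string-tag (stored-format native-format)
    ;; signal a condition instead of silently decoding garbage
    (unless (eq stored-format native-format)
      (error 'incompatible-encoding
             :stored-format stored-format
             :native-format native-format)))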
Ties, I'll look into your circular dependency problem a little later
today.
Ian
On Feb 25, 2007, at 7:58 AM, Ties Stuij wrote:
>> Hi Ties! Nothing beats a Sunday morning bug hunt!
> Ain't that the truth!
>
> to clarify:
> as I understand it, the code just wants to circumvent the whole
> implementation-dependent Unicode trouble by relying on char codes.
> This is fine, but it means that Elephant databases encoded by one
> implementation in one character set cannot be properly decoded by
> other implementations, or by the same implementation set to a
> different encoding. It also forces a variable-width format into a
> fixed byte length, doubling a string's length if just one char is
> above the mean, so to speak.
>
> I for one would rather reintroduce the Unicode troubles by dispatching
> on string type and then using something similar to, if not exactly the
> same as, arnesi's string-to-octets and octets-to-string to encode and
> decode the string to bytes instead of using char-code and code-char.
> This shouldn't be too hard to implement and would have the added bonus
> of making the code a lot cleaner.
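>
> Roughly what I mean, just as a sketch (on SBCL you could lean on
> the built-ins; the arnesi calls are from memory, so treat those
> exact signatures as assumptions):
>
> (defun encode-string (string)
>   ;; dispatch to a real external-format encoder instead of char-code
>   #+sbcl (sb-ext:string-to-octets string :external-format :utf-8)
>   #-sbcl (arnesi:string-to-octets string :utf-8))
>
> (defun decode-string (octets)
>   #+sbcl (sb-ext:octets-to-string octets :external-format :utf-8)
>   #-sbcl (arnesi:octets-to-string octets :utf-8))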
>
> But then of course I just don't have the time atm to implement it, and
> it doesn't have that much priority for me, but if you wait half a year
> I'll send a patch.
>
> In the meantime, in addition to Henrik's suggestion, I would suggest
> raising the upper limit for char codes in serialize-to-utf8 from #x7f
> to #xff. E.g. this part:
>
> (etypecase string
>   (simple-string
>    (loop for i fixnum from 0 below characters do
>         (let ((code (char-code (schar string i))))
>           (declare (type fixnum code))
>           (when (> code #xff) (fail))
>           (setf (uffi:deref-array buffer 'array-or-pointer-char (+ i size))
>                 code))))
>   (string
>    (loop for i fixnum from 0 below characters do
>         (let ((code (char-code (char string i))))
>           (declare (type fixnum code))
>           (when (> code #xff) (fail))
>           (setf (uffi:deref-array buffer 'array-or-pointer-char (+ i size))
>                 code)))))
>
> this will reduce the times you would need the two-byte encoding to
> almost never, from where I'm sitting, and I don't see the point in
> having an upper limit of #x7f. If your lisp doesn't support char
> codes higher than #x7f it won't try to encode them, so why not use
> all the room the byte has to offer? Or am I missing something?
> That's usually the case.
>
> greets,
> Ties
> _______________________________________________
> elephant-devel site list
> elephant-devel at common-lisp.net
> http://common-lisp.net/mailman/listinfo/elephant-devel