From juanjose.garciaripoll at googlemail.com  Mon Jan 12 12:56:03 2009
From: juanjose.garciaripoll at googlemail.com (Juan Jose Garcia-Ripoll)
Date: Mon, 12 Jan 2009 13:56:03 +0100
Subject: [cl-ppcre-devel] Questions regarding cl-unicode
In-Reply-To: <c159f9ab0901120356u7b73f59fiff39db81e52c4120@mail.gmail.com>
References: <c159f9ab0901120356u7b73f59fiff39db81e52c4120@mail.gmail.com>
Message-ID: <c159f9ab0901120456w50e05c09w821c36d579c9ec95@mail.gmail.com>

Hi,

I just subscribed to this mailing list, which I believe is not only
for cl-ppcre but also for cl-unicode. If I am wrong, please point me
in the right direction :-)

My name is Juanjo and I am the maintainer of ECL
(http://ecls.sourceforge.net). I am currently interested in completing
the support for Unicode in ECL which is, more or less, at the level of
what SBCL provides and, in my opinion, far from optimal.

I have been pondering several options, but all of them seem like
reinventing the wheel, so I finally came to the conclusion that the
most sensible strategy would be to turn cl-unicode into a full
(optional) replacement of the ANSI Common Lisp functions for dealing
with characters and strings, and hope that this would become a
de-facto standard. Perhaps that is too ambitious a goal, or maybe it
is even futile, given the level of adoption of Unicode among lispers.

My concerns now center on several questions.

1) Optimize the database information that is built into cl-unicode.
ECL currently uses the SBCL procedure for compressing the database and
I believe this can be optimized even further. Instead of binary trees
or hashes, that procedure yields a two-stage byte table that encodes
the currently 209 different combinations of properties. This is important
for ECL because we need it to stay lean and simple and because our
procedures for exporting data structures in compiled code are not
efficient, due to constraints in C compilers. One possibility is that
CL-UNICODE reuses the SBCL and ECL databases. Another possibility is to
look for even more efficient data structures.

2) Add support for the most important Unicode algorithms, which are
canonical decomposition of strings, string upper/lower/titlecasing,
and string collation. Ideally this should be transparently
incorporated into new Common-Lisp functions that can be used to
replace the old ones, such as char-upcase, string-equal, etc. Of
course, due to the differences between Unicode and ANSI CL, the
specifications would change.

3) Add support for the locales database provided by the Unicode
consortium. This is essential for implementing string collation, since
the ordering of characters is locale dependent.

4) Integration and shipping of cl-unicode with different
implementations, if possible. I would be interested in having
CL-UNICODE as a contributed package in the ECL source tree, so that it
can be activated with a simple configuration option. I believe there
are no license issues, and there is only the problem that CL-UNICODE
depends on CL-PPCRE (is this dependency essential? could it be
eliminated?)
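
To make the two-stage byte table from point 1 concrete, here is a
minimal sketch in Python. It is not the SBCL or ECL code and not part
of cl-unicode; Python is used only because its standard library ships
the Unicode database. The names (properties, build_tables, lookup),
the 256-code-point block size, and the choice of just two properties
are assumptions made for illustration, so the number of combinations
will differ from the 209 mentioned above.

```python
import unicodedata

BLOCK_BITS = 8                    # 256 code points per block (assumed size)
BLOCK_SIZE = 1 << BLOCK_BITS
CODE_POINT_LIMIT = 0x110000       # one past the last Unicode code point


def properties(cp):
    """Illustrative property combination: (general category, combining class)."""
    ch = chr(cp)
    return (unicodedata.category(ch), unicodedata.combining(ch))


def build_tables():
    """Return (stage1, stage2, combos) for the two-stage lookup."""
    combos = []        # small index -> property combination
    combo_ids = {}     # property combination -> small index
    stage2 = []        # unique blocks, one byte per code point
    block_ids = {}     # block contents -> position in stage2
    stage1 = []        # block number -> position in stage2

    for start in range(0, CODE_POINT_LIMIT, BLOCK_SIZE):
        ids = []
        for cp in range(start, start + BLOCK_SIZE):
            combo = properties(cp)
            if combo not in combo_ids:
                combo_ids[combo] = len(combos)
                combos.append(combo)
            ids.append(combo_ids[combo])
        block = bytes(ids)         # raises if more than 256 combinations exist
        if block not in block_ids: # identical blocks are stored only once
            block_ids[block] = len(stage2)
            stage2.append(block)
        stage1.append(block_ids[block])
    return stage1, stage2, combos


STAGE1, STAGE2, COMBOS = build_tables()


def lookup(cp):
    """Two array reads plus one read of the small combination table."""
    block = STAGE2[STAGE1[cp >> BLOCK_BITS]]
    return COMBOS[block[cp & (BLOCK_SIZE - 1)]]
```

With these tables, lookup(ord("A")) returns ('Lu', 0). The saving
comes from the large unassigned and uniform ranges, whose blocks are
identical and therefore shared; in C the two stages would be plain
constant arrays, which is the form that is cheap to export.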

Well, maybe this is all BS, but I would like to read your opinions on the topic.

Juanjo

--
Instituto de Física Fundamental, CSIC
c/ Serrano, 113b, Madrid 28009 (Spain)
http://juanjose.garciaripoll.googlepages.com

From edi at agharta.de  Wed Jan 14 20:42:34 2009
From: edi at agharta.de (Edi Weitz)
Date: Wed, 14 Jan 2009 21:42:34 +0100
Subject: [cl-ppcre-devel] Questions regarding cl-unicode
In-Reply-To: <c159f9ab0901120456w50e05c09w821c36d579c9ec95@mail.gmail.com>
References: <c159f9ab0901120356u7b73f59fiff39db81e52c4120@mail.gmail.com>
	<c159f9ab0901120456w50e05c09w821c36d579c9ec95@mail.gmail.com>
Message-ID: <b56421a00901141242u4c055aa9taaf3c50e323bf2c@mail.gmail.com>

On Mon, Jan 12, 2009 at 1:56 PM, Juan Jose Garcia-Ripoll
<juanjose.garciaripoll at googlemail.com> wrote:

> 4) Integration and shipping of cl-unicode with different
> implementations, if possible. I would be interested in having
> CL-UNICODE as a contributed package in the ECL source tree, so that it
> can be activated with a simple configuration option. I believe there
> are no license issues, and there is only the problem that CL-UNICODE
> depends on CL-PPCRE (is this dependency essential? could it be
> eliminated?)

Hi Juanjo,

Sorry for the delay.  It's fine with me if you distribute CL-UNICODE
with ECL and I also think there should be no licensing issues.

CL-PPCRE is used in a couple of places for parsing.  These could be
replaced with hand-crafted parsers, but it'd be a bit of work to do
that a) correctly, b) without blowing up the code base enormously, and
c) without significant sacrifices w.r.t. speed.  Having said that, I'm
open to accepting patches to get rid of this dependency... :)

Cheers,
Edi.


From juanjose.garciaripoll at googlemail.com  Thu Jan 15 10:07:47 2009
From: juanjose.garciaripoll at googlemail.com (Juan Jose Garcia-Ripoll)
Date: Thu, 15 Jan 2009 11:07:47 +0100
Subject: [cl-ppcre-devel] Questions regarding cl-unicode
In-Reply-To: <b56421a00901141242u4c055aa9taaf3c50e323bf2c@mail.gmail.com>
References: <c159f9ab0901120356u7b73f59fiff39db81e52c4120@mail.gmail.com>
	<c159f9ab0901120456w50e05c09w821c36d579c9ec95@mail.gmail.com>
	<b56421a00901141242u4c055aa9taaf3c50e323bf2c@mail.gmail.com>
Message-ID: <c159f9ab0901150207j459f5340i8681fbea0ecd998f@mail.gmail.com>

Hi Edi,

On Wed, Jan 14, 2009 at 9:42 PM, Edi Weitz <edi at agharta.de> wrote:
> Sorry for the delay.  It's fine with me if you distribute CL-UNICODE
> with ECL and I also think there should be no licensing issues.

Great to read that.

> CL-PPCRE is used in a couple of places for parsing.  These could be
> replaced with hand-crafted parsers, but it'd be a bit of work to do
> that a) correctly, b) without blowing up the code base enormously, and
> c) without significant sacrifices w.r.t. speed.  Having said that, I'm
> open to accepting patches to get rid of this dependency... :)

I think I will leave that for the end. I am now learning the
normalization algorithms to implement string equality comparisons
using cl-unicode as-is. Once it passes the test suites, I hope to move
to string collation and then look into things related to dependencies,
databases, etc. As I said, my goal is to integrate this in cl-unicode,
so if and when I get moving I will send patches either to you or to
the mailing list.
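
As a rough illustration of what normalization-based string equality
means, here is a minimal sketch in Python, relying on the standard
unicodedata module as a stand-in. It does not use or reflect the
cl-unicode API, and the function names are hypothetical. The caseless
variant follows the Unicode definition of canonical caseless matching,
NFD(casefold(NFD(x))).

```python
import unicodedata


def canonical_equal(a, b):
    """True if a and b are canonically equivalent (equal after NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def _canonical_fold(s):
    # Normalize, case-fold, then normalize again, because case folding
    # can produce sequences that are no longer in NFD.
    nfd = unicodedata.normalize("NFD", s)
    return unicodedata.normalize("NFD", nfd.casefold())


def canonical_caseless_equal(a, b):
    """Canonical caseless match: a Unicode-aware analogue of string-equal."""
    return _canonical_fold(a) == _canonical_fold(b)
```

For example, canonical_equal("\u00e9", "e\u0301") is true because the
precomposed e-acute decomposes to "e" followed by a combining acute
accent, whereas a plain code-point comparison of the two strings
fails.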

Juanjo

-- 
Instituto de Física Fundamental, CSIC
c/ Serrano, 113b, Madrid 28009 (Spain)
http://juanjose.garciaripoll.googlepages.com