[phemlock-devel] Hemlock buffer API and representation

Tue Jul 27 10:53:02 UTC 2004

Hello all, 

[just in case some of you forgot to subscribe to the list, I put you
on CC this one time] 

I have been thinking about the API and the internal representation of
a buffer.

First, I think we should start by generalizing the contents to Unicode
characters.  It would be OK with me, if we choose some normalized form
and stick to it.  I also think that we should not generalize further
right now (allowing arbitrary objects), but what I am suggesting below
is compatible with such an extension. 

Concerning the API first, we should get rid of the "doubly-linked list
of lines".  The API should not mention the existence of lines, except
as the text that is contained between two newline characters.

It is probably still a good thing to represent the buffer internally
as a sequence (but not necessarily a doubly-linked list) of lines, but
that's another story.  I know Gilbert would like to associate reader
state with lines, but as I will try to show below, it does not have to
be lines, and in fact, it would not be very optimal to associate it
with lines. 

For the internal representation, I suggest representing the buffer as
a cursorchain (see our flexichain paper at the first European workshop
on Lisp and Scheme) of lines.  There would be many different
representations of a line.  At first, I thought that it would be a
good idea to have a small number of `open' lines represented as
cursorchains of Unicode characters (or fixnums if the Lisp
implementation does not have Unicode characters), and the others would
have some kind of compressed format such as a UTF-8 string or simply a
gzipped version of the line.  Such transformations would complicate
search functions that would have to `open' every line, with possibly
huge performance penalties, both in terms of CPU and memory
consumption.  

Instead I suggest that open lines be represented as cursorchains and
closed lines as vectors, but in both cases, there can be different
element types.  At least three different element types would be used:
one-byte Unicode character (used when the line contains only Unicode
characters from 0 to 255, which includes latin-1), two-byte Unicode
character (in most other cases) or four-byte Unicode characters (for
serious Unicode users).  The buffer would automatically convert from
one-byte to two-byte to four-byte whenever an attempt is made to
insert a Unicode character that is beyond what can be represented.
When a line is closed, it is scanned to see whether a more compact
representation can be used.  I suggest keeping several lines open and
closing them in an LRU-like way.

This representation has the interesting property that every character
in the buffer is always represented as a valid Unicode character, so
that nothing special needs to be done for search functions.  One could
identify some interesting special cases, such as latin-15, which would
normally require more than one byte, except that very few characters
need more than one.  The search function would be slightly harder for
such special cases, but not much.  Also, sometime in the future, one
might identify line types that can have arbitrary Lisp objects in
them.  Another obvious property of the representation is that latin-1
texts are very compact indeed, an important special case, I imagine. 

Now, concerning syntax awareness, I think that what Gilbert is
thinking of is the most promising idea so far (with minor
modifications). His idea is have a `read' function that can have its
state stored, and to store such state in association with (the
beginning of) each line in the buffer.  Modifying the contents of the
buffer would require recomputing read state from the beginning of the
line where the modification took place to the end of the (last) window
on display.  In most cases, where there is only one window on display,
there would be a very modest amount of work to do.  In the worst case
(when the first line of a huge buffer is modified, and there is
another window at the end of it), the entire buffer would need to be
recomputed. 

The only objection I have to Gilbert's proposal is that he would like
read states to be associated with lines.  In fact, if it weren't for
memory consumption, we could associate such state with every character
in the buffer.  Using a line as a unit is just a way of storing such
information every so often, often enough to minimize computations and
rarely enough that memory consumption remain reasonable.  But this is
suboptimal because there could be empty lines or lines with just a
few characters on them, and other lines that have hundreds of lines.
A better idea would be to associate read state with marks (say) every
couple hundred characters.  Such marks would be added and removed
dynamically as the text grows and shrinks so that the distribution of
marks remain roughly the same.  These read-state marks would probably
be stored in another cursorchain specific to each buffer. 

-- 
Robert Strandh

---------------------------------------------------------------------
Greenspun's Tenth Rule of Programming: any sufficiently complicated C
or Fortran program contains an ad hoc informally-specified bug-ridden
slow implementation of half of Common Lisp.
---------------------------------------------------------------------