From strandh at labri.fr  Tue Aug  3 11:31:24 2004
From: strandh at labri.fr (Robert Strandh)
Date: Tue, 3 Aug 2004 13:31:24 +0200
Subject: [phemlock-devel] Hemlock string tables
Message-ID: <16655.30604.422222.378722@serveur5.labri.fr>

Hello,

For a few days, I have been looking around at various parts of the
code for Portable Hemlock.  Much code is quite easy to understand
and/or sufficiently documented that I can understand it right away.  

Then, I came across `table.lisp' in the `core' directory.  This
library makes it possible to do strange completions (like some
versions of CLIM can do) where strings like "l th d t" can be expanded
to "Load The Damned Thing".
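
To make the behaviour concrete, here is a minimal sketch of this kind of word-by-word prefix completion.  It is written in Python purely for illustration and is not the algorithm used in `table.lisp'; the names `matches' and `complete', the sample command list, and the rule that the abbreviation must have exactly as many words as the candidate are all assumptions of the sketch.

```python
def matches(abbrev, candidate):
    """True if each word of abbrev is a case-insensitive prefix of the
    corresponding word of candidate (assumes the same number of words)."""
    a_words = abbrev.lower().split()
    c_words = candidate.lower().split()
    return (len(a_words) == len(c_words)
            and all(c.startswith(a) for a, c in zip(a_words, c_words)))


def complete(abbrev, table):
    """Return every entry of table that abbrev could expand to."""
    return [entry for entry in table if matches(abbrev, entry)]


# Hypothetical command names, used only for this example.
commands = ["Load The Damned Thing", "Load File", "List Buffers"]
print(complete("l th d t", commands))
```

A linear scan like this is only reasonable because the lookup happens at interaction speed on a small table.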

It took me the better part of a working day to understand what on
earth it was supposed to do, and how it accomplished it.  Without
going into details, here are some of the problems:

  * The API is not documented.  It is hard to know which functions are
    meant to be used by client code.  In fact, it is hard to know what
    constitutes client code, since the table code is in the hemlock
    internals package.

  * The documentation for the implementation is totally
    incomprehensible.  

  * The coding style looks old with structs instead of classes,
    old-style loop macros, buffers without fill pointers, etc. 

  * There is (I think) at least one bug that makes the library not
    work if given a string longer than 128 characters. 

  * The use (admittedly thread-safe) of special variables to hold
    values to be communicated between certain macros and functions
    looks weird. 

As far as I can tell, this library is used at interaction speed, so it
would not have to be that fast, especially if, as now, maintainability
is a problem.  

A library like this would be useful in McCLIM.  I therefore decided to
rewrite it in a more modern style, less optimized but much shorter and
more idiomatic.  Currently, the library is around 600 lines of
non-comment code.  I think I can do it in 150-200 lines of code tops,
especially if I use the split-sequence function that is floating
around.  Also, I would prepare it for Unicode in Lisp implementations
where the character type is not Unicode, by making it work on
sequences of elements of any type and not just characters; in this
case, the elements would presumably be fixnums.

I was thinking of integrating this code into the McCLIM distribution.
It could be used for completion in McCLIM, but also (eventually) by
the CLIM-version of Portable Hemlock.

Anyway, this message is just to let you know that I am working on it,
and I expect to have some results in a few days.  It is a matter of a
few hours at most to write it, but I am on vacation, and my ADSL line
was just cut off for an upgrade to 2Mbits/s, so I cannot work
full-time on this right now.  Also, I have more things to do, like
Unicode for McCLIM, Gsharp, Flexichain, etc. 

-- 
Robert Strandh

---------------------------------------------------------------------
Greenspun's Tenth Rule of Programming: any sufficiently complicated C
or Fortran program contains an ad hoc informally-specified bug-ridden
slow implementation of half of Common Lisp.
---------------------------------------------------------------------


From strandh at labri.fr  Wed Aug  4 15:50:02 2004
From: strandh at labri.fr (Robert Strandh)
Date: Wed, 4 Aug 2004 17:50:02 +0200
Subject: [phemlock-devel] Re: Hemlock string tables
In-Reply-To: <lh3c33l3oe.fsf@dodo.bluetail.com>
References: <16655.30604.422222.378722@serveur5.labri.fr>
	<lh3c33l3oe.fsf@dodo.bluetail.com>
Message-ID: <16657.1450.989723.571937@serveur5.labri.fr>

Luke Gorrie writes:
 > JFYI: We have three completion algorithms in SLIME and one of them
 > sounds like it does what you mention - it's called "compound prefix
 > matching" in the code.
 > 
 > That's portable Common Lisp in swank.lisp and public domain in case
 > it's any use to you.

That might very well be the case.  Thanks for the hint.  I'll have a
look. 

-- 
Robert Strandh



From strandh at labri.fr  Tue Aug 10 06:48:46 2004
From: strandh at labri.fr (Robert Strandh)
Date: Tue, 10 Aug 2004 08:48:46 +0200
Subject: [phemlock-devel] the CVS mailing list
Message-ID: <16664.28622.771924.637965@serveur5.labri.fr>

It seems like my commits do not show up on the phemlock-cvs mailing
list.  Does anybody know why? 

-- 
Robert Strandh



From strandh at labri.fr  Tue Aug 10 06:53:34 2004
From: strandh at labri.fr (Robert Strandh)
Date: Tue, 10 Aug 2004 08:53:34 +0200
Subject: [phemlock-devel] the CVS mailing list
Message-ID: <16664.28910.650984.869263@serveur5.labri.fr>

Hello, 

I would be interested in getting Portable Hemlock to compile cleanly. 

For that, I need some suggestions as to what to do with certain
subsystems, such as mh mail, news, spell checker, etc.  It seems to me
that either we keep them and try to get them to work, and also
uncomment the corresponding files in hemlock.system, or else we remove
the remaining traces of them in other parts of the system such as the
bindings.lisp file.  

-- 
Robert Strandh



From strandh at labri.fr  Tue Aug 10 09:04:58 2004
From: strandh at labri.fr (Robert Strandh)
Date: Tue, 10 Aug 2004 11:04:58 +0200
Subject: [phemlock-devel] who is interested in working on Portable Hemlock? 
Message-ID: <16664.36794.523111.144165@serveur5.labri.fr>

Hello, 

There are currently 11 people on this mailing list.  I would be
interested in knowing which of these 11 people are here just as
observers, and which ones are considering working on the code. 

Here is some preventive maintenance I would like to see done fairly
soon, so that new work would be easier:

   * get rid of compilation notes,

   * replace combinations of %set- function and defsetf with 
     (defun (setf ...) ...),

   * remove files that we do not think are worth trying to support
     (candidates would be spell checker, mh mail, news,
     scribe-mode, and all the files that are commented out in
     hemlock.system),

   * replace structs with classes in many cases,

   * modularization issues, like associating bindings of a subsystem
     with the code for the subsystem, 

   * probably more stuff I haven't seen yet. 

-- 
Robert Strandh



From strandh at labri.fr  Tue Aug 10 09:34:21 2004
From: strandh at labri.fr (Robert Strandh)
Date: Tue, 10 Aug 2004 11:34:21 +0200
Subject: [phemlock-devel] Re: Thoughts on syntax editing
Message-ID: <16664.38557.172303.977281@serveur5.labri.fr>

Hello, 

Sorr

Brian Mastenbrook writes: 

> I've been thinking for several weeks about the issues involved with
> supporting syntax-aware editing in an editor, particularly Common Lisp
> syntax. Recently I wrote an incremental parser to handle syntax
> coloring for lisppaste, and so that experience has helped me to get a
> better handle on what the relevant issues are to supporting robust
> syntax-aware editing in a text editor.

Gilbert did something similar as demonstrated in the clim/foo.lisp
file in the source tree. 

> What I'm including in the umbrella of syntax-aware editing are the
> following tasks:
> 
> * Coloring the source based on syntactic type
> * Detecting invalid syntax and (a) informing the user and (b) skipping
>   over detectable sections of invalid syntax when performing other
>   commands
> * Detecting sections of comment and strings, and ignoring these for
>   other commands appropriately
> * and, commands which operate on the parse tree of the source,
>   including the C-M-* functions which operate on an entire
>   s-expression

I would like to add detecting and being able to act upon bad
indentation. 

> There are several approaches to making this work. Two of them have
> been done before en masse, with various degrees of success:
> 
> * Force the user to only edit valid syntax, and insert balanced pairs
>   of parens / quotes / comment delimiters. This approach works fine
>   for editing new source but does not work so well when reading in an
>   existing, possibly invalid file, and also can feel much like a
>   straightjacket. It also requires a large amount of adaptation to the
>   environment. The best exemplar of this approach is Interlisp's
>   S-Edit structural editor.

Yes, this has been tried.  The idea seems good, but in practice it
does not work very well. 

> * Maintain the view that the text is merely an octet stream, and
>   locally use regexps to try to determine what syntax things are,
>   setting character properties along the way. This approach is (as far
>   as I understand it) the approach used by Emacs, which leads to
>   massive confusion when editing unbalanced strings and
>   comments. Sometimes Emacs never recovers; especially with CL-style
>   #||# multiline comments.

Right.  Regular expressions are not powerful enough to understand
nested syntax. 
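
As an illustration of why a counter is needed, here is a minimal sketch (in Python, for illustration only) that tracks the nesting depth of #| ... |# comments.  The depth is unbounded, which is exactly what a finite automaton, and hence a regular expression in the formal sense, cannot keep track of.  The function name `block_comment_depth' is hypothetical, and the sketch deliberately ignores strings, |...| symbol quoting and character literals.

```python
def block_comment_depth(text):
    """Return the #| ... |# nesting depth at the end of text.

    Simplifying assumption: strings, |symbols| and #\\ character
    literals are not recognized, only the comment delimiters.
    """
    depth = 0
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair == "#|":
            depth += 1          # one more level of comment opened
            i += 2
        elif pair == "|#" and depth > 0:
            depth -= 1          # innermost comment closed
            i += 2
        else:
            i += 1
    return depth


print(block_comment_depth("#| outer #| inner |# still outer"))
print(block_comment_depth("#| outer #| inner |# still outer |#"))
```

The first call ends inside the outer comment and the second ends outside any comment, a distinction that depends on counting the openers seen so far.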

> What I would like to see is a third path: a robust editor which always
> knows the syntactic type of the text, but allows the user to edit code
> as if they were using a plain text editor (or allows a slightly
> smarter mode where balanced syntax is inserted by default). This is
> the holy grail of useful syntactic editing. It's also very
> complicated.

It might not be terribly complicated. 

> My first approach in writing such an editor revolved around viewing a
> buffer as a doubly-linked list of lines, which themselves were a
> doubly linked list composed of segments of text broken up by markers
> (which delineate syntactic type and also include the cursor and
> mark). This approach quickly got very complex: it was too difficult to
> write general text-manipulation routines when text was constantly
> being broken up by markers, and there could be an unbounded number of
> markers between each character.
> 
>  ------------------------------------------------------
> | "a" | #<marker> | "bc" | #<syntactic-marker> | "def" |
>  ------------------------------------------------------
>  The "markers embedded in lines" approach: too complex.

I agree.  This is too hard, and it makes other operations, such as
text searching, harder as well. 

> Dan Barlow had a better idea: keep the text as a doubly linked list of
> strings representing lines, but still use markers to represent the
> beginning and end of sections with a particular syntactic type. In
> other words, the markers would be separated out from the text itself,
> making it possible to have views of the same text with different
> collections of markers. Primitive editing operations would then be
> responsible for making sure that whatever marker invariants were
> necessary were preserved, but this could be simpler when markers are
> disjoint and it's easy to pull out the set of markers affected by an
> operation. This is, I think, the best approach, as it would allow
> keeping the "line" abstraction that hemlock uses as-is.

I suggest a simplification of Gilbert's idea.  He uses an incremental
parser that attaches its state information to the beginning and the
end of each line.  Whenever an altered line is requested for display,
he restarts the parser on the line, sets font information associated
with the line, and then displays the line according to the font
information.  This approach is very robust, and his incremental parser
is surprisingly small and simple.  

The only simplification I suggest is not to mark the font information
on the line, but to invoke the parser each time a line is displayed
(essentially after each keystroke).  This simplification makes it
unnecessary to keep font information of the line up to date. 
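
A minimal sketch of this scheme follows.  It is in Python for illustration only and is not Gilbert's parser: the names `scan_line', `Buffer', `replace_line' and `faces_for_display' are hypothetical, the parser state is reduced to a pair (block-comment depth, inside a string?), and line comments and character literals are ignored.  Only the state at the start of each line is cached; the faces are recomputed every time a line is displayed.

```python
INITIAL = (0, False)  # (block-comment depth, inside a string?)


def scan_line(line, state):
    """Classify each character of line, starting from state.

    Returns (faces, end_state); faces has one entry per character:
    "comment", "string" or "code".
    """
    depth, in_string = state
    faces = []
    i = 0
    while i < len(line):
        pair = line[i:i + 2]
        if in_string:
            faces.append("string")
            if line[i] == "\\" and i + 1 < len(line):
                faces.append("string")      # the escaped character
                i += 1
            elif line[i] == '"':
                in_string = False           # closing quote
            i += 1
        elif pair == "#|":
            depth += 1
            faces += ["comment", "comment"]
            i += 2
        elif depth > 0:
            if pair == "|#":
                depth -= 1
                faces += ["comment", "comment"]
                i += 2
            else:
                faces.append("comment")
                i += 1
        elif line[i] == '"':
            in_string = True                # opening quote
            faces.append("string")
            i += 1
        else:
            faces.append("code")
            i += 1
    return faces, (depth, in_string)


class Buffer:
    def __init__(self, lines):
        self.lines = list(lines)
        # start_states[k] is the parser state at the start of line k.
        # Only a valid prefix of this list is ever kept.
        self.start_states = [INITIAL]

    def replace_line(self, index, text):
        self.lines[index] = text
        # States after the edited line may be stale, so drop them.
        del self.start_states[index + 1:]

    def faces_for_display(self, index):
        """Re-parse as far as needed; return the faces of line index."""
        while len(self.start_states) <= index:
            k = len(self.start_states) - 1
            _, end = scan_line(self.lines[k], self.start_states[k])
            self.start_states.append(end)
        faces, _ = scan_line(self.lines[index], self.start_states[index])
        return faces


buf = Buffer(['(format nil', 'Mary had a little lamb.', '"done")'])
print(buf.faces_for_display(2)[:2])
buf.replace_line(1, '"Mary had a little lamb.')
print(buf.faces_for_display(2)[:2])
```

Inserting the single `"' on the second line changes how the third line is classified, yet nothing stored on the lines themselves has to be repaired: the stale states are discarded and recomputed the next time a line is displayed.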

> I am envisioning that markers can be used for several different
> purposes - extensible at the user's request. This would include: the
> cursor, the mark, delimiters for syntax types, even the beginning and
> end of the line. Allowing multiple cursor-type markers and setting one
> as primary would allow collaborative editing fairly easily. (On the
> subject of collaborative editing, the disjoint-markers idea is
> probably a good one here too: it means that updates to the text can be
> sent out without syntactic information, and each client updates the
> marker set to account for the new text on its own.)

While multiple marks are a good idea, I suspect it is going to be hard
to use them for syntax-aware editing.  I think it is going to be
difficult to know whether characters you add should be associated with
the syntactic element to the left of the mark or to the right of it.
Also, as you pointed out, inserting a single character such as `"'
might change the syntactic categories of everything that follows.
Re-parsing the line seems like a much simpler and much more robust
approach. 

> The next question of relevance is to figure out how to use these
> markers to implement knowledge of the file syntax. This Ain't Easy,
> for several reasons.
> 
> First, we need to know the raw syntactic role of a various section of
> text. Is it a symbol? A string? A list opening or closing? But if we
> view s-expression editing via the likes of C-M-t as the same problem
> as syntax coloring, this means we need to maintain a nested view of
> the current syntax, because finding the end of the current
> s-expression means understanding not just the raw syntactic role of an
> element but its level of nesting inside other syntactic types. Nesting
> is also necessary for robust coloring in Common Lisp: emacs famously
> fails to handle nested #||# comments (and even sometimes non-nested
> ones), leaving most of your buffer showing in the font lock comment
> color.

Yes, but Gilbert's approach keeps the stack of the parser around as
state information so that solves the problem quite easily. 

> Maintaining this nested view of syntax while allowing the user to
> insert unbalanced syntactic elements is more than merely a SMOP,
> however. A simple insertion of a character might affect the syntactic
> type of the entire rest of the file. Inserting a closing character
> would revert it back then - but possibly not a clean
> reversion. Deleting a character might require restarting the parser at
> some point prior to the previous character or syntactic type change.

That is exactly what Gilbert suggests.  He restarts the parser from
the start of the line that was modified, but only for lines that are
required to be displayed.  In the worst case, this means parsing the
entire buffer: a modification is made to the first line while the
last line is on display in some (possibly different) window.  

> Here are some specific examples to think about:
> 
> The cursor is on #\A in "#3A((1) (2) (3))". The user hits the delete
> key.
> 
> The cursor is on the first #\( in "() (+ 1 2))". The user hits the #\(
> key.
> 
> The cursor is at the beginning of the second line of the following
> section of text. The user hits the #\" key.
> 
> -------------------------------
> (format nil
> Mary had a little lamb. Its fleece was as white as ~A. #|"
> '|#snow|) ; |
> -------------------------------
> 
> These examples demonstrate that robust syntax-aware editing is not a
> simple matter of local regular expressions or of a parser that can be
> run until the top-level syntactic role matches what it was before the
> edit. A single edit may actually have a deeper meaning - for instance,
> in the second example, inserting a paren means "insert a level of
> list-ness at this position in the syntactic role stack until an
> unmatched close paren is found on this level".
> 
> To solve the problem of editing unbalanced strings or multiline
> comments, Andreas Fuchs suggested that it be possible to revert the
> syntactic type of an entire region when the insertion of an unmatched
> opening element would otherwise destroy this information. I think this
> is a good idea. It can be implemented by "hiding" one set of markers
> when the unmatched opening element is inserted, and un-hiding them
> after. Any edits the user makes in the meantime can be rectified to
> both the hidden and unhidden syntactic markers, thus meaning zero
> reparsing when the close element is inserted. However, this sounds
> like just the type of SMOP that is far more difficult than it sounds.
> 
> I'm curious to know what other people think about this. It seems to me
> that this is a rather difficult problem when the various nooks and
> crannies of CL syntax are taken into account (unless you're willing to
> live with the possibility of reparsing the entire source file on many
> edits). If this doesn't make any sense at all I'd like to know that
> too.
> 
> Brian

-- 
Robert Strandh
