From bmastenb at cs.indiana.edu Fri Jul 9 15:39:56 2004
From: bmastenb at cs.indiana.edu (Brian Mastenbrook)
Date: Fri, 9 Jul 2004 10:39:56 -0500 (EST)
Subject: [phemlock-devel] Thoughts on syntax editing
Message-ID:

I've been thinking for several weeks about the issues involved in supporting syntax-aware editing in an editor, particularly for Common Lisp syntax. Recently I wrote an incremental parser to handle syntax coloring for lisppaste, and that experience has helped me get a better handle on the issues relevant to supporting robust syntax-aware editing in a text editor.

What I'm including under the umbrella of syntax-aware editing are the following tasks:

* Coloring the source based on syntactic type
* Detecting invalid syntax and (a) informing the user and (b) skipping over detectable sections of invalid syntax when performing other commands
* Detecting sections of comments and strings, and ignoring these appropriately for other commands
* Commands which operate on the parse tree of the source, including the C-M-* functions which operate on an entire s-expression

There are several approaches to making this work. Two of them have been done before en masse, with various degrees of success:

* Force the user to only edit valid syntax, and insert balanced pairs of parens / quotes / comment delimiters. This approach works fine for editing new source but does not work so well when reading in an existing, possibly invalid file, and it can also feel much like a straitjacket. It also requires a large amount of adaptation to the environment. The best exemplar of this approach is Interlisp's S-Edit structural editor.
* Maintain the view that the text is merely an octet stream, and locally use regexps to try to determine what syntax things are, setting character properties along the way. This approach is (as far as I understand it) the approach used by Emacs, which leads to massive confusion when editing unbalanced strings and comments.
  Sometimes Emacs never recovers, especially with CL-style #||# multiline comments.

What I would like to see is a third path: a robust editor which always knows the syntactic type of the text, but allows the user to edit code as if they were using a plain text editor (or allows a slightly smarter mode where balanced syntax is inserted by default). This is the holy grail of useful syntactic editing. It's also very complicated.

My first approach in writing such an editor revolved around viewing a buffer as a doubly-linked list of lines, each of which was itself a doubly-linked list of segments of text broken up by markers (which delineate syntactic type and also include the cursor and mark). This approach quickly got very complex: it was too difficult to write general text-manipulation routines when text was constantly being broken up by markers, and there could be an unbounded number of markers between any two characters.

------------------------------------------------------
| "a" | # | "bc" | # | "def" |
------------------------------------------------------
The "markers embedded in lines" approach: too complex.

Dan Barlow had a better idea: keep the text as a doubly-linked list of strings representing lines, but still use markers to represent the beginning and end of sections with a particular syntactic type. In other words, the markers would be separated out from the text itself, making it possible to have views of the same text with different collections of markers. Primitive editing operations would then be responsible for making sure that whatever marker invariants were necessary were preserved, but this could be simpler when markers are disjoint and it's easy to pull out the set of markers affected by an operation. This is, I think, the best approach, as it would allow keeping the "line" abstraction that Hemlock uses as-is.

I am envisioning that markers can be used for several different purposes - extensible at the user's request.
This would include: the cursor, the mark, delimiters for syntax types, even the beginning and end of the line. Allowing multiple cursor-type markers and setting one as primary would allow collaborative editing fairly easily. (On the subject of collaborative editing, the disjoint-markers idea is probably a good one here too: it means that updates to the text can be sent out without syntactic information, and each client updates the marker set to account for the new text on its own.)

The next question is how to use these markers to implement knowledge of the file syntax. This Ain't Easy, for several reasons. First, we need to know the raw syntactic role of a given section of text. Is it a symbol? A string? A list opening or closing? But if we view s-expression editing via the likes of C-M-t as the same problem as syntax coloring, we need to maintain a nested view of the current syntax, because finding the end of the current s-expression means understanding not just the raw syntactic role of an element but its level of nesting inside other syntactic types. Nesting is also necessary for robust coloring in Common Lisp: Emacs famously fails to handle nested #||# comments (and even sometimes non-nested ones), leaving most of your buffer showing in the font-lock comment color.

Maintaining this nested view of syntax while allowing the user to insert unbalanced syntactic elements is more than merely a SMOP, however. A simple insertion of a character might affect the syntactic type of the entire rest of the file. Inserting a closing character would then revert it - but possibly not cleanly. Deleting a character might require restarting the parser at some point prior to the previous character or syntactic type change. Here are some specific examples to think about:

The cursor is on #\A in "#3A((1) (2) (3))". The user hits the delete key.

The cursor is on the first #\( in "() (+ 1 2))". The user hits the #\( key.

The cursor is at the beginning of the second line of the following section of text. The user hits the #\" key.

-------------------------------
(format nil
Mary had a little lamb. Its fleece was as white as ~A. #|" '|#snow|) ; |
-------------------------------

These examples demonstrate that robust syntax-aware editing is not a simple matter of local regular expressions, or of a parser that can be run until the top-level syntactic role matches what it was before the edit. A single edit may actually have a deeper meaning - for instance, in the second example, inserting a paren means "insert a level of list-ness at this position in the syntactic role stack until an unmatched close paren is found on this level".

To solve the problem of editing unbalanced strings or multiline comments, Andreas Fuchs suggested that it be possible to revert the syntactic type of an entire region when the insertion of an unmatched opening element would otherwise destroy this information. I think this is a good idea. It can be implemented by "hiding" one set of markers when the unmatched opening element is inserted, and un-hiding them once the matching close element arrives. Any edits the user makes in the meantime can be reconciled with both the hidden and unhidden syntactic markers, meaning zero reparsing when the close element is inserted. However, this sounds like just the type of SMOP that is far more difficult than it sounds.

I'm curious to know what other people think about this. It seems to me that this is a rather difficult problem when the various nooks and crannies of CL syntax are taken into account (unless you're willing to live with the possibility of reparsing the entire source file on many edits). If this doesn't make any sense at all I'd like to know that too.

Brian
--
Brian Mastenbrook                       "God made the natural numbers;
http://www.cs.indiana.edu/~bmastenb/     all else is the work of man."
bmastenb at cs.indiana.edu                  -- Leopold Kronecker

From a_bakic at yahoo.com Wed Jul 14 13:21:17 2004
From: a_bakic at yahoo.com (Aleksandar Bakic)
Date: Wed, 14 Jul 2004 06:21:17 -0700 (PDT)
Subject: [phemlock-devel] Re: Thoughts on syntax editing
Message-ID: <20040714132117.68090.qmail@web40605.mail.yahoo.com>

Hi,

Have you looked at the Synthesizer Generator?

http://www.grammatech.com/products/sg/overview.html

What do you think, would it be possible to make Hemlock come close?

Alex

From a_bakic at yahoo.com Thu Jul 15 14:52:22 2004
From: a_bakic at yahoo.com (Aleksandar Bakic)
Date: Thu, 15 Jul 2004 07:52:22 -0700 (PDT)
Subject: [phemlock-devel] Re: Thoughts on syntax editing
In-Reply-To: <20040714132117.68090.qmail@web40605.mail.yahoo.com>
Message-ID: <20040715145222.85042.qmail@web40605.mail.yahoo.com>

I'd like to discuss another issue: using a persistent data structure for the buffer so that some functionality can be obtained for free, such as

- unlimited undo
- storing the editing history together with the latest version of a file (could be used to compute version differences better than cvs-diff does)

I am still looking for appropriate data structures. A good, brief intro into this area is

http://ocw.mit.edu/NR/rdonlyres/Electrical-Engineering-and-Computer-Science/6-854JAdvanced-AlgorithmsFall1999/1ED6CEC5-62B7-460D-8DB9-886B2E26A633/0/scribe5.pdf

(See Section 5.2.3.)

Is anyone else interested in pursuing this? Any other ideas?

Alex

From strandh at labri.fr Tue Jul 27 10:53:02 2004
From: strandh at labri.fr (Robert Strandh)
Date: Tue, 27 Jul 2004 12:53:02 +0200
Subject: [phemlock-devel] Hemlock buffer API and representation
Message-ID: <16646.13326.177432.555006@serveur5.labri.fr>

Hello all,

[just in case some of you forgot to subscribe to the list, I put you on CC this one time]

I have been thinking about the API and the internal representation of a buffer.

First, I think we should start by generalizing the contents to Unicode characters. It would be OK with me if we chose some normalized form and stuck to it. I also think that we should not generalize further right now (allowing arbitrary objects), but what I am suggesting below is compatible with such an extension.

Concerning the API first, we should get rid of the "doubly-linked list of lines". The API should not mention the existence of lines, except as the text that is contained between two newline characters. It is probably still a good thing to represent the buffer internally as a sequence (but not necessarily a doubly-linked list) of lines, but that's another story. I know Gilbert would like to associate reader state with lines, but as I will try to show below, it does not have to be lines, and in fact, it would not be optimal to associate it with lines.

For the internal representation, I suggest representing the buffer as a cursorchain (see our flexichain paper at the first European workshop on Lisp and Scheme) of lines. There would be many different representations of a line. At first, I thought that it would be a good idea to have a small number of `open' lines represented as cursorchains of Unicode characters (or fixnums if the Lisp implementation does not have Unicode characters), and the others would have some kind of compressed format such as a UTF-8 string or simply a gzipped version of the line.
Such transformations would complicate search functions, which would have to `open' every line, with possibly huge performance penalties in terms of both CPU and memory consumption.

Instead I suggest that open lines be represented as cursorchains and closed lines as vectors, but in both cases, there can be different element types. At least three different element types would be used: one-byte Unicode characters (used when the line contains only Unicode characters from 0 to 255, which includes latin-1), two-byte Unicode characters (in most other cases) or four-byte Unicode characters (for serious Unicode users). The buffer would automatically convert from one-byte to two-byte to four-byte whenever an attempt is made to insert a Unicode character that is beyond what can be represented. When a line is closed, it is scanned to see whether a more compact representation can be used. I suggest keeping several lines open and closing them in an LRU-like way.

This representation has the interesting property that every character in the buffer is always represented as a valid Unicode character, so that nothing special needs to be done for search functions. One could identify some interesting special cases, such as ISO 8859-15 (Latin-9), which would normally require more than one byte per character, even though very few of its characters actually fall outside the one-byte range. The search function would be slightly harder for such special cases, but not much. Also, sometime in the future, one might identify line types that can have arbitrary Lisp objects in them.

Another obvious property of the representation is that latin-1 texts are very compact indeed, an important special case, I imagine.

Now, concerning syntax awareness, I think that what Gilbert is thinking of is the most promising idea so far (with minor modifications). His idea is to have a `read' function whose state can be stored, and to store such state in association with (the beginning of) each line in the buffer.
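
[Editorial sketch of the stored read-state idea, in Python for brevity rather than Lisp. The names ReadState, scan_line and Buffer are hypothetical and are not part of Hemlock or of Gilbert's proposal. The "reader" here is deliberately tiny: it models only double-quoted strings, `;` line comments and nested #| |# comments, and ignores |symbol| and #\ character syntax. It only shows how a state saved at the start of each line lets an edit invalidate, and later recompute, just the lines from the edit onward.]

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReadState:
    """Reader state at the start of a line (a deliberately tiny model)."""
    in_string: bool = False
    comment_depth: int = 0  # nesting level of #| ... |# comments


def scan_line(line: str, state: ReadState) -> ReadState:
    """Return the state after `line`, given the state before it."""
    in_string, depth = state.in_string, state.comment_depth
    i = 0
    while i < len(line):
        two = line[i:i + 2]
        if in_string:
            if line[i] == "\\":
                i += 2  # skip the escaped character
                continue
            if line[i] == '"':
                in_string = False
        elif depth > 0:
            if two == "#|":
                depth += 1
                i += 2
                continue
            if two == "|#":
                depth -= 1
                i += 2
                continue
        elif two == "#|":
            depth = 1
            i += 2
            continue
        elif line[i] == '"':
            in_string = True
        elif line[i] == ";":
            break  # line comment: the rest of the line is ignored
        i += 1
    return ReadState(in_string, depth)


class Buffer:
    """Lines of text plus the saved state at the start of each line."""

    def __init__(self, lines):
        self.lines = list(lines)
        # states[i] is the state at the start of lines[i]; only the
        # entries currently known to be valid are kept in the list.
        self.states = [ReadState()]

    def replace_line(self, i: int, text: str) -> None:
        self.lines[i] = text
        del self.states[i + 1:]  # states after line i are now stale

    def state_at(self, i: int) -> ReadState:
        """State at the start of line i, rescanning only stale lines."""
        while len(self.states) <= i:
            k = len(self.states) - 1
            self.states.append(scan_line(self.lines[k], self.states[k]))
        return self.states[i]
```

[In this sketch a redisplay routine would call state_at on the last visible line, so that editing a line costs a rescan from that line to the end of the display and nothing before it.]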
Modifying the contents of the buffer would require recomputing read state from the beginning of the line where the modification took place to the end of the (last) window on display. In most cases, where there is only one window on display, there would be a very modest amount of work to do. In the worst case (when the first line of a huge buffer is modified, and there is another window at the end of it), the read state for the entire buffer would need to be recomputed.

The only objection I have to Gilbert's proposal is that he would like read states to be associated with lines. In fact, if it weren't for memory consumption, we could associate such state with every character in the buffer. Using a line as a unit is just a way of storing such information every so often, often enough to minimize computations and rarely enough that memory consumption remains reasonable. But this is suboptimal because there could be empty lines or lines with just a few characters on them, and other lines that have hundreds of characters. A better idea would be to associate read state with marks (say) every couple hundred characters. Such marks would be added and removed dynamically as the text grows and shrinks so that the distribution of marks remains roughly the same. These read-state marks would probably be stored in another cursorchain specific to each buffer.

--
Robert Strandh

---------------------------------------------------------------------
Greenspun's Tenth Rule of Programming: any sufficiently complicated C
or Fortran program contains an ad hoc informally-specified bug-ridden
slow implementation of half of Common Lisp.
---------------------------------------------------------------------

From strandh at labri.fr Tue Jul 27 10:54:46 2004
From: strandh at labri.fr (Robert Strandh)
Date: Tue, 27 Jul 2004 12:54:46 +0200
Subject: [phemlock-devel] project page
Message-ID: <16646.13430.609612.365051@serveur5.labri.fr>

Gilbert (or anyone else who knows how to do this),

It would be good if the mailing lists were mentioned on the project page, so that people can easily subscribe.

--
Robert Strandh