From harold at hotelling.net Tue Feb 13 00:18:43 2007
From: harold at hotelling.net (Harold Lee)
Date: Mon, 12 Feb 2007 16:18:43 -0800
Subject: [cl-ppcre-devel] Quickly finding which alternative matched
Message-ID: <45D103E3.2070509@hotelling.net>

For an :ALTERNATION, is there an O(1) way to tell which alternate
matched? e.g. for a regexp (a)|(b)|(c)|(d)|... I don't want the O(n)
performance of scanning the list returned by SCAN or
SCAN-TO-STRINGS. Instead, I'd just like a number saying which one
matched.

I understand that there can normally be nested groups, alternatives
that are not groups, and other complications - so maybe a list/vector
of groups that were matched would work as a general interface.

More background:

For a quick and dirty scanner (similar to lex) I thought I could use
CL-PPCRE. The "rules" (in lex terminology) are just regular
expressions, and so I combine them into a set of alternatives, e.g.

\s+ -> whitespace
[0-9]+ -> number
[a-zA-Z][a-zA-Z0-9]* -> identifier

=> "(\\s+)|([0-9]+)|([a-zA-Z][a-zA-Z0-9]*)"

Actually, I use CL-PPCRE::PARSE-STRING to get a parse tree for each
expression, change any :REGISTER tags to :GROUP tags, and combine
them under an :ALTERNATION tag:

(defun combine-regexps (&rest regexps)
  "Combine several regexp strings into a CL-PPCRE parse tree of alternatives."
  ;; All registers are changed into groups, i.e. (x) -> (?:x) in regexp syntax.
  ;; That keeps the registers from messing up the scanner's expectation that
  ;; each register is one of the rules, and allows the list of rules to use
  ;; () instead of (?:) throughout for readability.
  (let ((registers (mapcar (lambda (r)
                             `(:register ,(subst :group :register
                                                 (cl-ppcre::parse-string r))))
                           regexps)))
    `(:sequence :start-anchor (:group (:alternation ,@registers)))))

From edi at agharta.de Tue Feb 13 07:58:46 2007
From: edi at agharta.de (Edi Weitz)
Date: Tue, 13 Feb 2007 08:58:46 +0100
Subject: [cl-ppcre-devel] Quickly finding which alternative matched
In-Reply-To: <45D103E3.2070509@hotelling.net> (Harold Lee's message of "Mon, 12 Feb 2007 16:18:43 -0800")
References: <45D103E3.2070509@hotelling.net>
Message-ID:

On Mon, 12 Feb 2007 16:18:43 -0800, Harold Lee wrote:

> For an :ALTERNATION, is there an O(1) way to tell which alternate
> matched? e.g. for a regexp (a)|(b)|(c)|(d)|... I don't want the O(n)
> performance of scanning the list returned by SCAN or
> SCAN-TO-STRINGS. Instead, I'd just like a number saying which one
> matched.

You could use filters:

  http://weitz.de/cl-ppcre/#filters

However, are you really sure that the O(n) operation of looping
through the registers is causing performance problems? Have you
profiled the code? This looks like premature optimization to me.

I'm pretty sure that whatever you'll do to "improve" this (filters or
changing CL-PPCRE internally to return more information for example)
you'll end up with something even slower.

> For a quick and dirty scanner (similar to lex) I thought I could use
> CL-PPCRE. The "rules" (in lex terminology) are just regular
> expressions, and so I combine them into a set of alternatives, e.g.
>
> \s+ -> whitespace
> [0-9]+ -> number
> [a-zA-Z][a-zA-Z0-9]* -> identifier
>
> => "(\\s+)|([0-9]+)|([a-zA-Z][a-zA-Z0-9]*)"
>
> Actually, I use CL-PPCRE::PARSE-STRING to get a parse tree for each
> expression, change any :REGISTER tags to :GROUP tags, and combine
> them under an :ALTERNATION tag:

That doesn't feel right. I wouldn't create regex strings first just
to parse them into s-expressions afterwards. Why don't you start with
s-expressions right away?

Cheers,
Edi.
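
Both suggestions in this reply can be sketched together: the rules are written as s-expressions from the start, and the matching alternative is found by looking for the first non-NIL entry in the register vector that SCAN returns. This is only a minimal sketch. The names *RULE-NAMES*, *RULES* and WHICH-RULE are hypothetical and not part of CL-PPCRE, and loading the library through Quicklisp is an assumption; any other way of loading CL-PPCRE works as well.

```lisp
;; Assumption: Quicklisp is available for loading CL-PPCRE.
(ql:quickload :cl-ppcre :silent t)

(defparameter *rule-names* #(:whitespace :number :identifier)
  "Hypothetical token names, one per register in *RULES*.")

(defparameter *rules*
  (cl-ppcre:create-scanner
   ;; Parse tree for ^(?:(\s+)|([0-9]+)|([a-zA-Z][a-zA-Z0-9]*)),
   ;; written directly as an s-expression.
   '(:sequence :start-anchor
     (:alternation
      (:register (:greedy-repetition 1 nil :whitespace-char-class))
      (:register (:greedy-repetition 1 nil :digit-class))
      (:register
       (:sequence
        (:char-class (:range #\a #\z) (:range #\A #\Z))
        (:greedy-repetition 0 nil
         (:char-class (:range #\a #\z) (:range #\A #\Z)
                      (:range #\0 #\9))))))))
  "Scanner with one register per rule.")

(defun which-rule (string)
  "Return the name of the rule matching at the start of STRING and the
end position of the match, or NIL if no rule matches."
  (multiple-value-bind (match-start match-end reg-starts)
      (cl-ppcre:scan *rules* string)
    (when match-start
      ;; Registers that took no part in the match hold NIL, so the
      ;; first non-NIL entry identifies the alternative. This is the
      ;; O(n) loop over the registers discussed above.
      (values (aref *rule-names* (position-if #'identity reg-starts))
              match-end))))
```

With only a handful of rules the POSITION-IF call walks a very short vector, which is why profiling first is a reasonable step before changing anything.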
From harold at hotelling.net Tue Feb 13 20:08:37 2007
From: harold at hotelling.net (Harold Lee)
Date: Tue, 13 Feb 2007 12:08:37 -0800
Subject: [cl-ppcre-devel] Quickly finding which alternative matched
In-Reply-To:
References: <45D103E3.2070509@hotelling.net>
Message-ID: <45D21AC5.8070102@hotelling.net>

Edi Weitz wrote:
> However, are you really sure that the O(n) operation of looping
> through the registers is causing performance problems? Have you
> profiled the code? This looks like premature optimization to me.
>
> I'm pretty sure that whatever you'll do to "improve" this (filters or
> changing CL-PPCRE internally to return more information for example)
> you'll end up with something even slower.
>
I am guilty of premature optimization at some level, but I think flex
has set the bar high for scanner performance. I'll spend more time
examining performance to see if this is really needed.

> That doesn't feel right. I wouldn't create regex strings first just
> to parse them into s-expressions afterwards. Why don't you start with
> s-expressions right away?
>
I'll change COMBINE-REGEXPS to only call CL-PPCRE::PARSE-STRING for
strings (and assume other data is an appropriate s-expression). I'd
like to make this very similar to lex / flex in allowing users of this
package to use regular expressions. I'm not worried about the parsing
performance because I am doing this at compile time (via a macro,
DEFSCANNER).

From edi at agharta.de Tue Feb 13 21:11:39 2007
From: edi at agharta.de (Edi Weitz)
Date: Tue, 13 Feb 2007 22:11:39 +0100
Subject: [cl-ppcre-devel] Re: Question: PCRE -> Thompson NFA Implementation
In-Reply-To: <800502.28683.qm@web81202.mail.mud.yahoo.com> (Brent Fulgham's message of "Tue, 13 Feb 2007 12:23:36 -0800 (PST)")
References: <800502.28683.qm@web81202.mail.mud.yahoo.com>
Message-ID:

Hi, first of all:

1. Please use the mailing list.

     http://weitz.de/cl-ppcre/#mail

2. It's called CL-PPCRE and not PCRE. PCRE is something else...
:)

On Tue, 13 Feb 2007 12:23:36 -0800 (PST), Brent Fulgham wrote:

> Although your Lisp implementation of a Perl-compatible regular
> expression engine already handily beats the original Perl version,
> it could be modified to be even faster for expressions that do not
> contain back-references. See the following article that discusses
> the 1960's-era algorithm used in Awk/Grep that discusses this
> (http://swtch.com/~rsc/regexp/regexp1.html).
>
> In my testing, CL-PCRE isn't quite faster than Perl, though it makes
> a very creditable showing
> (http://shootout.alioth.debian.org/debian/benchmark.php?test=regexdna?=all).
> Tcl, which uses the "Thompson DFA" algorithm discussed in the paper
> I referenced is nearly an order of magnitude faster on this
> benchmark than Perl.
>
> Please let me know if you have any interest exploring this. I might
> try to play with this and see if I can make any headway...

I'm aware of the advantages of DFAs over NFAs for "simple" regular
expressions, but I shied away from them until now because having two
engines in CL-PPCRE would make the code base even bigger and more
complicated than it already is. (And having /only/ a DFA engine
wouldn't be enough, right? I haven't read the article yet, but I'm
pretty sure you'd have to let go some of Perl's more advanced regex
features.)

Also, although I boast about CL-PPCRE's performance on its web site,
I'm not too concerned about its speed anymore. It's fast enough for
what I'm doing with it.

Having said that, the idea of automatically switching to a fast DFA
engine if possible (I guess this is what you want to do) is kind of
tempting. If you come up with something that's really a big
improvement and adheres to CL-PPCRE's current coding and documentation
standards, I'd be willing to review and possibly integrate it. Right
now, I'm too busy to help with that, though.

Cheers,
Edi.
From edi at agharta.de Tue Feb 13 21:48:32 2007
From: edi at agharta.de (Edi Weitz)
Date: Tue, 13 Feb 2007 22:48:32 +0100
Subject: [cl-ppcre-devel] Re: Question: PCRE -> Thompson NFA Implementation
In-Reply-To: <800502.28683.qm@web81202.mail.mud.yahoo.com> (Brent Fulgham's message of "Tue, 13 Feb 2007 12:23:36 -0800 (PST)")
References: <800502.28683.qm@web81202.mail.mud.yahoo.com>
Message-ID:

On Tue, 13 Feb 2007 12:23:36 -0800 (PST), Brent Fulgham wrote:

> http://shootout.alioth.debian.org/debian/benchmark.php?test=regexdna

I took a quick look. That SBCL program doesn't look as if it was
written by an experienced Lisper. I hope nobody takes these
"Shootouts" seriously, or they should at least attempt not to compare
apples with oranges...

From edi at agharta.de Wed Feb 14 08:35:31 2007
From: edi at agharta.de (Edi Weitz)
Date: Wed, 14 Feb 2007 09:35:31 +0100
Subject: [cl-ppcre-devel] Re: Question: PCRE -> Thompson NFA Implementation
In-Reply-To: (Brent Fulgham's message of "Tue, 13 Feb 2007 21:08:33 -0800")
References: <800502.28683.qm@web81202.mail.mud.yahoo.com>
Message-ID:

On Tue, 13 Feb 2007 21:08:33 -0800, Brent Fulgham wrote:

> The implementations are all written by volunteers who submit them.
> If you see any low hanging fruit, let me know and I'll be glad to
> update the tests.

For example here's a more compact (and most likely faster) way to read
a whole file at once (untested):

  (defun get-input-chars (stream)
    (let ((result (make-string (file-length stream))))
      (read-sequence result stream)
      result))

See also .

It probably won't change the outcome of the benchmark, but the
original function looks pretty weird.

Also, if you're really obsessed with speed, CL-PPCRE:ALL-MATCHES
certainly isn't the best way to count all matches because it conses up
a list you don't need. You should use DO-SCANS and count yourself.

Finally, I think that ">[^\\n]*\\n|\\n" should be replaced by
"(?m)\\n|^>.*" - maybe not for speed, but for clarity and correctness.
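
The advice about counting with DO-SCANS instead of ALL-MATCHES might look roughly like the following. This is a minimal sketch: COUNT-MATCHES is a hypothetical helper name, not a CL-PPCRE function, and loading the library through Quicklisp is an assumption.

```lisp
;; Assumption: Quicklisp is available for loading CL-PPCRE.
(ql:quickload :cl-ppcre :silent t)

(defun count-matches (regex string)
  "Count the non-overlapping matches of REGEX in STRING without
building the list of positions that ALL-MATCHES would return."
  (let ((count 0))
    ;; DO-SCANS binds the four variables for every match and runs the
    ;; body once per match; COUNT is the result form evaluated at the end.
    (cl-ppcre:do-scans (match-start match-end reg-starts reg-ends
                        regex string count)
      (incf count))))
```

ALL-MATCHES returns a flat list of start and end positions, so the same number could be obtained as half the length of that list, at the cost of allocating it.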
> But of course, the tests are all for entertainment purposes only ;-)

Yes... :)

From sabetts at vcn.bc.ca Fri Feb 16 04:27:45 2007
From: sabetts at vcn.bc.ca (Shawn Betts)
Date: Thu, 15 Feb 2007 20:27:45 -0800
Subject: [cl-ppcre-devel] searching backward?
Message-ID: <8664a2ohfi.fsf@shitbender.gagrod>

Hi folks,

Is there a way to search backward through a string?

* (ppcre:scan "abc" "111abc1111abc11")

3
6
#()
#()
* (ppcre:scan "abc" "111abc1111abc11" :start 15 :end 0)

NIL

I tried just making end < start which doesn't seem to work :). I
suppose what I'm looking for is the equivalent of :from-end that most
cl sequence functions have.

-Shawn

From edi at agharta.de Fri Feb 16 08:19:17 2007
From: edi at agharta.de (Edi Weitz)
Date: Fri, 16 Feb 2007 09:19:17 +0100
Subject: [cl-ppcre-devel] searching backward?
In-Reply-To: <8664a2ohfi.fsf@shitbender.gagrod> (Shawn Betts's message of "Thu, 15 Feb 2007 20:27:45 -0800")
References: <8664a2ohfi.fsf@shitbender.gagrod>
Message-ID:

On Thu, 15 Feb 2007 20:27:45 -0800, Shawn Betts wrote:

> Is there a way to search backward through a string?
>
> * (ppcre:scan "abc" "111abc1111abc11")
>
> 3
> 6
> #()
> #()
> * (ppcre:scan "abc" "111abc1111abc11" :start 15 :end 0)
>
> NIL
>
> I tried just making end < start which doesn't seem to work :). I
> suppose what I'm looking for is the equivalent of :from-end that
> most cl sequence functions have.

No, there's no such thing as a :FROM-END keyword argument or the
equivalent, and I'm also not aware of a regex facility in another
programming language which has that.

If you really need it, you could loop through the string applying SCAN
until it matches with decreasing values for START, but that could be
quite inefficient, of course. An alternative would be to work on
(REVERSE TARGET) instead of TARGET, but you'll have to think hard how
your regular expression should look like in that case - the semantics
of things like "*" will certainly be different.

Cheers,
Edi.
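
The first workaround from this reply, calling SCAN with decreasing values for START until it matches, could be sketched as follows. SCAN-FROM-END is a hypothetical helper name, not part of CL-PPCRE, and loading the library through Quicklisp is an assumption.

```lisp
;; Assumption: Quicklisp is available for loading CL-PPCRE.
(ql:quickload :cl-ppcre :silent t)

(defun scan-from-end (regex string)
  "Return the start and end of the match of REGEX in STRING that
begins furthest to the right, or NIL if there is no match."
  ;; Compile the regex once instead of once per SCAN call.
  (let ((scanner (cl-ppcre:create-scanner regex)))
    (loop for start from (length string) downto 0
          do (multiple-value-bind (match-start match-end)
                 (cl-ppcre:scan scanner string :start start)
               ;; The first START that yields a match is the rightmost
               ;; position at which a match can begin.
               (when match-start
                 (return (values match-start match-end)))))))
```

In the worst case every position is tried and each attempt scans to the end of the string, which is the inefficiency the reply warns about.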
From sabetts at vcn.bc.ca Fri Feb 16 14:19:42 2007
From: sabetts at vcn.bc.ca (Shawn Betts)
Date: Fri, 16 Feb 2007 06:19:42 -0800
Subject: [cl-ppcre-devel] searching backward?
In-Reply-To: (Edi Weitz's message of "Fri, 16 Feb 2007 09:19:17 +0100")
References: <8664a2ohfi.fsf@shitbender.gagrod>
Message-ID: <861wkqnq0x.fsf@shitbender.gagrod>

Edi Weitz writes:

> No, there's no such thing as a :FROM-END keyword argument or the
> equivalent, and I'm also not aware of a regex facility in another
> programming language which has that.

Emacs does it somehow.

> If you really need it, you could loop through the string applying
> SCAN until it matches with decreasing values for START, but that
> could be quite inefficient, of course. An alternative would be to
> work on (REVERSE TARGET) instead of TARGET, but you'll have to think
> hard how your regular expression should look like in that case - the
> semantics of things like "*" will certainly be different.

I was afraid you'd say that :). I guess I'll look closer into how
emacs does it.

Thanks!

-Shawn

From edi at agharta.de Fri Feb 16 11:58:42 2007
From: edi at agharta.de (Edi Weitz)
Date: Fri, 16 Feb 2007 12:58:42 +0100
Subject: [cl-ppcre-devel] searching backward?
In-Reply-To: <861wkqnq0x.fsf@shitbender.gagrod> (Shawn Betts's message of "Fri, 16 Feb 2007 06:19:42 -0800")
References: <8664a2ohfi.fsf@shitbender.gagrod> <861wkqnq0x.fsf@shitbender.gagrod>
Message-ID:

On Fri, 16 Feb 2007 06:19:42 -0800, Shawn Betts wrote:

> Edi Weitz writes:
>
>> If you really need it, you could loop through the string applying
>> SCAN until it matches with decreasing values for START, but that
>> could be quite inefficient, of course. An alternative would be to
>> work on (REVERSE TARGET) instead of TARGET, but you'll have to
>> think hard how your regular expression should look like in that
>> case - the semantics of things like "*" will certainly be
>> different.
>
> I was afraid you'd say that :).
> I guess I'll look closer into how
> emacs does it.

Of course, if you're really adventurous, you could look at the source
code of CREATE-SCANNER-AUX in CL-PPCRE and think about efficient
variants of ADVANCE-FN for searching backwards. My guess (from
looking at the Emacs C code for two minutes) is that this is more or
less what Emacs is doing as well.

From edi at agharta.de Fri Feb 16 12:10:28 2007
From: edi at agharta.de (Edi Weitz)
Date: Fri, 16 Feb 2007 13:10:28 +0100
Subject: [cl-ppcre-devel] searching backward?
In-Reply-To: (Edi Weitz's message of "Fri, 16 Feb 2007 12:58:42 +0100")
References: <8664a2ohfi.fsf@shitbender.gagrod> <861wkqnq0x.fsf@shitbender.gagrod>
Message-ID:

On Fri, 16 Feb 2007 12:58:42 +0100, Edi Weitz wrote:

> Of course, if you're really adventurous, you could look at the
> source code of CREATE-SCANNER-AUX in CL-PPCRE and think about
> efficient variants of ADVANCE-FN for searching backwards. My guess
> (from looking at the Emacs C code for two minutes) is that this is
> more or less what Emacs is doing as well.

I forgot: In an empty Emacs *scratch* buffer type "aaaaaaaa" (eight
#\a's) and put point in the middle (after the fourth #\a). Then
evaluate (using eval-expression) the following

  (re-search-forward "a+")

This should give you 9 and is what one would expect - the regex engine
matches the four #\a's after point. Now put point back in the middle
of the string and evaluate

  (re-search-backward "a+")

That'll give you 4, i.e. the engine matches (only) the fourth #\a - a
string of length one.

I think this confirms my point that Emacs somehow has to go backwards
and step by step while the regular expressions themselves still "match
forwards" - so to say. It also shows that scanning backwards somehow
destroys the semantics of some of the regex constituents - "*" or "+"
used to mean "longest possible match", but is a string of length one
really the longest match?
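
The same effect can be reproduced in CL-PPCRE by stepping START backwards while the regex itself still matches forwards. The sketch below restricts the scan to the first four characters of "aaaaaaaa", which plays the role of point in the Emacs experiment. MATCHES-BY-START is a hypothetical helper name, not part of CL-PPCRE, and loading the library through Quicklisp is an assumption.

```lisp
;; Assumption: Quicklisp is available for loading CL-PPCRE.
(ql:quickload :cl-ppcre :silent t)

(defun matches-by-start (regex string end)
  "For START from END down to 0, collect (START MATCH-START MATCH-END)
for a forward scan of REGEX that is limited to STRING[START, END)."
  (loop for start from end downto 0
        collect (multiple-value-bind (match-start match-end)
                    (cl-ppcre:scan regex string :start start :end end)
                  (list start match-start match-end))))

;; (matches-by-start "a+" "aaaaaaaa" 4)
;; => ((4 NIL NIL) (3 3 4) (2 2 4) (1 1 4) (0 0 4))
```

A backward search that stops at the first success reports the one-character match from 3 to 4, although the match from 0 to 4 ends at the same place and is longer.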
From edi at agharta.de Thu Feb 22 18:09:59 2007
From: edi at agharta.de (Edi Weitz)
Date: Thu, 22 Feb 2007 19:09:59 +0100
Subject: [cl-ppcre-devel] Darcs repositories
Message-ID:

[My apologies if you get this more than once.]

Several people have asked for Darcs repositories of my software.
These do exist now:

  http://common-lisp.net/~loliveira/ediware/

Special thanks to Luís Oliveira who made this possible and who
maintains the repositories.

Cheers,
Edi.