From klaus at harbo.net Sun May 2 20:46:23 2004 From: klaus at harbo.net (Klaus Harbo) Date: Sun, 02 May 2004 22:46:23 +0200 Subject: [cl-ppcre-devel] Multi-line-mode Message-ID: <40955E1F.1060508@harbo.net> I have been using cl-ppcre for quite some time - great program library, thx! - but I am now looking to match multi-line data, and I?m having some difficulties... I would expect (scan (create-scanner ".+" :multi-line-mode t) (concatenate 'string '(#\a #\Newline #\b))) to match all three characters, but it matches only #\a. I downloaded cl-ppcre from CVS just now to check that this occurs in the most recent version; I'm using Lispworks 4.3.6. CMUCL 18d seems to behave the same way. Can I make :everything (".") match multiple lines of input, and if so, how? Any help much appreciated. best, -Klaus. From edi at agharta.de Sun May 2 21:36:50 2004 From: edi at agharta.de (Edi Weitz) Date: Sun, 02 May 2004 23:36:50 +0200 Subject: [cl-ppcre-devel] Multi-line-mode In-Reply-To: <40955E1F.1060508@harbo.net> (Klaus Harbo's message of "Sun, 02 May 2004 22:46:23 +0200") References: <40955E1F.1060508@harbo.net> Message-ID: Hi Klaus! On Sun, 02 May 2004 22:46:23 +0200, Klaus Harbo wrote: > I have been using cl-ppcre for quite some time - great program > library, thx! - but I am now looking to match multi-line data, and > I?m having some difficulties... > > I would expect > > (scan > (create-scanner ".+" :multi-line-mode t) > (concatenate 'string '(#\a #\Newline #\b))) > > to match all three characters, but it matches only #\a. I > downloaded cl-ppcre from CVS just now to check that this occurs in > the most recent version; I don't really use the CVS repository at common-lisp.net. The most recent version is usually the one available from weitz.de. > I'm using Lispworks 4.3.6. CMUCL 18d seems to behave the same way. > > Can I make :everything (".") match multiple lines of input, and if > so, how? What you're looking for is "single-line mode", not "multi-line mode", actually. "single-line mode" means to treat the whole target string as if it were a single line and thus the dot also matches newlines (which it normally doesn't). "multi-line mode" means to treat the whole target string as if it consisted of multiple lines and thus the anchors "^" and "$" also match around newlines in the middle of the string (but see also "\A", "\Z", and "\z"). These two features are orthogonal. * (scan "(?m).+" "a b") 0 1 #() #() * (scan "(?s).+" "a b") 0 3 #() #() For the gory details see the Perl documentation, specifically "man perlre" - remember that CL-PPCRE tries hard to be compatible with Perl. HTH, Edi. From edi at agharta.de Wed May 19 05:26:57 2004 From: edi at agharta.de (Edi Weitz) Date: Wed, 19 May 2004 07:26:57 +0200 Subject: [cl-ppcre-devel] New release 0.7.7 Message-ID: Hi! A new release is available from . Here's the relevant part from the changelog: Version 0.7.7 2004-05-19 Fixed bug in NEWLINE-SKIPPER (caught by RegexCoach user Thomas-Paz Hartman) Added doc strings for PPCRE-SYNTAX-ERROR and friends (after playing with slime-apropos-package) Added hyperdoc support Have fun, Edi. From edi at agharta.de Wed May 19 10:19:00 2004 From: edi at agharta.de (Edi Weitz) Date: Wed, 19 May 2004 12:19:00 +0200 Subject: [cl-ppcre-devel] Re: Programming Problem - Need your CL-PPCRE Experience In-Reply-To: <200405152152.25678.tjs_ng@yahoo.de> (Thomas Schilling's message of "Sat, 15 May 2004 21:52:25 +0200") References: <200405152152.25678.tjs_ng@yahoo.de> Message-ID: Hi Thomas! Sorry for the delay. On Sat, 15 May 2004 21:52:25 +0200, Thomas Schilling wrote: > I'm currently writing a pattern matcher for trees. This is > (hopefully) going to become a tool for easily extracting information > from tree-structured documents, specifically XML/HTML files. (Yes, I > know of SAX and XSLT and XPath but I think this can become more > powerful--I plan to allow embeddable code. Nevertheless it could be > added to e.g. CL-XML.) > > In fact it's a regular expression matcher that two-dimensionally > matches on lists. (car is considered to be the node's name, > i.e. tag-name, and the cdr it's children). So that's the reason why > I'm asking you. > > So here's my problem: > I, well, stole some ideas from you. Because we need heavy > backtracking and looking ahead I implemented it (just like you, I > think) largely via closures, i.e. each rule is matched via a > function, and the rules are linked via a next-rule (function-) > variable that is put inside each closure. So e.g. the repetion > function can check by itself if the rest of the pattern matches when > the number of repetitions is tried. This works great except in the > folloing case: > > When I have (in perl-notation) "abaa" =~ /^(a|ab)*a/ the match > failes. That's because the alternative which has only the > information of the two choices "a" and "ab" but no successor > information (because it can have two--the alternative itself or the > "a" after the repetition) succeeds when matching the first > alternative and never tries to match the second, although then the > repetition could match. > > I currently have no real solution to this problem. How did you solve > this? I looked it up in your code but I couldn't find it (maybe it's > masked by the many optimizations?). > > I would be very thankful if you could help me. > > Ok, so here's the code: http://www.inf.tu-dresden.de/~s3815210/lisp > tree-match.lisp. You can simply C-c C-k it. and then C-x C-e the > (test-all) at the end of the file. It should print the one failing > test. The interesting functions are probably OR-M and > GREEDY-REPEAT-M. The problem is mentioned in line 286 in the > PARSE-PATTERN function. I only looked briefly at your code 'cause I'm busy with other things but I /think/ the problem is in OR-M: (defun or-m (options next-rule) (declare (type function next-rule)) (matcher-fn (do ((r options (cdr r))) ((null r) nil) (multiple-value-bind (pos inp) (funcall (the function (car r)) input position) (when pos (multiple-value-bind (pos2 inp2) (funcall next-rule inp pos) (when pos2 (return (values position inp2))))))))) Here you try each function in OPTIONS in turn and if one of them succeeds you immediately call NEXT-RULE so there's no chance for backtracking. What you have to do is to construct a new function for each element in OPTIONS which itself knows how to proceed (namely via NEXT-RULE) and try these new functions instead of the ones in OPTIONS. FWIW, here's the relevant code from CL-PPCRE: (defmethod create-matcher-aux ((alternation alternation) next-fn) ;; first create closures for all alternations of ALTERNATION (let ((all-matchers (mapcar #'(lambda (choice) (create-matcher-aux choice next-fn)) (choices alternation)))) ;; now create a closure which checks if one of the closures ;; created above can succeed (lambda (start-pos) (declare (type fixnum start-pos)) (loop for matcher in all-matchers thereis (funcall (the function matcher) start-pos))))) I don't think this is masked by too many optimizations... :) The important part is the creation of ALL-MATCHERS from (CHOICES ALTERNATION) (which would be your OPTIONS). Does that help? This is just a wild guess from looking at your code for a couple of minutes. If that doesn't help feel free to ask me again. > Please ask me if you have any questions. > > If you think of a place to which should post this please tell me. I've sent a copy to the cl-ppcre-devel mailing list. We should take further discussion there so they'll be archived (if you don't mind). HTH, Edi.