From edi at agharta.de Fri Oct 1 09:10:09 2004
From: edi at agharta.de (Edi Weitz)
Date: Fri, 01 Oct 2004 11:10:09 +0200
Subject: [cl-ppcre-devel] Re: Cl-ppcre usage
In-Reply-To: (Sébastien Saint-Sevin's message of "Thu, 30 Sep 2004 15:33:40 +0200")
References:
Message-ID: <87k6uadcsu.fsf@miles.agharta.de>

I'll be away for some days so I probably won't be able to reply in
detail until next week.

Have a nice weekend,
Edi.

From seb-cl-mailist at matchix.com Mon Oct 11 16:52:56 2004
From: seb-cl-mailist at matchix.com (Sébastien Saint-Sevin)
Date: Mon, 11 Oct 2004 18:52:56 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
Message-ID:

Hi Edi & list,

I'm doing multi-lines regex searches over big files that can't be
converted to single string. So I introduced a kind of buffer that I'm
using to search.

Now, I need to add a constraint to scan, do-scans & others (in addition
to (&key start end)) : I want to be able to specify to the engine that a
scan must start before a certain index in the string (to avoid searching
further results that will be cancelled later because of my buffered
multi-line matching process).

Logically, this :must-start-before value corresponds to the first line
of my buffer. If nothing starts at first line, I need to move the search
one line forward, so everything that the engine would match later on in
the string is wasted time.

How can I do it ?

Cheers,
Sebastien.

PS: Edi, if you are back, my previous post is still an open question
;-) (the one with FILTER...)

From edi at agharta.de Mon Oct 11 17:32:09 2004
From: edi at agharta.de (Edi Weitz)
Date: Mon, 11 Oct 2004 19:32:09 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To: (Sébastien Saint-Sevin's message of "Mon, 11 Oct 2004 18:52:56 +0200")
References:
Message-ID:

Hi Sébastien!
On Mon, 11 Oct 2004 18:52:56 +0200, Sébastien Saint-Sevin wrote:

> I'm doing multi-lines regex searches over big files that can't be
> converted to single string. So I introduced a kind of buffer that
> I'm using to search.
>
> Now, I need to add a constraint to scan, do-scans & others (in
> addition to (&key start end)) : I want to be able to specify to the
> engine that a scan must start before a certain index in the string
> (to avoid searching further results that will be cancelled later
> because of my buffered multi-line matching process).
>
> Logically, this :must-start-before value corresponds to the first
> line of my buffer. If nothing starts at first line, I need to move
> the search one line forward, so everything that the engine would
> match later on in the string is wasted time.
>
> How can I do it ?

Have you considered using something like

  (?s:(?=.{n}))

where n obviously is an integer computed from your constraints above?
I don't know how this'll behave performance-wise but you could just
try it... :)

Or have I misunderstood your question? Actually, I'm not sure why the
END keyword parameter doesn't suffice. Can you give an example?

> PS: Edi, if you are back, my previous post is still an open question
> ;-) (the one with FILTER...)

Yes, I'm back but unfortunately I'm very busy with commercial stuff
right now. Sorry, filters will have to wait some more.

Cheers,
Edi.

From seb-cl-mailist at matchix.com Mon Oct 11 19:35:41 2004
From: seb-cl-mailist at matchix.com (Sébastien Saint-Sevin)
Date: Mon, 11 Oct 2004 21:35:41 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To:
Message-ID:

> Hi Sébastien!
>
> On Mon, 11 Oct 2004 18:52:56 +0200, Sébastien Saint-Sevin
> wrote:
>
> > I'm doing multi-lines regex searches over big files that can't be
> > converted to single string. So I introduced a kind of buffer that
> > I'm using to search.
> >
> > Now, I need to add a constraint to scan, do-scans & others (in
> > addition to (&key start end)) : I want to be able to specify to the
> > engine that a scan must start before a certain index in the string
> > (to avoid searching further results that will be cancelled later
> > because of my buffered multi-line matching process).
> >
> > Logically, this :must-start-before value corresponds to the first
> > line of my buffer. If nothing starts at first line, I need to move
> > the search one line forward, so everything that the engine would
> > match later on in the string is wasted time.
> >
> > How can I do it ?
>
> Have you considered using something like
>
>   (?s:(?=.{n}))
>
> where n obviously is an integer computed from your constraints above?
> I don't know how this'll behave performance-wise but you could just
> try it... :)
>
> Or have I misunderstood your question? Actually, I'm not sure why the
> END keyword parameter doesn't suffice. Can you give an example?
>

As far as I understand it, (?s:(?=.{n})) will only guarantee that at
least n chars are remaining from match-start in the consumed string.
This is not what I want. I want something that guarantees that
match-start will be before index n (meaning n'th char in consumed
string), whether match-end is before or after this index n.

> > PS: Edi, if you are back, my previous post is still an open question
> > ;-) (the one with FILTER...)
>
> Yes, I'm back but unfortunately I'm very busy with commercial stuff
> right now. Sorry, filters will have to wait some more.
>
> Cheers,
> Edi.

Here is what I've got right now (it's ok for my needs actually).

(defclass filter (regex)
  ((num :initarg :num
        :accessor num
        :type fixnum
        :documentation "The number of the register this filter refers to.")
   (predicate :initarg :predicate
              :accessor predicate
              :documentation "The predicate to validate the register with"))
  (:documentation "FILTER objects represent the combination of a register and a predicate.
This is not available in regex string, but only used in parse tree."))

(defmethod create-matcher-aux ((filter filter) next-fn)
  (declare (type function next-fn))
  ;; the position of the corresponding REGISTER within the whole
  ;; regex; we start to count at 0
  (let ((num (num filter)))
    (lambda (start-pos)
      (declare (type fixnum start-pos))
      (let ((reg-start (svref *reg-starts* num))
            (reg-end (svref *reg-ends* num)))
        ;; only bother to check if the corresponding REGISTER has
        ;; matched successfully already
        (and reg-start
             (funcall (predicate filter) (subseq *string* reg-start reg-end))
             (funcall next-fn start-pos))))))

ADDED TO (defun convert-aux (parse-tree) ...

  ;; (:FILTER )
  ((:filter)
    (let ((backref-number (second parse-tree))
          (predicate (third parse-tree)))
      (declare (type fixnum backref-number))
      (when (or (not (typep backref-number 'fixnum))
                (<= backref-number 0))
        (signal-ppcre-syntax-error "Illegal back-reference: ~S" parse-tree))
      (unless (or (typep predicate 'symbol)
                  (typep predicate 'function))
        (signal-ppcre-syntax-error "Illegal predicate: ~S" parse-tree))
      ;; stop accumulating into STARTS-WITH and increase
      ;; MAX-BACK-REF if necessary
      (setq accumulate-start-p nil
            max-back-ref (max (the fixnum max-back-ref)
                              backref-number))
      (make-instance 'filter
                     ;; we start counting from 0 internally
                     :num (1- backref-number)
                     :predicate predicate)))

ADDED FOR MY PURPOSES...
(defmethod create-scanner-with-predicate ((regex-string string) predicate
                                          &key case-insensitive-mode
                                               multi-line-mode
                                               single-line-mode
                                               extended-mode
                                               destructive)
  (declare (optimize speed (safety 0) (space 0) (debug 0)
                     (compilation-speed 0)
                     #+:lispworks (hcl:fixnum-safety 0)))
  (declare (ignore destructive))
  ;; parse the string into a parse-tree and then call CREATE-SCANNER again
  (let* ((*extended-mode-p* extended-mode)
         (quoted-regex-string (if *allow-quoting*
                                  (quote-sections (clean-comments regex-string extended-mode))
                                  regex-string))
         (*syntax-error-string* (copy-seq quoted-regex-string))
         (parse-tree (parse-string quoted-regex-string)))
    ;; wrap the result with FILTER to check for predicate
    (create-scanner `(:sequence (:register ,(shift-back-reference parse-tree))
                                (:filter 1 ,predicate))
                    :case-insensitive-mode case-insensitive-mode
                    :multi-line-mode multi-line-mode
                    :single-line-mode single-line-mode
                    :destructive t)))

(defun shift-back-reference (tree)
  (if (and (consp tree)
           (eq (first tree) :back-reference))
      `(:back-reference ,(1+ (second tree)))
      (if (atom tree)
          tree
          (cons (shift-back-reference (car tree))
                (shift-back-reference (cdr tree))))))

From edi at agharta.de Wed Oct 13 23:05:32 2004
From: edi at agharta.de (Edi Weitz)
Date: Thu, 14 Oct 2004 01:05:32 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To: (Sébastien Saint-Sevin's message of "Mon, 11 Oct 2004 21:35:41 +0200")
References:
Message-ID:

On Mon, 11 Oct 2004 21:35:41 +0200, Sébastien Saint-Sevin wrote:

> As far as I understand it, (?s:(?=.{n})) will only guarantee that at
> least n chars are remaining from match-start in the consumed
> string. This is not what I want. I want something that guarantees that
> match-start will be before index n (meaning n'th char in consumed
> string), whether match-end is before or after this index n.

Well, you could compute n from what you know but that would imply
creating a new regular expression for each iteration which is probably
not what you want.

> Here is what I've got right now (it's ok for my needs actually).

I was actually thinking about a simpler version which was just a
zero-length thingy that you could insert anywhere in your code and
which would call a user-defined function. It'd be more efficient and I
think you could still achieve with it what you want.

I'll try to release something in the next days.

Cheers,
Edi.

From seb-cl-mailist at matchix.com Thu Oct 14 08:46:56 2004
From: seb-cl-mailist at matchix.com (Sébastien Saint-Sevin)
Date: Thu, 14 Oct 2004 10:46:56 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To:
Message-ID:

> > As far as I understand it, (?s:(?=.{n})) will only guarantee that at
> > least n chars are remaining from match-start in the consumed
> > string. This is not what I want. I want something that guarantees that
> > match-start will be before index n (meaning n'th char in consumed
> > string), whether match-end is before or after this index n.
>
> Well, you could compute n from what you know but that would imply
> creating a new regular expression for each iteration which is probably
> not what you want.

Exactly, in fact I need n to be a parameter of the engine, or a
parameter of the compiled regex (like prepared SQL statements !).

>
> > Here is what I've got right now (it's ok for my needs actually).
>
> I was actually thinking about a simpler version which was just a
> zero-length thingy that you could insert anywhere in your code and
> which would call a user-defined function. It'd be more efficient and I
> think you could still achieve with it what you want.
>
> I'll try to release something in the next days.
>
> Cheers,
> Edi.
>
>

I'm not sure to fully understand what you mean. I've coupled filter
with registers coz I plan to use it at multiple places in the regex.

Ex: If I use two dictionaries, I can say (regex string in double quote
(no parse tree here)).

(:sequence
  (:register "\b\w+\b")
  (:filter 1 check-dic1)
  " *[0-9]{5} *"
  (:register "\b\w+\b")
  (:filter 2 check-dic2))

This would match the full string that consists of two words that are
in my dictionaries and that are separated by
space(s)-fivedigits-space(s). Plus, I can extract via registers the two
elected values.

Cheers,
Sebastien.

From edi at agharta.de Thu Oct 14 12:50:02 2004
From: edi at agharta.de (Edi Weitz)
Date: Thu, 14 Oct 2004 14:50:02 +0200
Subject: [cl-ppcre-devel] New version 0.9.0
Message-ID:

Hi!

A new release is available from .

Here's the relevant part from the changelog:

  Version 0.9.0
  2004-10-14
  Experimental support for "filters"
  Bugfix for standalone regular expressions (ACCUMULATE-START-P wasn't set to NIL)

Have fun,
Edi.

From edi at agharta.de Thu Oct 14 12:53:50 2004
From: edi at agharta.de (Edi Weitz)
Date: Thu, 14 Oct 2004 14:53:50 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To: (Sébastien Saint-Sevin's message of "Thu, 14 Oct 2004 10:46:56 +0200")
References:
Message-ID:

On Thu, 14 Oct 2004 10:46:56 +0200, Sébastien Saint-Sevin wrote:

> I'm not sure to fully understand what you mean. I've coupled filter
> with registers coz I plan to use it at multiple places in the regex.

Yes, but the coupling with registers is costly if your filter doesn't
use registers.

> Ex: If I use two dictionaries, I can say (regex string in double
> quote (no parse tree here)).
>
> (:sequence
>   (:register "\b\w+\b")
>   (:filter 1 check-dic1)
>   " *[0-9]{5} *"
>   (:register "\b\w+\b")
>   (:filter 2 check-dic2))
>
> This would match the full string that consists of two words that are
> in my dictionaries and that are separated by
> space(s)-fivedigits-space(s). Plus, I can extract via registers the
> two elected values.

I've just released a new version which implements a filter variant
that should enable you to do this as well.
These filters are also (hopefully) properly integrated into the
optimization process.

Thanks for urging me to do this... :)

Cheers,
Edi.

From seb-cl-mailist at matchix.com Thu Oct 14 14:14:27 2004
From: seb-cl-mailist at matchix.com (Sébastien Saint-Sevin)
Date: Thu, 14 Oct 2004 16:14:27 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To:
Message-ID:

FIRST POINT
-----------
>
> Thanks for urging me to do this... :)
>

Thanks for making it that quick. I will try it.

SECOND POINT
------------
> Well, you could compute n from what you know but that would imply
> creating a new regular expression for each iteration which is probably
> not what you want.

Can you confirm me that you see no other way of doing it? Right now in
my code, I just do the full scan and throw the results away if the
start was too far in the string. I've not tried compiling a new regex
at each iteration but I guess it will be longer.

Cheers,
Sebastien.

From edi at agharta.de Thu Oct 14 14:22:50 2004
From: edi at agharta.de (Edi Weitz)
Date: Thu, 14 Oct 2004 16:22:50 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To: (Sébastien Saint-Sevin's message of "Thu, 14 Oct 2004 16:14:27 +0200")
References:
Message-ID:

On Thu, 14 Oct 2004 16:14:27 +0200, Sébastien Saint-Sevin wrote:

> SECOND POINT
> ------------
>> Well, you could compute n from what you know but that would imply
>> creating a new regular expression for each iteration which is
>> probably not what you want.
>
> Can you confirm me that you see no other way of doing it? Right now
> in my code, I just do the full scan and throw the results away if
> the start was too far in the string. I've not tried compiling a new
> regex at each iteration but I guess it will be longer.

With the new filter facility you should be able to create a filter
which checks the current position against some special variable, say
*MAX-START*.
You can set *MAX-START* accordingly before each scan but the regular
expression will only be compiled once because it doesn't change.
Something like that should work, shouldn't it?

Cheers,
Edi.

From seb-cl-mailist at matchix.com Thu Oct 14 15:01:39 2004
From: seb-cl-mailist at matchix.com (Sébastien Saint-Sevin)
Date: Thu, 14 Oct 2004 17:01:39 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To:
Message-ID:

> > SECOND POINT
> > ------------
> >> Well, you could compute n from what you know but that would imply
> >> creating a new regular expression for each iteration which is
> >> probably not what you want.
> >
> > Can you confirm me that you see no other way of doing it? Right now
> > in my code, I just do the full scan and throw the results away if
> > the start was too far in the string. I've not tried compiling a new
> > regex at each iteration but I guess it will be longer.
>
> With the new filter facility you should be able to create a filter
> which checks the current position against some special variable, say
> *MAX-START*. You can set *MAX-START* accordingly before each scan but
> the regular expression will only be compiled once because it doesn't
> change. Something like that should work, shouldn't it?
>

It should. I'll have to try it.

How can I then abort the scan quickly, while avoiding funcalling the
filter with the rest of the string ? Something like
(setf *start-pos* end-of-string-value) ?

Thanks a lot, you're so good ;-)

Sebastien.

From edi at agharta.de Thu Oct 14 15:32:13 2004
From: edi at agharta.de (Edi Weitz)
Date: Thu, 14 Oct 2004 17:32:13 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To: (Sébastien Saint-Sevin's message of "Thu, 14 Oct 2004 17:01:39 +0200")
References:
Message-ID:

On Thu, 14 Oct 2004 17:01:39 +0200, Sébastien Saint-Sevin wrote:

> How can I then abort the scan quickly, while avoiding funcalling the
> filter with the rest of the string ?
> Something like (setf
> *start-pos* end-of-string-value) ?

No, never change these internal values unless you're looking for
trouble - see docs. Just return NIL from the filter. (I suppose you're
talking about the 0.9.0 filters here.)

Something like

  (defvar *max-start-pos* 0)

  (defun my-filter (pos)
    (and (< pos *max-start-pos*) pos))

  (scan '(:sequence ... (:filter my-filter 0) ...) target)

should assure that there's only a match if the position between the
first ... and the second ... is below *MAX-START-POS*.

The zero is optional but it'll potentially help the regex engine to
optimize the scanner depending on the rest of the parse tree.

Here's an example for optimization:

  * (defun my-filter (pos)
      (print "I was called")
      pos)

  MY-FILTER
  * (cl-ppcre:scan '(:sequence "fo" (:filter my-filter) "bar") "xyzfoobar")

  "I was called"
  NIL
  * (cl-ppcre:scan '(:sequence "fo" (:filter my-filter 0) "bar") "xyzfoobar")

  NIL

Note that in the second example the filter wasn't called at all
because due to the zero-length declaration the regex engine was able
to determine that the target string must end with "fobar" - which it
didn't. In the first example this couldn't be done because there
wasn't enough information available.

You shouldn't lie to the regex engine, though... :)

> Thanks a lot, you're so good ;-)

Nah... :)

Cheers,
Edi.

From seb-cl-mailist at matchix.com Thu Oct 14 16:18:46 2004
From: seb-cl-mailist at matchix.com (Sébastien Saint-Sevin)
Date: Thu, 14 Oct 2004 18:18:46 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To:
Message-ID:

> > How can I then abort the scan quickly, while avoiding funcalling the
> > filter with the rest of the string ? Something like (setf
> > *start-pos* end-of-string-value) ?
>
> No, never change these internal values unless you're looking for
> trouble - see docs. Just return NIL from the filter. (I suppose you're
> talking about the 0.9.0 filters here.)
>
> Something like
>
>   (defvar *max-start-pos* 0)
>
>   (defun my-filter (pos)
>     (and (< pos *max-start-pos*) pos))
>
>   (scan '(:sequence ... (:filter my-filter 0) ...) target)
>
> should assure that there's only a match if the position between the
> first ... and the second ... is below *MAX-START-POS*.
>
> The zero is optional but it'll potentially help the regex engine to
> optimize the scanner depending on the rest of the parse tree.
>

The majority of regex I'm using are unfortunately not optimizable.

Going back to my buffer. Let's say I'm looking at ten lines at a time.
I want start to occur only at first line and I can do it with filters
(that's great !). But the engine will still continue moving forward
into the string for the nine remaining lines, and it will call my
filter for each position in each line to just get nil every time.

So the question for forcing a full abort immediately and not calling so
many times the filter. In fact this is the case for all filters that,
once they have returned nil, will return nil forever (and are in a
position in the parse tree where they can't be shadowed by some
backtracking!).

I know it's an optimization problem but I'm running regex on big
files...

Cheers,
Sebastien.

From edi at agharta.de Thu Oct 14 17:20:04 2004
From: edi at agharta.de (Edi Weitz)
Date: Thu, 14 Oct 2004 19:20:04 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To: (Sébastien Saint-Sevin's message of "Thu, 14 Oct 2004 18:18:46 +0200")
References:
Message-ID:

On Thu, 14 Oct 2004 18:18:46 +0200, Sébastien Saint-Sevin wrote:

>>   (defvar *max-start-pos* 0)
>>
>>   (defun my-filter (pos)
>>     (and (< pos *max-start-pos*) pos))
>>
>>   (scan '(:sequence ... (:filter my-filter 0) ...) target)

> The majority of regex I'm using are unfortunately not optimizable.

Are you sure?

> Going back to my buffer. Let's say I'm looking at ten lines at a
> time.
> I want start to occur only at first line and I can do it with
> filters (that's great !). But the engine will still continue moving
> forward into the string for the nine remaining lines, and it will
> call my filter for each position in each line to just get nil
> every time.

I'm sorry but I still don't fully understand your problem. Could you
give an example with actual code and data?

> So the question for forcing a full abort immediately and not calling
> so many times the filter. In fact this is the case for all filters
> that, once they have returned nil, will return nil forever (and are
> in a position in the parse tree where they can't be shadowed by some
> backtracking!).

Are you using DO-SCANS or another loop construct? How about this?

  (defvar *max-start-pos* 0)
  (defvar *stop-immediately* nil)

  (defun my-filter (pos)
    (cond ((< pos *max-start-pos*) pos)
          (t (setq *stop-immediately* t)
             nil)))

  (let (*stop-immediately*)
    (do-scans (...)
      (when *stop-immediately*
        (return))
      ;;; your stuff here
      ))

So once *STOP-IMMEDIATELY* is set by your filter the loop will be
instantly exited.

Cheers,
Edi.

From seb-cl-mailist at matchix.com Thu Oct 14 18:38:17 2004
From: seb-cl-mailist at matchix.com (Sébastien Saint-Sevin)
Date: Thu, 14 Oct 2004 20:38:17 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To:
Message-ID:

> > Going back to my buffer. Let's say I'm looking at ten lines at a
> > time. I want start to occur only at first line and I can do it with
> > filters (that's great !). But the engine will still continue moving
> > forward into the string for the nine remaining lines, and it will
> > call my filter for each position in each line to just get nil
> > every time.
>
> I'm sorry but I still don't fully understand your problem. Could you
> give an example with actual code and data?

(defvar *my-string* "line1 word1 word2
line2 word1 word2
line3 word1 word2")

(defvar *my-scanner*
  '(:sequence
    (:filter my-filter 0)
    :WORD-BOUNDARY
    (:GREEDY-REPETITION 1 NIL :WORD-CHAR-CLASS)
    :WORD-BOUNDARY))

(let ((end-of-first-line 17))
  (defun my-filter (pos)
    (format t "Called at: ~A~%" pos)
    (and (< pos end-of-first-line) pos)))

CL-PPCRE 87 > (scan *my-scanner* *my-string*)
Called at: 0
0
5
#()
#()

==> OK, A match is found on first line.

CL-PPCRE 88 > (setf *my-scanner*
                    '(:sequence
                      (:filter my-filter 0)
                      :WORD-BOUNDARY
                      "line2"
                      (:GREEDY-REPETITION 1 NIL :WORD-CHAR-CLASS)
                      :WORD-BOUNDARY))
(:SEQUENCE (:FILTER MY-FILTER) :WORD-BOUNDARY "line2" (:GREEDY-REPETITION 1 NIL :WORD-CHAR-CLASS) :WORD-BOUNDARY)

CL-PPCRE 89 > (scan *my-scanner* *my-string*)
Called at: 0
Called at: 1
Called at: 2
Called at: 3
Called at: 4
Called at: 5
Called at: 6
Called at: 7
Called at: 8
Called at: 9
Called at: 10
Called at: 11
Called at: 12
Called at: 13
Called at: 14
Called at: 15
Called at: 16
Called at: 17
Called at: 18
Called at: 19
Called at: 20
Called at: 21
Called at: 22
Called at: 23
Called at: 24
Called at: 25
Called at: 26
Called at: 27
Called at: 28
Called at: 29
Called at: 30
Called at: 31
Called at: 32
Called at: 33
Called at: 34
Called at: 35
Called at: 36
Called at: 37
Called at: 38
Called at: 39
Called at: 40
Called at: 41
Called at: 42
Called at: 43
Called at: 44
Called at: 45
Called at: 46
Called at: 47
NIL

==> Here is the trouble: how to make the match abort when position 17
is reached. Coz from there, the filter will always return nil. So the
last 30 calls are wasted time.

> > So the question for forcing a full abort immediately and not calling
> > so many times the filter. In fact this is the case for all filters
> > that, once they have returned nil, will return nil forever (and are
> > in a position in the parse tree where they can't be shadowed by some
> > backtracking!).
>
> Are you using DO-SCANS or another loop construct? How about this?
>

No.
I think the loop I'm speaking about is created by "insert-advance-fn"
& "create-scanner-aux" (while not understanding all the details by
now...)

Last point, I can't access the position where the match actually has
started (the first of the four values returned by scan), so I have no
way to extract the current global match without using register.

Cheers,
Sebastien.

From edi at agharta.de Thu Oct 14 21:11:14 2004
From: edi at agharta.de (Edi Weitz)
Date: Thu, 14 Oct 2004 23:11:14 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To: (Sébastien Saint-Sevin's message of "Thu, 14 Oct 2004 20:38:17 +0200")
References:
Message-ID:

On Thu, 14 Oct 2004 20:38:17 +0200, Sébastien Saint-Sevin wrote:

> ==> Here is the trouble: how to make the match abort when position
> 17 is reached. Coz from there, the filter will always return nil. So
> the last 30 calls are wasted time.

Well, this is Common Lisp...

  CL-USER> (defvar *my-string* "line1 word1 word2
  line2 word1 word2
  line3 word1 word2")
  *MY-STRING*
  CL-USER> (defvar *my-scanner*
             '(:sequence
               (:filter my-filter 0)
               :word-boundary
               (:greedy-repetition 1 nil :word-char-class)
               :word-boundary))
  *MY-SCANNER*
  CL-USER> (let ((end-of-first-line 17))
             (defun my-filter (pos)
               (format t "Called at: ~A~%" pos)
               (cond ((< pos end-of-first-line)
                      pos)
                     (t
                      (throw 'stop-it nil)))))
  ; Converted MY-FILTER.
  MY-FILTER
  CL-USER> (catch 'stop-it
             (scan *my-scanner* *my-string*))
  Called at: 0
  0
  5
  #()
  #()
  CL-USER> (setf *my-scanner*
                 '(:sequence
                   (:filter my-filter 0)
                   :word-boundary
                   "line2"
                   (:greedy-repetition 1 nil :word-char-class)
                   :word-boundary))
  (:SEQUENCE (:FILTER MY-FILTER 0) :WORD-BOUNDARY "line2"
   (:GREEDY-REPETITION 1 NIL :WORD-CHAR-CLASS) :WORD-BOUNDARY)
  CL-USER> (catch 'stop-it
             (scan *my-scanner* *my-string*))
  Called at: 0
  Called at: 1
  Called at: 2
  Called at: 3
  Called at: 4
  Called at: 5
  Called at: 6
  Called at: 7
  Called at: 8
  Called at: 9
  Called at: 10
  Called at: 11
  Called at: 12
  Called at: 13
  Called at: 14
  Called at: 15
  Called at: 16
  Called at: 17
  NIL

> I think the loop I'm speaking about is created by "insert-advance-fn"

Yes. It's the normal loop that advances through the regular
expression.

> Last point, I can't access the position where the match actually has
> started (the first of the four values returned by scan), so I have
> no way to extract the current global match without using register.

Sure you can:

  CL-USER> (let (match-start)
             (defun set-match-start (pos)
               (setq match-start pos))
             (defun show-match-start (pos)
               (format t "Match start is ~A, pos is ~A~%" match-start pos)
               pos))
  ; Converted SET-MATCH-START.
  ; Converted SHOW-MATCH-START.
  SHOW-MATCH-START
  CL-USER> (setf *my-scanner* '(:sequence (:filter set-match-start 0)
                                          "abc"
                                          (:filter show-match-start 0)
                                          (:alternation #\x #\y)))
  (:SEQUENCE (:FILTER SET-MATCH-START 0) "abc" (:FILTER SHOW-MATCH-START 0)
   (:ALTERNATION #\x #\y))
  CL-USER> (scan *my-scanner* "abczabcabcx")
  Match start is 0, pos is 3
  Match start is 4, pos is 7
  Match start is 7, pos is 10
  7
  11
  #()
  #()

Just make sure SET-MATCH-START is at the very beginning of your
regular expression and not within a group or alternation or somesuch.

Cheers,
Edi.
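
[Editorial sketch, not part of the original thread.] The two ideas discussed above, a special variable holding the start limit and a THROW out of the filter, can be combined so that the scanner is compiled once and the limit is rebound for each scan. The following is only a minimal sketch under the assumption that CL-PPCRE 0.9.0 or later is loaded; the names *START-LIMIT*, STOP-AT-LIMIT, STOP-SCAN, *WORD2-SCANNER*, *SAMPLE-TEXT* and SCAN-WITH-START-LIMIT are made up for illustration and are not part of CL-PPCRE.

```lisp
;; Sketch only: assumes CL-PPCRE (0.9.0 or later) is already loaded.
;; Every name defined below is hypothetical, chosen for this example.

(defvar *start-limit* 0
  "Exclusive upper bound for the position where a match may start.")

(defun stop-at-limit (pos)
  "Filter function: accept POS below *START-LIMIT*, else abort the scan."
  (if (< pos *start-limit*)
      pos                       ; zero-length filter: return POS unchanged
      (throw 'stop-scan nil)))  ; non-local exit, caught by the wrapper

(defparameter *sample-text*
  (format nil "line1 word1 word2~%line2 word1 word2~%line3 word1 word2"))

;; Compiled once; the limit is not baked into the regular expression.
(defparameter *word2-scanner*
  (cl-ppcre:create-scanner
   '(:sequence (:filter stop-at-limit 0) "word2")))

(defun scan-with-start-limit (scanner string limit)
  "Like CL-PPCRE:SCAN, but returns NIL as soon as an attempt starts at LIMIT."
  (let ((*start-limit* limit))
    (catch 'stop-scan
      (cl-ppcre:scan scanner string))))
```

Calling (scan-with-start-limit *word2-scanner* *sample-text* 17) should find the "word2" on the first line, while a limit of 5 should give NIL because the scan is abandoned before any "word2" is reached. Keeping the filter as the very first element of the sequence matters here, for the same reason given above for SET-MATCH-START.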

From seb-cl-mailist at matchix.com Thu Oct 14 21:34:30 2004
From: seb-cl-mailist at matchix.com (Sébastien Saint-Sevin)
Date: Thu, 14 Oct 2004 23:34:30 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To:
Message-ID:

> > ==> Here is the trouble: how to make the match abort when position
> > 17 is reached. Coz from there, the filter will always return nil. So
> > the last 30 calls are wasted time.
>
> Well, this is Common Lisp...
>
>   CL-USER> (defvar *my-string* "line1 word1 word2
>   line2 word1 word2
>   line3 word1 word2")
>   *MY-STRING*
>   CL-USER> (defvar *my-scanner*
>              '(:sequence
>                (:filter my-filter 0)
>                :word-boundary
>                (:greedy-repetition 1 nil :word-char-class)
>                :word-boundary))
>   *MY-SCANNER*
>   CL-USER> (let ((end-of-first-line 17))
>              (defun my-filter (pos)
>                (format t "Called at: ~A~%" pos)
>                (cond ((< pos end-of-first-line)
>                       pos)
>                      (t
>                       (throw 'stop-it nil)))))
>   ; Converted MY-FILTER.
>   MY-FILTER
>   CL-USER> (catch 'stop-it
>              (scan *my-scanner* *my-string*))
>   Called at: 0
>   0
>   5
>   #()
>   #()
>   CL-USER> (setf *my-scanner*
>                  '(:sequence
>                    (:filter my-filter 0)
>                    :word-boundary
>                    "line2"
>                    (:greedy-repetition 1 nil :word-char-class)
>                    :word-boundary))
>   (:SEQUENCE (:FILTER MY-FILTER 0) :WORD-BOUNDARY "line2"
>    (:GREEDY-REPETITION 1 NIL :WORD-CHAR-CLASS) :WORD-BOUNDARY)
>   CL-USER> (catch 'stop-it
>              (scan *my-scanner* *my-string*))
>   Called at: 0
>   Called at: 1
>   Called at: 2
>   Called at: 3
>   Called at: 4
>   Called at: 5
>   Called at: 6
>   Called at: 7
>   Called at: 8
>   Called at: 9
>   Called at: 10
>   Called at: 11
>   Called at: 12
>   Called at: 13
>   Called at: 14
>   Called at: 15
>   Called at: 16
>   Called at: 17
>   NIL
>

Throw & Catch, of course. I'm just not very familiar with this kind of
big jumps. I should !!!!

> > I think the loop I'm speaking about is created by "insert-advance-fn"
>
> Yes. It's the normal loop that advances through the regular
> expression.
>
> > Last point, I can't access the position where the match actually has
> > started (the first of the four values returned by scan), so I have
> > no way to extract the current global match without using register.
>
> Sure you can:
>
>   CL-USER> (let (match-start)
>              (defun set-match-start (pos)
>                (setq match-start pos))
>              (defun show-match-start (pos)
>                (format t "Match start is ~A, pos is ~A~%"
>                        match-start pos)
>                pos))
>   ; Converted SET-MATCH-START.
>   ; Converted SHOW-MATCH-START.
>   SHOW-MATCH-START
>   CL-USER> (setf *my-scanner* '(:sequence (:filter set-match-start 0)
>                                           "abc"
>                                           (:filter show-match-start 0)
>                                           (:alternation #\x #\y)))
>   (:SEQUENCE (:FILTER SET-MATCH-START 0) "abc" (:FILTER
>   SHOW-MATCH-START 0)
>    (:ALTERNATION #\x #\y))
>   CL-USER> (scan *my-scanner* "abczabcabcx")
>   Match start is 0, pos is 3
>   Match start is 4, pos is 7
>   Match start is 7, pos is 10
>   7
>   11
>   #()
>   #()
>
> Just make sure SET-MATCH-START is at the very beginning of your
> regular expression and not within a group or alternation or somesuch.
>

It just adds a little work to craft the parse tree but that's OK. It
seems that filters are really powerful !!!

I've got everything I need for now. I will try all that & will give
you some feedback when it's done in a few days.

Finally, I just want to thank you very much, Edi, for all your help &
work.

Cheers,
Sebastien.

From edi at agharta.de Thu Oct 14 22:00:35 2004
From: edi at agharta.de (Edi Weitz)
Date: Fri, 15 Oct 2004 00:00:35 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To: (Sébastien Saint-Sevin's message of "Thu, 14 Oct 2004 23:34:30 +0200")
References:
Message-ID:

On Thu, 14 Oct 2004 23:34:30 +0200, Sébastien Saint-Sevin wrote:

> Throw & Catch, of course. I'm just not very familiar with this kind
> of big jumps. I should !!!!

BTW, I don't know if these were just examples or if they're actually
related to your real problem but in this specific case there's
certainly room for improvement:

  CL-USER> (defvar *my-string* "line1 word1 word2
  line2 word1 word2
  line3 word1 word2")
  *MY-STRING*
  CL-USER> (let ((end-of-first-line 17))
             (defun my-filter (pos)
               (format t "Called at: ~A~%" pos)
               (cond ((< pos end-of-first-line)
                      pos)
                     (t
                      (throw 'stop-it nil)))))
  MY-FILTER
  CL-USER> (defvar *my-scanner*
             '(:sequence
               (:filter my-filter 0)
               :word-boundary
               "line2"
               (:greedy-repetition 1 nil :word-char-class)
               :word-boundary))
  *MY-SCANNER*
  CL-USER> (catch 'stop-it
             (scan *my-scanner* *my-string*))
  Called at: 0
  Called at: 1
  Called at: 2
  Called at: 3
  Called at: 4
  Called at: 5
  Called at: 6
  Called at: 7
  Called at: 8
  Called at: 9
  Called at: 10
  Called at: 11
  Called at: 12
  Called at: 13
  Called at: 14
  Called at: 15
  Called at: 16
  Called at: 17
  NIL
  CL-USER> (setf *my-scanner*
                 '(:sequence
                   :multi-line-mode-p
                   :start-anchor
                   (:filter my-filter 0)
                   :word-boundary
                   "line2"
                   (:greedy-repetition 1 nil :word-char-class)
                   :word-boundary))
  (:SEQUENCE :MULTI-LINE-MODE-P :START-ANCHOR (:FILTER MY-FILTER 0)
   :WORD-BOUNDARY "line2" (:GREEDY-REPETITION 1 NIL :WORD-CHAR-CLASS)
   :WORD-BOUNDARY)
  CL-USER> (catch 'stop-it
             (scan *my-scanner* *my-string*))
  Called at: 0
  Called at: 18
  NIL

Not sure if that's relevant, though.

Cheers,
Edi.

From seb-cl-mailist at matchix.com Thu Oct 14 22:24:41 2004
From: seb-cl-mailist at matchix.com (Sébastien Saint-Sevin)
Date: Fri, 15 Oct 2004 00:24:41 +0200
Subject: [cl-ppcre-devel] Buffered multi-line question
In-Reply-To:
Message-ID:

> > Throw & Catch, of course. I'm just not very familiar with this kind
> > of big jumps. I should !!!!
>
> BTW, I don't know if these were just examples or if they're actually
> related to your real problem but in this specific case there's
> certainly room for improvement:
>
>   CL-USER> (defvar *my-string* "line1 word1 word2
>   line2 word1 word2
>   line3 word1 word2")
>   *MY-STRING*
>   CL-USER> (let ((end-of-first-line 17))
>              (defun my-filter (pos)
>                (format t "Called at: ~A~%" pos)
>                (cond ((< pos end-of-first-line)
>                       pos)
>                      (t
>                       (throw 'stop-it nil)))))
>   MY-FILTER
>   CL-USER> (defvar *my-scanner*
>              '(:sequence
>                (:filter my-filter 0)
>                :word-boundary
>                "line2"
>                (:greedy-repetition 1 nil :word-char-class)
>                :word-boundary))
>   *MY-SCANNER*
>   CL-USER> (catch 'stop-it
>              (scan *my-scanner* *my-string*))
>   Called at: 0
>   Called at: 1
>   Called at: 2
>   Called at: 3
>   Called at: 4
>   Called at: 5
>   Called at: 6
>   Called at: 7
>   Called at: 8
>   Called at: 9
>   Called at: 10
>   Called at: 11
>   Called at: 12
>   Called at: 13
>   Called at: 14
>   Called at: 15
>   Called at: 16
>   Called at: 17
>   NIL
>   CL-USER> (setf *my-scanner*
>                  '(:sequence
>                    :multi-line-mode-p
>                    :start-anchor
>                    (:filter my-filter 0)
>                    :word-boundary
>                    "line2"
>                    (:greedy-repetition 1 nil :word-char-class)
>                    :word-boundary))
>   (:SEQUENCE :MULTI-LINE-MODE-P :START-ANCHOR (:FILTER MY-FILTER 0)
>    :WORD-BOUNDARY "line2" (:GREEDY-REPETITION 1 NIL :WORD-CHAR-CLASS)
>    :WORD-BOUNDARY)
>   CL-USER> (catch 'stop-it
>              (scan *my-scanner* *my-string*))
>   Called at: 0
>   Called at: 18
>   NIL
>
> Not sure if that's relevant, though.

It was just an example. I chose "line2" as the first word that wasn't
in line1 so that the match fails. Usually, I use as many anchors as I
can in the regex coz it considerably decreases the amount of
backtracking with quantifiers of all kinds.

Cheers,
Sebastien.
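
[Editorial sketch, not part of the original thread.] For readers who want to see how the pieces of this thread might fit the buffered use case from the first message, here is one possible shape of the outer loop: a window of a few lines slides forward one line at a time, and a leading filter abandons each scan once an attempt would start beyond the first line of the window. This is only a sketch under stated assumptions, namely that CL-PPCRE 0.9.0 or later is loaded and that the input is already available as a list of line strings. The names *FIRST-LINE-END*, FIRST-LINE-ONLY, PAST-FIRST-LINE, JOIN-LINES and SCAN-WINDOWS are hypothetical.

```lisp
;; Sketch only: assumes CL-PPCRE (0.9.0 or later) is already loaded.
;; Every name defined below is hypothetical, chosen for this example.

(defvar *first-line-end* 0
  "Length of the first line of the window currently being scanned.")

(defun first-line-only (pos)
  "Filter function: let an attempt proceed only if it starts in the first line."
  (if (< pos *first-line-end*)
      pos
      (throw 'past-first-line nil)))

(defun join-lines (lines)
  "Concatenate LINES into one string, separated by newlines."
  (format nil "~{~A~^~%~}" lines))

(defun scan-windows (parse-tree lines window-size)
  "Return a list of (LINE-INDEX START END), one entry for each line of LINES
on which a match of PARSE-TREE begins. Each match may extend over at most
WINDOW-SIZE lines; START and END are relative to the window's text."
  (let ((scanner (cl-ppcre:create-scanner   ; compiled once for all windows
                  `(:sequence (:filter first-line-only 0) ,parse-tree)))
        (results '()))
    (loop for rest on lines
          for index from 0
          do (let* ((window (subseq rest 0 (min window-size (length rest))))
                    (text (join-lines window))
                    (*first-line-end* (length (first window))))
               (multiple-value-bind (start end)
                   (catch 'past-first-line
                     (cl-ppcre:scan scanner text))
                 (when start
                   (push (list index start end) results)))))
    (nreverse results)))
```

With the three sample lines used throughout the thread, a call such as (scan-windows "line2" '("line1 word1 word2" "line2 word1 word2" "line3 word1 word2") 2) should report a single hit, on the line with index 1. The sketch rebuilds the window string on every step for clarity; a real buffer over a big file would presumably reuse storage instead.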