[cl-ppcre-devel] Buffered multi-line question

Sébastien Saint-Sevin seb-cl-mailist at matchix.com
Mon Oct 11 19:35:41 UTC 2004


> Hi Sébastien!
>
> On Mon, 11 Oct 2004 18:52:56 +0200, Sébastien Saint-Sevin
> <seb-cl-mailist at matchix.com> wrote:
>
> > I'm doing multi-lines regex searches over big files that can't be
> > converted to single string.  So I introduced a kind of buffer that
> > I'm using to search.
> >
> > Now, I need to add a constraint to scan, do-scans & others (in
> > addition to (&key start end)) : I want to be able to specify to the
> > engine that a scan must start before a certain index in the string
> > (to avoid searching further results that will be cancelled later
> > because of my buffered multi-line matching process).
> >
> > Logically, this :must-start-before value corresponds to the first
> > line of my buffer. If nothing starts at first line, I need to move
> > the search one line forward, so everything that the engine would
> > match later on in the string is wasted time.
> >
> > How can I do it ?
>
> Have you considered using something like
>
>   (?s:(?=.{n}))<actual-regular-expression>
>
> where n obviously is an integer computed from your constraints above?
> I don't know how this'll behave performance-wise but you could just
> try it... :)
>
> Or have I misunderstood your question? Actually, I'm not sure why the
> END keyword parameter doesn't suffice. Can you give an example?
>

As far as I understand it, (?s:(?=.{n})) will only guarantee that at least n
chars remain after match-start in the consumed string. This is not what I
want. I want something that guarantees that match-start will be before
index n (meaning the n'th char in the consumed string), whether match-end is
before or after this index n.
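
The two conditions can be related when the scan is bounded by :end. At
position p there are END - p chars left, so "at least k chars remain" is the
same as "p < LIMIT" when k = END - LIMIT + 1. Below is a minimal sketch of
that idea, not a tested solution. MUST-START-BEFORE-REGEX is a hypothetical
helper name (not part of CL-PPCRE), and the sketch assumes that the
look-ahead cannot see past the :end bound and that the regex is not written
in extended mode.

```lisp
;; Hypothetical helper, not part of CL-PPCRE.
;; LIMIT is the must-start-before index, END the :end bound of the scan.
(defun must-start-before-regex (regex-string limit end)
  "Prefix REGEX-STRING with a look-ahead that only succeeds at positions
before LIMIT, assuming the scan itself is bounded by END."
  (let ((k (max 0 (1+ (- end limit)))))
    ;; at position p, (?=.{k}) needs END - p >= k, i.e. p < LIMIT
    (format nil "(?s:(?=.{~D}))(?:~A)" k regex-string)))

;; Intended use (assumes CL-PPCRE is loaded and BUFFER holds the text):
;; (cl-ppcre:scan (must-start-before-regex "foo\\nbar" 80 (length buffer))
;;                buffer :end (length buffer))
```

Note that this probably does not stop the engine from trying start positions
at or after LIMIT; each of them should just fail at the look-ahead. Whether
that is cheap enough for big buffers has not been measured.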


> > PS: Edi, if you are back, my previous post is still an open question
> > ;-) (the one with FILTER...)
>
> Yes, I'm back but unfortunately I'm very busy with commercial stuff
> right now. Sorry, filters will have to wait some more.
>
> Cheers,
> Edi.

Here is what I've got right now (it's ok for my needs actually).

(defclass filter (regex)
   ((num :initarg :num
         :accessor num
         :type fixnum
         :documentation "The number of the register this filter refers to.")
    (predicate :initarg :predicate
         :accessor predicate
         :documentation "The predicate to validate the register with"))
   (:documentation "FILTER objects represent the combination of a register
and a predicate.  This is not available in regex strings, but only
used in parse trees."))


(defmethod create-matcher-aux ((filter filter) next-fn)
   (declare (type function next-fn))
   ;; the position of the corresponding REGISTER within the whole
   ;; regex; we start to count at 0
   (let ((num (num filter)))
      (lambda (start-pos)
        (declare (type fixnum start-pos))
        (let ((reg-start (svref *reg-starts* num))
              (reg-end (svref *reg-ends* num)))
          ;; only bother to check if the corresponding REGISTER has
          ;; matched successfully already
          (and reg-start
             (funcall (predicate filter)
                      (subseq *string* reg-start reg-end))
             (funcall next-fn start-pos))))))


ADDED TO (defun convert-aux (parse-tree) ...

   ;; (:FILTER <number> <predicate>)
   ((:filter)
      (let ((backref-number (second parse-tree))
            (predicate (third parse-tree)))
         (declare (type fixnum backref-number))
         (when (or (not (typep backref-number 'fixnum))
               (<= backref-number 0))
            (signal-ppcre-syntax-error
               "Illegal back-reference: ~S"
               parse-tree))
         (unless (or (typep predicate 'symbol) (typep predicate 'function))
            (signal-ppcre-syntax-error
               "Illegal predicate: ~S"
               parse-tree))
         ;; stop accumulating into STARTS-WITH and increase
         ;; MAX-BACK-REF if necessary
         (setq accumulate-start-p nil
            max-back-ref (max (the fixnum max-back-ref)
               backref-number))
         (make-instance 'filter
            ;; we start counting from 0 internally
            :num (1- backref-number)
            :predicate predicate)))


ADDED FOR MY PURPOSES...

(defmethod create-scanner-with-predicate
   ((regex-string string) predicate &key
      case-insensitive-mode
      multi-line-mode
      single-line-mode
      extended-mode
      destructive)
   (declare (optimize speed (safety 0) (space 0) (debug 0)
                      (compilation-speed 0)
                      #+:lispworks (hcl:fixnum-safety 0)))
   (declare (ignore destructive))
   ;; parse the string into a parse-tree and then call CREATE-SCANNER again
   (let* ((*extended-mode-p* extended-mode)
         (quoted-regex-string (if *allow-quoting*
               (quote-sections (clean-comments regex-string extended-mode))
               regex-string))
         (*syntax-error-string* (copy-seq quoted-regex-string))
         (parse-tree (parse-string quoted-regex-string)))
      ;; wrap the result with FILTER to check for predicate
      (create-scanner
         `(:sequence (:register ,(shift-back-reference parse-tree))
                     (:filter 1 ,predicate))
         :case-insensitive-mode case-insensitive-mode
         :multi-line-mode multi-line-mode
         :single-line-mode single-line-mode
         :destructive t)))

(defun shift-back-reference (tree)
   (if (and (consp tree) (eq (first tree) :back-reference))
      `(:back-reference ,(1+ (second tree)))
      (if (atom tree)
         tree
         (cons (shift-back-reference (car tree))
               (shift-back-reference (cdr tree))))))
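
The shift is needed because wrapping the whole parse tree in a new outer
:register makes that wrapper register 1, so every register of the original
regex moves up by one and its back-references must follow. A small
illustration; the definition is repeated from above only so that the snippet
runs on its own, and the sample parse tree is made up.

```lisp
;; Same definition as above, repeated so the snippet is self-contained.
(defun shift-back-reference (tree)
  (if (and (consp tree) (eq (first tree) :back-reference))
      `(:back-reference ,(1+ (second tree)))
      (if (atom tree)
          tree
          (cons (shift-back-reference (car tree))
                (shift-back-reference (cdr tree))))))

;; Parse tree for "(a)\\1": register 1 followed by a reference to it.
(shift-back-reference '(:sequence (:register "a") (:back-reference 1)))
;; => (:SEQUENCE (:REGISTER "a") (:BACK-REFERENCE 2))
```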
