From edi at agharta.de Tue Sep 2 08:48:22 2008
From: edi at agharta.de (Edi Weitz)
Date: Tue, 02 Sep 2008 10:48:22 +0200
Subject: [cl-ppcre-devel] New release 2.0.1
Message-ID:

ChangeLog:

Version 2.0.1
2008-09-02
Fixed faulty declaration (caught by Brent Fulgham)

Download:

http://weitz.de/files/cl-ppcre.tar.gz

From akopa.gmane.poster at gmail.com Sun Sep 28 18:02:26 2008
From: akopa.gmane.poster at gmail.com (Matthew D. Swank)
Date: Sun, 28 Sep 2008 13:02:26 -0500
Subject: [cl-ppcre-devel] Matching on very long strings.
Message-ID: <20080928130226.338d58e4@gmail.com>

I wrote a small character-based regex engine specifically to support a
lexer. However, the last thing Lisp needs is another regex package.
cl-ppcre is so nice I would like to adopt it instead. However, since
scanners try to find the first place in a string that provides a match,
it can be impractical to use on very long strings.

Is it possible to create a scanner that only matches at the start index?

Matt

--
"You do not really understand something unless you can explain it to
your grandmother." -- Albert Einstein.

From hans at huebner.org Sun Sep 28 18:19:21 2008
From: hans at huebner.org (Hans Hübner)
Date: Sun, 28 Sep 2008 14:19:21 -0400
Subject: [cl-ppcre-devel] Matching on very long strings.
In-Reply-To: <20080928130226.338d58e4@gmail.com>
References: <20080928130226.338d58e4@gmail.com>
Message-ID:

On Sun, Sep 28, 2008 at 14:02, Matthew D. Swank wrote:
> Is it possible to create a scanner that only matches at the start index?

I might be confused, but isn't '^' what you need?

-Hans

From akopa.gmane.poster at gmail.com Sun Sep 28 19:00:25 2008
From: akopa.gmane.poster at gmail.com (Matthew D. Swank)
Date: Sun, 28 Sep 2008 14:00:25 -0500
Subject: [cl-ppcre-devel] Matching on very long strings.
In-Reply-To:
References: <20080928130226.338d58e4@gmail.com>
Message-ID: <20080928140025.53e13596@gmail.com>

On Sun, 28 Sep 2008 14:19:21 -0400 "Hans Hübner" wrote:

> On Sun, Sep 28, 2008 at 14:02, Matthew D. Swank wrote:
> > Is it possible to create a scanner that only matches at the start
> > index?
>
> I might be confused, but isn't '^' what you need?

Probably, I just hadn't thought of it. I could prepend a '^' to each
choice.

Thanks,
Matt

From akopa.gmane.poster at gmail.com Sun Sep 28 19:15:40 2008
From: akopa.gmane.poster at gmail.com (Matthew D. Swank)
Date: Sun, 28 Sep 2008 14:15:40 -0500
Subject: [cl-ppcre-devel] Matching on very long strings.
In-Reply-To:
References: <20080928130226.338d58e4@gmail.com>
Message-ID: <20080928141540.3f60eceb@gmail.com>

On Sun, 28 Sep 2008 14:19:21 -0400 "Hans Hübner" wrote:

> On Sun, Sep 28, 2008 at 14:02, Matthew D. Swank wrote:
> > Is it possible to create a scanner that only matches at the start
> > index?
>
> I might be confused, but isn't '^' what you need?

I tried using a construct like `(:sequence :start-anchor (:regex ,regex))
where regex is a pcre string, but matching still takes forever (as in I
gave up after 10 min) when slurping a moderately sized file (400k).
Note, matching works fine for files under 1k, or if I break it up into
lines for line oriented input.

Matt

From edi at agharta.de Sun Sep 28 19:31:05 2008
From: edi at agharta.de (Edi Weitz)
Date: Sun, 28 Sep 2008 21:31:05 +0200
Subject: [cl-ppcre-devel] Matching on very long strings.
In-Reply-To: <20080928141540.3f60eceb@gmail.com> (Matthew D.
        Swank's message of "Sun, 28 Sep 2008 14:15:40 -0500")
References: <20080928130226.338d58e4@gmail.com>
        <20080928141540.3f60eceb@gmail.com>
Message-ID:

On Sun, 28 Sep 2008 14:15:40 -0500, "Matthew D. Swank" wrote:

> I tried using a construct like `(:sequence :start-anchor (:regex
> ,regex)) where regex is a pcre string, but matching still takes
> forever (as in I gave up after 10 min) when slurping a moderately sized
> file (400k). Note, matching works fine for files under 1k, or if I
> break it up into lines for line oriented input.

Show us the regex you were using and some test data and then maybe we
can help you to optimize it.

I suppose you read this?

http://weitz.de/cl-ppcre/#blabla

Edi.

From akopa.gmane.poster at gmail.com Sun Sep 28 20:28:07 2008
From: akopa.gmane.poster at gmail.com (Matthew D. Swank)
Date: Sun, 28 Sep 2008 15:28:07 -0500
Subject: [cl-ppcre-devel] Matching on very long strings.
In-Reply-To:
References: <20080928130226.338d58e4@gmail.com>
        <20080928141540.3f60eceb@gmail.com>
Message-ID: <20080928152807.6f375a0c@gmail.com>

On Sun, 28 Sep 2008 21:31:05 +0200 Edi Weitz wrote:

> On Sun, 28 Sep 2008 14:15:40 -0500, "Matthew D. Swank" wrote:
>
> > I tried using a construct like `(:sequence :start-anchor (:regex
> > ,regex)) where regex is a pcre string, but matching still takes
> > forever (as in I gave up after 10 min) when slurping a moderately sized
> > file (400k). Note, matching works fine for files under 1k, or if I
> > break it up into lines for line oriented input.
>
> Show us the regex you were using and some test data and then maybe we
> can help you to optimize it.
>
> I suppose you read this?
>
> http://weitz.de/cl-ppcre/#blabla
>
> Edi.
> _______________________________________________
> cl-ppcre-devel site list
> cl-ppcre-devel at common-lisp.net
> http://common-lisp.net/mailman/listinfo/cl-ppcre-devel

Well the regexes are defined in the lexers in this file:
http://common-lisp.net/~mswank/apache-ppcre.lisp

The lexer api is in this file:
http://common-lisp.net/~mswank/cl-ppcre-lexer.lisp

Finally, the log file I'm lexing:
http://lcpug.asternix.com/pub/Main/ApacheLogProject/access.log

Compare

(with-open-file (in "access.log")
  (let ((foo (stream-gen *apache-pcrelex-line* in)))
    (time (loop :for x := (funcall foo)
                :unless x :return nil))))

with

(with-open-file (in "access.log")
  (let ((foo (stream-gen *apache-pcrelex* in)))
    (time (loop :for x := (funcall foo)
                :unless x :return nil))))

When I slurp the entire file into a string the matches seem to be
taking about a tenth of a second for each token.

Matt

From edi at agharta.de Mon Sep 29 09:27:47 2008
From: edi at agharta.de (Edi Weitz)
Date: Mon, 29 Sep 2008 11:27:47 +0200
Subject: [cl-ppcre-devel] Matching on very long strings.
In-Reply-To: <20080928152807.6f375a0c@gmail.com> (Matthew D. Swank's
        message of "Sun, 28 Sep 2008 15:28:07 -0500")
References: <20080928130226.338d58e4@gmail.com>
        <20080928141540.3f60eceb@gmail.com>
        <20080928152807.6f375a0c@gmail.com>
Message-ID:

On Sun, 28 Sep 2008 15:28:07 -0500, "Matthew D.
Swank" wrote:

> Well the regexes are defined in the lexers in this file:
> http://common-lisp.net/~mswank/apache-ppcre.lisp
>
> The lexer api is in this file:
> http://common-lisp.net/~mswank/cl-ppcre-lexer.lisp
>
> Finally, the log file I'm lexing:
> http://lcpug.asternix.com/pub/Main/ApacheLogProject/access.log
>
> Compare
> (with-open-file (in "access.log")
>   (let ((foo (stream-gen *apache-pcrelex-line* in)))
>     (time (loop :for x := (funcall foo)
>                 :unless x :return nil))))
>
> with
>
> (with-open-file (in "access.log")
>   (let ((foo (stream-gen *apache-pcrelex* in)))
>     (time (loop :for x := (funcall foo)
>                 :unless x :return nil))))
>
> When I slurp the entire file into a string the matches seem to be
> taking about a tenth of a second for each token.

Sorry, I don't have the time to read the entire application right now.
Can you boil this down to a single application of PPCRE:SCAN which is
too slow?

Thanks,
Edi.

From seb-cl-mailist at matchix.com Mon Sep 29 11:16:29 2008
From: seb-cl-mailist at matchix.com (Sébastien Saint-Sevin)
Date: Mon, 29 Sep 2008 13:16:29 +0200
Subject: [cl-ppcre-devel] Matching on very long strings.
In-Reply-To: <20080928152807.6f375a0c@gmail.com>
References: <20080928130226.338d58e4@gmail.com>
        <20080928141540.3f60eceb@gmail.com>
        <20080928152807.6f375a0c@gmail.com>
Message-ID: <48E0B90D.7010204@matchix.com>

Hi Matthew,

You are probably not doing the same thing with the "line oriented
approach" and the "full file in one string" approach.

With the full file in one string, if you are not taking care to stop the
scan at the end of each line (if you want line-by-line scanning, as you
suggest by trying such an approach as well), I guess you are scanning
until the end of the full string for each line (which for sure is very
expensive).

But that's just a guess as I've only had a very quick look at your
code :-)

Cheers,
Sebastien.

Matthew D.
Swank wrote:
> On Sun, 28 Sep 2008 21:31:05 +0200
> Edi Weitz wrote:
>
>> On Sun, 28 Sep 2008 14:15:40 -0500, "Matthew D. Swank"
>> wrote:
>>
>>> I tried using a construct like `(:sequence :start-anchor (:regex
>>> ,regex)) where regex is a pcre string, but matching still takes
>>> forever (as in I gave up after 10 min) when slurping a moderately sized
>>> file (400k). Note, matching works fine for files under 1k, or if I
>>> break it up into lines for line oriented input.
>> Show us the regex you were using and some test data and then maybe we
>> can help you to optimize it.
>>
>> I suppose you read this?
>>
>> http://weitz.de/cl-ppcre/#blabla
>>
>> Edi.
>
> Well the regexes are defined in the lexers in this file:
> http://common-lisp.net/~mswank/apache-ppcre.lisp
>
> The lexer api is in this file:
> http://common-lisp.net/~mswank/cl-ppcre-lexer.lisp
>
> Finally, the log file I'm lexing:
> http://lcpug.asternix.com/pub/Main/ApacheLogProject/access.log
>
> Compare
> (with-open-file (in "access.log")
>   (let ((foo (stream-gen *apache-pcrelex-line* in)))
>     (time (loop :for x := (funcall foo)
>                 :unless x :return nil))))
>
> with
>
> (with-open-file (in "access.log")
>   (let ((foo (stream-gen *apache-pcrelex* in)))
>     (time (loop :for x := (funcall foo)
>                 :unless x :return nil))))
>
> When I slurp the entire file into a string the matches seem to be
> taking about a tenth of a second for each token.
>
>
> Matt
>

From akopa.gmane.poster at gmail.com Mon Sep 29 16:04:09 2008
From: akopa.gmane.poster at gmail.com (Matthew D. Swank)
Date: Mon, 29 Sep 2008 11:04:09 -0500
Subject: [cl-ppcre-devel] Matching on very long strings.
In-Reply-To: <48E0B90D.7010204@matchix.com>
References: <20080928130226.338d58e4@gmail.com>
        <20080928141540.3f60eceb@gmail.com>
        <20080928152807.6f375a0c@gmail.com>
        <48E0B90D.7010204@matchix.com>
Message-ID: <20080929110409.3b51e0cc@gmail.com>

On Mon, 29 Sep 2008 13:16:29 +0200 Sébastien Saint-Sevin wrote:

> Hi Matthew,
>
> You are probably not doing the same thing with the "line oriented
> approach" and the "full file in one string" approach.
>
> With the full file in one string, if you are not taking care to stop
> the scan at the end of each line (if you want line-by-line scanning,
> as you suggest by trying such an approach as well), I guess you are
> scanning until the end of the full string for each line (which for
> sure is very expensive).
>
> But that's just a guess as I've only had a very quick look at your
> code :-)
>
> Cheers,
> Sebastien.

Well, the lexer code is line agnostic; i.e. you could replace 'end of
each line' with any old stop. What it does is adjust the start index as
it matches tokens.

One thing I did notice is that I read the file into an adjustable
vector, and that is the string I pass to the scanners. I suppose ppcre
has to coerce that every time a scanner runs?

Matt

From ctdean at sokitomi.com Mon Sep 29 17:37:00 2008
From: ctdean at sokitomi.com (Chris Dean)
Date: Mon, 29 Sep 2008 10:37:00 -0700
Subject: [cl-ppcre-devel] Matching on very long strings.
In-Reply-To: <20080929110409.3b51e0cc@gmail.com> (Matthew D. Swank's
        message of "Mon, 29 Sep 2008 11:04:09 -0500")
References: <20080928130226.338d58e4@gmail.com>
        <20080928141540.3f60eceb@gmail.com>
        <20080928152807.6f375a0c@gmail.com>
        <48E0B90D.7010204@matchix.com>
        <20080929110409.3b51e0cc@gmail.com>
Message-ID:

"Matthew D. Swank" writes:

> One thing I did notice is that I read the file into an adjustable
> vector, and that is the string I pass to the scanners.
> I suppose ppcre has to coerce that every time a scanner runs?

Yes, scan needs a simple-string. From the SCAN docs:

    target-string will be coerced to a simple string if it isn't one
    already.

Cheers,
Chris Dean

From akopa.gmane.poster at gmail.com Mon Sep 29 22:21:33 2008
From: akopa.gmane.poster at gmail.com (Matthew D. Swank)
Date: Mon, 29 Sep 2008 17:21:33 -0500
Subject: [cl-ppcre-devel] Matching on very long strings.
In-Reply-To:
References: <20080928130226.338d58e4@gmail.com>
        <20080928141540.3f60eceb@gmail.com>
        <20080928152807.6f375a0c@gmail.com>
        <48E0B90D.7010204@matchix.com>
        <20080929110409.3b51e0cc@gmail.com>
Message-ID: <20080929172133.2241509a@gmail.com>

On Mon, 29 Sep 2008 10:37:00 -0700 Chris Dean wrote:

> "Matthew D. Swank" writes:
> > One thing I did notice is that I read the file into an adjustable
> > vector, and that is the string I pass to the scanners. I suppose
> > ppcre has to coerce that every time a scanner runs?
>
> Yes, scan needs a simple-string. From the SCAN docs:
>
>     target-string will be coerced to a simple string if it isn't one
>     already.

Coercing the slurped file to a simple string makes things work
swimmingly.

Thanks,
Matt
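
The fix this thread arrives at can be sketched roughly as follows: coerce
the target to a simple string once, then run a start-anchored scanner
from a moving :start index. This is only an illustration, not the code
from cl-ppcre-lexer.lisp. The names make-adjustable-copy,
*word-at-start* and tokenize-words, and the \w+ token pattern, are
hypothetical and made up for the example. It assumes CL-PPCRE is already
loaded, and that the start anchor is tested against the :start position
passed to SCAN.

```lisp
;; Assumption: CL-PPCRE is already loaded, e.g. via (ql:quickload :cl-ppcre).

(defun make-adjustable-copy (string)
  "Return an adjustable, fill-pointered copy of STRING.
Hypothetical helper: it stands in for a file slurped into an
adjustable vector, which is not a SIMPLE-STRING."
  (make-array (length string)
              :element-type 'character
              :initial-contents string
              :adjustable t
              :fill-pointer t))

;; Hypothetical token pattern. The :START-ANCHOR keeps the scanner from
;; searching forward through the rest of the string for a later match.
(defparameter *word-at-start*
  (cl-ppcre:create-scanner '(:sequence :start-anchor (:regex "\\w+"))))

(defun tokenize-words (text)
  "Collect the runs of word characters in TEXT, in order."
  (let ((simple (coerce text 'simple-string)) ; one copy, not one per SCAN call
        (tokens '())
        (pos 0))
    (loop
      (multiple-value-bind (start end)
          (cl-ppcre:scan *word-at-start* simple :start pos)
        (cond (start
               ;; A token begins exactly at POS; keep it and move past it.
               (push (subseq simple start end) tokens)
               (setf pos end))
              ((< pos (length simple))
               ;; No token here: skip one non-word character.
               (incf pos))
              (t
               (return (nreverse tokens))))))))
```

A call such as (tokenize-words (make-adjustable-copy "GET /index.html 200"))
exercises the slow case from the thread: the input is an adjustable
vector, but it is copied only once rather than on every call to SCAN.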