[cl-ppcre-devel] Need help with a slow regexp
Edi Weitz
edi at agharta.de
Thu Jan 20 20:51:05 UTC 2005
Hi!
On Thu, 20 Jan 2005 15:16:52 -0500, pete-cl-ppcre at kazmier.com wrote:
> Not sure if this is the appropriate forum as the email is not
> related to the development of cl-ppcre, but I did not find a list
> for users. Please feel free to redirect me elsewhere.
It's fine to ask this kind of question here.
> I could use some help in figuring out why this regexp is so slow.
> As far as I can tell, there is nothing abnormal about it. I
> currently use the same regexp in Python and it blazes through the
> input file. Bear in mind, this is the first time that I've used
> cl-ppcre. It was an experiment to see if I could use Lisp for this
> little application.
>
> Here is the regexp (at least a small portion of it that exhibits the
> behavior I am seeing):
>
> ^(?:\\S+ ){7}(\\S+)\\s+- commAlarm
>
> Here is the input line it is matching against (note: this is a single
> line albeit a long one):
>
> 1105243660 11 Sun Jan 09 04:07:40 2005 sclax02.ibasis.net - commAlarm ovnyc00p.ov.i\vanet.net [1] private.enterprises.2496.1.1.5.5.1.0 (Integer): 0 [2] private.enterprises.\2496.1.1.5.5.2.0 (Integer): 115 [3] private.enterprises.2496.1.1.5.5.3.0 (OctetString): \ISUP: UNEX ANM [4] private.enterprises.2496.1.1.5.5.4.0 (OctetString): ISDN User Part Un\expected ANM [5] private.enterprises.2496.1.1.5.5.5.0 (Integer): 2 [6] private.enterpri\ses.2496.1.1.5.5.6.0 (Integer): 1 [7] private.enterprises.2496.1.1.5.5.7.0 (Integer): 1 \ [8] private.enterprises.2496.1.1.5.5.8.0 (Integer): 2 [9] private.enterprises.2496.1.1.\1.1.1.1.1.1.1.1376258 (Integer): 1376258 [10] private.enterprises.2496.1.1.1.1.1.1.1.1.2\.1376258 (Integer): 21 [11] private.enterprises.2496.1.1.1.1.1.1.1.1.4.1376258 (OctetStr\ing): ss7path-att [12] private.enterprises.2496.1.1.1.1.1.1.1.1.5.1376258 (OctetString):\ SS7 Path For ATT and NGT DPC 5.21.39 [13] private.enterprises.2496.1.1.1.1.1.1.1.1.3.13\76258 (Integer): 1245188 [14] private.enterprises.2496.1.1.5.5.9.0 (Integer): 1105243880\;1 .1.3.6.1.4.1.2496.1.1.4.1 0
>
> Stuff 51 of those lines above into a file and try to match on
> that regexp and I get the following results:
>
> PGW> (time (parse-file "/tmp/sample"))
> Evaluation took:
> 2.984 seconds of real time
> 1.81 seconds of user run time
> 1.12 seconds of system run time
> 0 page faults and
> 228,191,424 bytes consed.
That's much too slow and much too much consing. FWIW, here's what I
get with SBCL 0.8.16 and 50 lines like the one from above:
* (time (parse-file "/tmp/sample"))
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Evaluation took:
0.148 seconds of real time
0.1 seconds of user run time
0.01 seconds of system run time
0 page faults and
506,416 bytes consed.
T
> My platform is SBCL 0.8.18.23 and version 1.0 of cl-ppcre.
My wild guess is that this is due to your version of SBCL being from the
new Unicode branch which I haven't tried yet. If you don't need full
Unicode support then maybe you should switch it off. Or better,
report this to the SBCL maintainers (if my guess is right). Also, see
the note about simple strings in the CL-PPCRE docs.
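On the simple-string point, here is a minimal sketch in plain Common Lisp of how one might make sure a line is a SIMPLE-STRING before handing it to a scanner. ENSURE-SIMPLE-STRING is a hypothetical helper name, not part of CL-PPCRE, and whether a given string counts as simple can depend on the implementation, so treat this as an illustration rather than a recipe:

```lisp
;; Hypothetical helper (not part of CL-PPCRE): return LINE itself if it
;; is already a SIMPLE-STRING, otherwise a fresh simple copy of it.
(defun ensure-simple-string (line)
  (if (simple-string-p line)
      line
      (coerce line 'simple-string)))

;; A string created as adjustable and with a fill pointer. On most
;; implementations this is a STRING but not a SIMPLE-STRING.
(defparameter *adjustable-line*
  (make-array 5 :element-type 'character
                :initial-contents "hello"
                :adjustable t
                :fill-pointer t))

;; The helper's result always satisfies SIMPLE-STRING-P and keeps
;; the same characters.
(defparameter *simple-line* (ensure-simple-string *adjustable-line*))
```

The copy made by COERCE conses, of course, so this only pays off if the same string is scanned more than once or if the lines are not simple to begin with.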
To show you that CL-PPCRE is not necessarily slow with full Unicode
support, here's the output from AllegroCL 7.0:
CL-USER(5): (time (parse-file "/tmp/sample"))
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
Found host sclax02.ibasis.net
; cpu time (non-gc) 100 msec user, 30 msec system
; cpu time (gc) 110 msec user, 0 msec system
; cpu time (total) 210 msec user, 30 msec system
; real time 294 msec
; space allocation:
; 15,326 cons cells, 13,453,192 other bytes, 0 static bytes
T
> I am hoping to parse a file that has close to 75,000 lines in that
> format. At this rate, I will never make it in a reasonable amount
> of time.
Here's the output from CMUCL 19a for 100,000 lines like above (with
the FORMAT form in your function removed and Linux running within
VMWare on my laptop):
* (time (parse-file "/tmp/sample"))
; Compiling LAMBDA NIL:
; Compiling Top-Level Form:
; Evaluation took:
; 20.27 seconds of real time
; 3.01 seconds of user run time
; 17.25 seconds of system run time
; 40,541,141,192 CPU cycles
; [Run times include 1.05 seconds GC run time]
; 0 page faults and
; 938,959,528 bytes consed.
;
T
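For reference, a function of roughly the following shape is the kind of thing these timings exercise. This is only a sketch: the original PARSE-FILE was not posted in this message, so *HOST-SCANNER*, SCAN-HOSTS and the :REPORT keyword are assumptions made for illustration. It also assumes CL-PPCRE is already loaded, and it builds the scanner once instead of once per line:

```lisp
;; Assumes CL-PPCRE has been loaded before this file is read.
;; The scanner is created once and reused for every line.
(defparameter *host-scanner*
  (cl-ppcre:create-scanner "^(?:\\S+ ){7}(\\S+)\\s+- commAlarm"))

;; Hypothetical helper: read lines from STREAM and return the number
;; of lines that match. With :REPORT true, print each captured host.
(defun scan-hosts (stream &key (report t))
  (loop for line = (read-line stream nil)
        while line
        count (cl-ppcre:register-groups-bind (host) (*host-scanner* line)
                (when report
                  (format t "Found host ~A~%" host))
                t)))

;; Sketch of PARSE-FILE; the real one may well differ. Returns T like
;; the transcripts above. Pass :REPORT NIL to drop the FORMAT output.
(defun parse-file (path &key (report t))
  (with-open-file (in path)
    (scan-hosts in :report report))
  t)
```

Calling (parse-file "/tmp/sample" :report nil) corresponds to the run with the FORMAT form removed.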
If that doesn't help, let me know.
Cheers,
Edi.