[cl-ppcre-devel] Using (unsigned-byte 8) instead of a (string) as TARGET-STRING

Edi Weitz edi at weitz.de
Thu Apr 26 19:40:25 UTC 2012


Phil,

Thanks for your work on this.  I'm afraid I won't be able to help as
I'm too busy with my new job and don't expect this to change soon.
Maybe someone else will.

I also have to admit I'm reluctant to add this to the main CL-PPCRE
code for fear of a maintenance nightmare.  CL-PPCRE already is far
from clean and clear and I'd rather not add more stuff.  Unless
someone else wants to take over maintenance completely, of course.

Cheers,
Edi.


On Thu, Apr 26, 2012 at 9:29 PM, Philipp Marek <philipp at marek.priv.at> wrote:
> Hello everybody,
>
> I've got a first (unoptimized) patch that allows using an (unsigned-byte 8)
> vector instead of a string as TARGET-STRING. My motivation is searching in
> big binary files, which might not fit into RAM after byte => character
> conversion (that would be 1:4, as the binaries would have to be read as
> latin1 or similar).
>
>
> Using that patch on a ~3MB file, with the match located at about 80% of the
> file size, shows a nice speedup: only a half to a third of the CPU time, and
> much less memory usage (for the string).
>
>
> Details:
> -rw-r--r-- 1 root root 3176746 Mai  3  2010 /usr/share/doc/gcc-4.4-doc/gccint.html
>
> string, case-sensitive:
>  0.360022 seconds of total run time (0.360022 user, 0.000000 system)
>  907,556,608 processor cycles
> string, case-insensitive:
>  0.492031 seconds of total run time (0.492031 user, 0.000000 system)
>  1,239,853,076 processor cycles
>
> (unsigned-byte 8), case-sensitive:
>  0.108006 seconds of total run time (0.108006 user, 0.000000 system)
>  274,043,748 processor cycles
> (unsigned-byte 8), case-insensitive:
>  0.220013 seconds of total run time (0.220013 user, 0.000000 system)
>  553,027,836 processor cycles
>
>
> The small "problem" is this (one long line):
>
>  $ time perl -e '$/=undef; $_=<>; print $1,$2,"\n" if
>    /acr([o0] i)s not defined,\s+the default(\Dvalue,\s*\d+, i)s used/'
>    < /usr/share/doc/gcc-4.4-doc/gccint.html
>  o i value, 1, i
>  real    0m0.016s
>  user    0m0.004s
>  sys     0m0.008s
>
>  $ perl -v
>  This is perl 5, version 14, subversion 2 (v5.14.2)
>  built for x86_64-linux-gnu-thread-multi
>
> i.e. (this) perl5 is still ~8 times faster, even including file reading etc.
> (which the Lisp measurements above did not include).
> (With the /i modifier it's 0.020s.)
>
>
> I've not yet tried to run the whole test suite against that. There are quite
> a few warnings (unused variable UB8-MODE etc.) - but with higher SAFETY the
> original CL-PPCRE gave a lot of them, too.
>
>
> I'd like to ask for a quick look at the patch to get some feedback. With the
> many duplications I don't really like the result, but the duplicated accesses
> ("schar" etc.) are too deeply integrated in cl-ppcre; I couldn't easily
> factor them out into a single macro or something like that.
>
>
> Cyrus, Edi, could you help me clean up the changes so that they
> could be taken upstream?
>
>
> The other file is the one I'm using for testing.
>
>
> Regards,
>
> Phil



