From airboss at nodewarrior.org Wed Apr 26 20:46:35 2006
From: airboss at nodewarrior.org (Dan Debertin)
Date: Wed, 26 Apr 2006 15:46:35 -0500
Subject: [cl-ppcre-devel] Detecting "partial" matches
Message-ID: <21059.1146084395@nodewarrior.org>

Hi,

I'm using CL-PPCRE to develop a character-at-a-time lexer. This is
causing me some perplexity, though, with regexes like the common
notation for hexadecimal literals:

"^0(?:x[0-9A-Fa-f]+)?$"

This should match both the string "0", between positions 0 and 1, as
just a bare literal zero, and should also match things like "0xa6"
between positions 0 and 4, but should not match simply "0x".

But I want the longest match possible, so (for example) I'd like to
know that while "0x" didn't match, parts of the regex *did* match and
might produce a "real" match depending on what comes after "x".

So, in succession, if the input is "0xa6 ", my scanner gets called
thus:

1. Input: "0".
   a) A match.
   b) But it *could* possibly match more, depending on what comes next.
2. Input: "0x".
   a) Not a match.
   b) But, once again, the possibility exists that more input could
      still produce a longer match than "0".
3. Input: "0xa".
   a) A match.
   b) Because of the "+" attached to the character class, a longer
      match is still possible.
4. Input: "0xa6".
   a) A match.
   b) As above.
5. Input: "0xa6 ".
   a) Not a match.
   b) Will *never* match no matter how much more input you add to it.

CL-PPCRE just tells me a), and I also want to know b). Is there any
way to get this information (if it even exists) out of the scanner?

TIA,
-Dan

--
Dan Debertin | airboss at nodewarrior.org | www.nodewarrior.org |

From edi at agharta.de Wed Apr 26 21:26:12 2006
From: edi at agharta.de (Edi Weitz)
Date: Wed, 26 Apr 2006 23:26:12 +0200
Subject: [cl-ppcre-devel] Detecting "partial" matches
In-Reply-To: <21059.1146084395@nodewarrior.org> (Dan Debertin's message of "Wed, 26 Apr 2006 15:46:35 -0500")
References: <21059.1146084395@nodewarrior.org>
Message-ID:

On Wed, 26 Apr 2006 15:46:35 -0500, Dan Debertin wrote:

> CL-PPCRE just tells me a), and I also want to know b). Is there any
> way to get this information (if it even exists) out of the scanner?

If I understand you correctly, this information isn't available. The
regex engine doesn't make any plans about what might have happened if
it had scanned another target string instead of the current one, so to
say... :)

You might want to look at filters, though:

But generally, your problem to me looks as if regular expressions
aren't the right way to tackle it.

Cheers,
Edi.

From daniel.caune at ubisoft.com Thu Apr 27 23:03:25 2006
From: daniel.caune at ubisoft.com (Daniel Caune)
Date: Thu, 27 Apr 2006 19:03:25 -0400
Subject: [cl-ppcre-devel] Matching problem while using \s (?)
Message-ID: <1E293D3FF63A3740B10AD5AAD88535D2021A2E4A@UBIMAIL1.ubisoft.org>

Hi,

I'm facing a strange result while trying to scan a string:

CL-USER> (defconstant +iso-8601-regex+
"([0-9]+)-([0-9]+)-([0-9]+)[\sT]*([0-9]+):([0-9]+):([0-9]+)[+-]([0-9]+)"
)
+ISO-8601-REGEX+

CL-USER> (cl-ppcre:scan-to-strings +iso-8601-regex+ "2006-04-13 00:00:00+00")
NIL

CL-USER> (cl-ppcre:scan-to-strings +iso-8601-regex+ "2006-04-13T00:00:00+00")
"2006-04-13T00:00:00+00"
#("2006" "04" "13" "00" "00" "00" "00")

For some reasons, \s seems not to match a whitespace character, unless
the space within "2006-04-13 00:00:00+00" is not the space cl-ppcre
expects (encoding on Linux?! Hmm... a space is more likely to be encoded
with 0x20, isn't it?!).

I tried to match both "2006-04-13 00:00:00+00" and
"2006-04-13T00:00:00+00" with the Regex Coach (on Windows), and it works
perfectly with that same regex!

Any idea? What can I check?

Regards,

--
Daniel CAUNE
Ubisoft Online Technology
(514) 4090 2040 ext. 5418

From emailmac at gmail.com Fri Apr 28 02:00:05 2006
From: emailmac at gmail.com (Mac Chan)
Date: Thu, 27 Apr 2006 19:00:05 -0700
Subject: [cl-ppcre-devel] Matching problem while using \s (?)
In-Reply-To: <1E293D3FF63A3740B10AD5AAD88535D2021A2E4A@UBIMAIL1.ubisoft.org>
References: <1E293D3FF63A3740B10AD5AAD88535D2021A2E4A@UBIMAIL1.ubisoft.org>
Message-ID: <4877ae640604271900x583dd13es7e97d767dba7bfa@mail.gmail.com>

you need to double escape \\s, since the lisp reader will consume one
and cl-ppcre:scan-to-strings will only see s.

On 4/27/06, Daniel Caune wrote:
> Hi,
>
> I'm facing a strange result while trying to scan a string:
>
> CL-USER> (defconstant +iso-8601-regex+
> "([0-9]+)-([0-9]+)-([0-9]+)[\sT]*([0-9]+):([0-9]+):([0-9]+)[+-]([0-9]+)"
> )
> +ISO-8601-REGEX+
>
> CL-USER> (cl-ppcre:scan-to-strings +iso-8601-regex+ "2006-04-13 00:00:00+00")
> NIL
>
> CL-USER> (cl-ppcre:scan-to-strings +iso-8601-regex+ "2006-04-13T00:00:00+00")
> "2006-04-13T00:00:00+00"
> #("2006" "04" "13" "00" "00" "00" "00")
>
>
> For some reasons, \s seems not to match a whitespace character, unless
> the space within "2006-04-13 00:00:00+00" is not the space cl-ppcre
> expects (encoding on Linux?! Hmm... a space is more likely to be encoded
> with 0x20, isn't it?!).
>
> I tried to match both "2006-04-13 00:00:00+00" and
> "2006-04-13T00:00:00+00" with the Regex Coach (on Windows), and it works
> perfectly with that same regex!
>
> Any idea? What can I check?
>
> Regards,
>
>
> --
> Daniel CAUNE
> Ubisoft Online Technology
> (514) 4090 2040 ext. 5418
>
> _______________________________________________
> cl-ppcre-devel site list
> cl-ppcre-devel at common-lisp.net
> http://common-lisp.net/mailman/listinfo/cl-ppcre-devel
>
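
The double-escaping point in the reply above can be checked at the REPL.
What follows is only a minimal sketch: it assumes Quicklisp is available
for loading CL-PPCRE, and the variable names *unescaped-regex* and
*escaped-regex* are made up for illustration. It shows that the Lisp
reader drops a single backslash inside a string literal, so the character
class arrives at CL-PPCRE as [sT], while the doubled backslash arrives as
[\sT].

```lisp
;; Assumption: Quicklisp is installed; any other way of loading CL-PPCRE works too.
(ql:quickload :cl-ppcre)

;; Hypothetical names, for illustration only.
;; Inside a string literal the reader treats \ as an escape character,
;; so "\s" is read as the single character s.
(defparameter *unescaped-regex*
  "([0-9]+)-([0-9]+)-([0-9]+)[\sT]*([0-9]+):([0-9]+):([0-9]+)[+-]([0-9]+)")

;; "\\s" is read as the two characters \ and s, which CL-PPCRE parses as \s.
(defparameter *escaped-regex*
  "([0-9]+)-([0-9]+)-([0-9]+)[\\sT]*([0-9]+):([0-9]+):([0-9]+)[+-]([0-9]+)")

;; Print the strings the regex engine actually receives.
(write-line *unescaped-regex*)
(write-line *escaped-regex*)

;; The two strings differ by exactly one character: the backslash.
(print (- (length *escaped-regex*) (length *unescaped-regex*)))

;; Space-separated timestamp: fails without the backslash, matches with it.
(print (cl-ppcre:scan-to-strings *unescaped-regex* "2006-04-13 00:00:00+00"))
(print (cl-ppcre:scan-to-strings *escaped-regex* "2006-04-13 00:00:00+00"))
```

The "T"-separated timestamp matches with either string, because "T" is in
the character class both times; that is why only the space-separated
input exposed the problem. The Regex Coach takes the regex as typed text
rather than as a Lisp string literal, so no reader escaping happens there.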