[regex-coach] ".+" and ".+?" with optional parenthesized text

John Clements johnjc-regex at publicinfo.net
Sun Aug 22 13:04:47 UTC 2004


Hello,

This is my first post to this list. I have searched the archives (on 
"greedy" and some other terms) but didn't find anything that seems to 
relate to my problem, so I'm writing to see if anyone else on the list has 
encountered something like this. There is something I don't understand 
about the "." operator, especially the "non-greedy" version of it, and in 
particular its behaviour when it is followed by a parenthesized term which 
is optional.

Below are the pattern and a sample target string, followed by my comments 
on the results I get from Regex Coach. I ran the pattern with "i" checked.

Pattern:
^\s*An appeal.+?(Joined )?Cases? ?t ?[-­] ?\d{1,3}\/ ?\d{2}.+?(between)?

Target string:
An appeal against the judgment delivered on 15 January 2003 by the Second 
Chamber (Extended Composition) of the Court of First Instance of the 
European Communities in joined cases T-377/00 (1), T-379/00 (2), T-380/00 
(2), T-260/01 (3) and T-272/01 (4) between Philip Morris International, 
Inc., R.J. Reynolds Tobacco Holdings, Inc., RJR Acquisition Corp., R.J. 
Reynolds Tobacco Company, R.J. Reynolds Tobacco International Inc., and 
Japan Tobacco, Inc., and Commission of the European Communities, supported 
by European Parliament, Kingdom of Spain, French Republic, Italian 
Republic, Portuguese Republic, Republic of Finland, Federal Republic of 
Germany, Hellenic Republic, Kingdom of the Netherlands, was brought before 
the Court of Justice of the European Communities on 25 March 2003 by R.J. 
Reynolds Tobacco Holdings, Inc., established in Winston-Salem, North 
Carolina (United States), RJR Acquisition Corp., established in Wilmington, 
Delaware (United States), R.J. Reynolds Tobacco Company, established in 
Winston-Salem, North Carolina (United States), R.J. Reynolds Tobacco 
International Inc., established in Winston-Salem, North Carolina (United 
States) and Japan Tobacco, Inc., established in Tokyo (Japan), represented 
by O.W. Brouwer, lawyer, and P. Lomas, solicitor.
============

What I want it to do is match the string from the beginning through 
"between", and when there is no instance of "between", I want it to match 
the entire string.

I would expect the example above to give me a match on 0-259 (i.e. through 
"between"). But instead I get a match only on 0-189 (through the first case 
number). This makes no sense to me whatsoever. I would consider it a bug, 
but Regex Coach and Perl v5.8.3 on FreeBSD give me the same results.

^\s*An appeal.+?(Joined )?Cases? ?t ?[-­] ?\d{1,3}\/ ?\d{2}.+(between)?
gives me a match on 0-1279 (the whole string). Why doesn't it stop when it 
finds "between"?

^\s*An appeal.+?(Joined )?Cases? ?t ?[-­] ?\d{1,3}\/ ?\d{2}.+?(between)
gives me the match I expect, 0-259.

^\s*An appeal.+?(Joined )?Cases? ?t ?[-­] ?\d{1,3}\/ ?\d{2}.+(between)
also gives me the match I expect, 0-259.

But if I make the "(between)" optional, by putting a "?" after it,
  - the regex engine doesn't stop there when the ".+" is greedy, and
  - the regex engine doesn't find "between" when the ".+" is non-greedy, 
i.e. ".+?"
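
The same behaviour seems to show up on a much shorter string. Here is a 
minimal sketch in Python, assuming its re module treats greedy and 
non-greedy quantifiers the way Perl does for these patterns. The shortened 
target string and the names TEXT, PREFIX, VARIANTS and match_end are made 
up for illustration, and the character class is reduced to a plain hyphen.

```python
import re

# Shortened stand-in for the target string (hypothetical, for illustration).
TEXT = "An appeal in joined cases T-377/00 (1) between A and B."

# Part shared by all four patterns; the character class holds only "-".
PREFIX = r"^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}\/ ?\d{2}"

# The four endings compared in this post.
VARIANTS = {
    "lazy, optional": PREFIX + r".+?(between)?",
    "greedy, optional": PREFIX + r".+(between)?",
    "lazy, required": PREFIX + r".+?(between)",
    "greedy, required": PREFIX + r".+(between)",
}


def match_end(pattern, text=TEXT):
    """Return the end offset of the match, or None if nothing matches."""
    m = re.search(pattern, text, re.IGNORECASE)  # same as checking "i"
    return m.end() if m else None


if __name__ == "__main__":
    for name, pattern in VARIANTS.items():
        print(f"{name:18} ends at {match_end(pattern)} of {len(TEXT)}")
```

On this sample the two patterns with a required "(between)" end right 
after "between", the non-greedy optional one ends one character past the 
case number, and the greedy optional one runs to the end of the string.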

Can anyone enlighten me?

Many thanks, John Clements




