[regex-coach] ".+" and ".+?" with optional parenthesized text
John Clements
johnjc-regex at publicinfo.net
Sun Aug 22 13:04:47 UTC 2004
Hello,
This is my first post to this list. I have searched the archives (on
"greedy" and some other terms) but found nothing that seems to relate to
my problem, so I'm writing to see if anyone else on the list has
encountered something like this. It concerns the ".+" construct,
especially its "non-greedy" version ".+?", and in particular its
behaviour when it is followed by an optional parenthesized term.
Below are the pattern, a sample target string, and my comments on the
results I get from Regex Coach. I ran the pattern with "i"
(case-insensitive) checked.
Pattern:
^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}\/ ?\d{2}.+?(between)?
Target string:
An appeal against the judgment delivered on 15 January 2003 by the Second
Chamber (Extended Composition) of the Court of First Instance of the
European Communities in joined cases T-377/00 (1), T-379/00 (2), T-380/00
(2), T-260/01 (3) and T-272/01 (4) between Philip Morris International,
Inc., R.J. Reynolds Tobacco Holdings, Inc., RJR Acquisition Corp., R.J.
Reynolds Tobacco Company, R.J. Reynolds Tobacco International Inc., and
Japan Tobacco, Inc., and Commission of the European Communities, supported
by European Parliament, Kingdom of Spain, French Republic, Italian
Republic, Portuguese Republic, Republic of Finland, Federal Republic of
Germany, Hellenic Republic, Kingdom of the Netherlands, was brought before
the Court of Justice of the European Communities on 25 March 2003 by R.J.
Reynolds Tobacco Holdings, Inc., established in Winston-Salem, North
Carolina (United States), RJR Acquisition Corp., established in Wilmington,
Delaware (United States), R.J. Reynolds Tobacco Company, established in
Winston-Salem, North Carolina (United States), R.J. Reynolds Tobacco
International Inc., established in Winston-Salem, North Carolina (United
States) and Japan Tobacco, Inc., established in Tokyo (Japan), represented
by O.W. Brouwer, lawyer, and P. Lomas, solicitor.
============
What I want it to do is match the string from the beginning through
"between", and when there is no instance of "between", I want it to match
the entire string.
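
To pin down the requirement, here is a minimal sketch in plain Python that
uses no regex at all. The function name expected_match and its marker
parameter are hypothetical, introduced only for illustration, and the
sketch ignores the "An appeal ... case number" prefix that the real
pattern also checks.

```python
def expected_match(text: str, marker: str = "between") -> str:
    """Return the span the post asks for (hypothetical helper).

    If the marker occurs (case-insensitively), return everything from the
    start of the text through the first occurrence of the marker.
    Otherwise return the entire text.
    """
    idx = text.lower().find(marker.lower())
    if idx == -1:
        return text
    return text[: idx + len(marker)]
```

Any pattern that does what I want should return the same span as this
helper for strings that start with the "An appeal" prefix.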
I would expect the example above to give me a match on 0-259 (i.e. through
"between"). But instead I get a match only on 0-189 (through the first case
number). This makes no sense to me whatsoever. I would consider it a bug,
but Regex Coach and Perl v5.8.3 on FreeBSD give me the same results.
^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}\/ ?\d{2}.+(between)?
gives me a match on 0-1279 (the whole string). Why doesn't it stop when it
finds "between"?
^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}\/ ?\d{2}.+?(between)
gives me the match I expect, 0-259.
^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}\/ ?\d{2}.+(between)
also gives me the match I expect, 0-259.
But if I make the "(between)" optional, by putting a "?" after it,
- the regex engine doesn't stop there when the ".+" is greedy, and
- the regex engine doesn't find "between" when the ".+" is non-greedy,
i.e. ".+?"
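
To make the four cases easy to reproduce, here is a minimal sketch in
Python rather than Perl. It assumes that Python's re module handles these
greedy and non-greedy constructs the same way. The target string is a
shortened, made-up stand-in for the one above, and the names text, prefix,
variants and results are only for illustration.

```python
import re

# Shortened, hypothetical stand-in for the target string: one line,
# one occurrence of "between".
text = ("An appeal against the judgment in joined cases T-377/00 (1), "
        "T-379/00 (2) between Philip Morris and the Commission.")

# The part shared by all four patterns, copied from the post.
prefix = r"^\s*An appeal.+?(Joined )?Cases? ?t ?[-] ?\d{1,3}\/ ?\d{2}"

# The four endings compared in the post.
variants = {
    "lazy, optional": prefix + r".+?(between)?",
    "greedy, optional": prefix + r".+(between)?",
    "lazy, required": prefix + r".+?(between)",
    "greedy, required": prefix + r".+(between)",
}

results = {}
for name, pattern in variants.items():
    # re.IGNORECASE corresponds to running with "i" checked.
    m = re.search(pattern, text, re.IGNORECASE)
    # Group 2 is the (between) group; it is None when the group took no part.
    results[name] = (m.end(), m.group(2))
    print(f"{name:17} end={m.end():3} between={m.group(2)!r}")
```

On this shortened string the two "optional" variants leave the (between)
group unset: the lazy one stops one character after the first case number,
and the greedy one runs to the end of the line. The two "required" variants
both stop at the end of "between".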
Can anyone enlighten me?
Many thanks, John Clements