Regular Expression Backreferences and the Non-Greedy Modifier

Update: James Allen caught a formatting bug. It seems WordPress doesn't like my coloring and, when it's present, swaps the double quotes for "smart quotes". I've removed the coloring, and it seems to be fine again.

Someone posted a question on CF-Talk about using backreferences in regular expression search strings. Not the replacement string, mind you, but the search string itself. This is, as you'd expect, perfectly legal and can be incredibly powerful. While contriving an example, the one I came up with also required the non-greedy modifier, so I'll illustrate both. Here's the code in question (copy it to a CFM file to run it):

<cfoutput>
<cfset baseString = "some 'text' with ""quotes an' some apostrophes"" in it" />
<h2>Quoted Strings within #baseString#</h2>
<ul>
<cfset start = 1 />
<cfloop condition="true">
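  <!--- with the fourth argument set to true, REFind returns a struct of pos and len arrays --->
  <cfset result = REFind("(['""])(.*?)\1", baseString, start, true) />
  <!--- pos[1] is 0 when nothing matches --->
  <cfif result.pos[1] LT 1>
    <cfbreak />
  </cfif>
  <!--- index 1 is the whole match, 2 is the quote character, 3 is the text between the quotes --->
  <cfset string = mid(baseString, result.pos[3], result.len[3]) />
  <cfset quote = mid(baseString, result.pos[2], result.len[2]) />
  <!--- resume searching just past the end of this match --->
  <cfset start = result.pos[1] + result.len[1] />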
  <li>#string# (quoted with #quote#)</li>
</cfloop>
</ul>
</cfoutput>

The backreference (the "\1" at the end of the regex) behaves exactly as it does in a replacement string: it stands for whatever text was matched by the first parenthesized group (in this case, "['""]"). Note that it matches the text the group actually captured, not the group's pattern, so an opening single quote can only be closed by another single quote. Ignoring the middle part, the regex says: find either a single or double quote, then some "stuff", and then the same quote character again.
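To see the "same quote character" rule on its own, here's a minimal sketch that runs the pattern against two short strings. The variable names and both test strings are made up for illustration and aren't part of the original example. The first string closes with the quote it opened with; the second opens with a single quote and closes with a double quote.

```cfm
<cfoutput>
<!--- Hypothetical test strings, not from the example above --->
<cfset pattern = "(['""])(.*?)\1" />
<cfset matched = "say 'hello' now" />
<cfset mismatched = "say 'hello"" now" />
<!--- REFind returns the 1-based position of the match, or 0 if there is none --->
<p>Same quote on both ends: #REFind(pattern, matched)#</p>
<p>Different quotes: #REFind(pattern, mismatched)#</p>
</cfoutput>
```

The first call should report position 5 (the opening single quote), and the second should report 0, because neither quote character ever finds a partner of its own kind.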

In the middle is the non-greedy modifier (the "?" after the asterisk). Without it, the ".*" would match as much as possible while still allowing the regex as a whole to succeed. In this specific case, the first group would match the single quote before "text", the ".*" would match everything up to the last single quote in the string (text' with "quotes an), and then the backreference would match the single quote after "an". That's clearly not what we want, so we use the non-greedy modifier to tell the ".*" to match only as much as it needs to, not as much as it can. Then it behaves correctly. Try deleting the non-greedy modifier and running it again.
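If you'd rather see the two behaviors side by side than edit the loop, a rough sketch like this should do it. It reuses baseString from the example and pulls out only the first match for each version of the pattern; the variable names lazy and greedy are just labels I picked.

```cfm
<cfoutput>
<cfset baseString = "some 'text' with ""quotes an' some apostrophes"" in it" />
<!--- Same pattern with and without the "?" after the asterisk --->
<cfset lazy = REFind("(['""])(.*?)\1", baseString, 1, true) />
<cfset greedy = REFind("(['""])(.*)\1", baseString, 1, true) />
<!--- Index 3 holds the second parenthesized group: the text between the quotes --->
<p>Non-greedy: #mid(baseString, lazy.pos[3], lazy.len[3])#</p>
<p>Greedy: #mid(baseString, greedy.pos[3], greedy.len[3])#</p>
</cfoutput>
```

The non-greedy version should stop at the first closing single quote, while the greedy one runs on to the last single quote in the string and swallows the opening double quote along the way.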

One response to “Regular Expression Backreferences and the Non-Greedy Modifier”

  1. James Allen

    Nice example Barney. I only recently became aware of how backreferencing can be used so this was very useful.

    However, I found a problem with the code sample which I think is probably due to the single and double quotes being escaped by the blog code (maybe).

    Below is the line I had to fix to get it working:

    I'm not sure if this will post properly but here goes. :)