Using Java Regular Expressions In CFML

While doing some regex stuff today I discovered that both ColdFusion and Railo use an external regular expression engine for the REReplace function instead of the one built into the core JRE (the java.util.regex.* classes).  I don't think I've ever had a reason to care until now, but today I was stuck.  Fortunately, since CFML strings are java.lang.String instances, you can use the replaceFirst and replaceAll methods on them.  You can also use the Pattern/Matcher classes directly, of course, but the String methods are easier and usually sufficient.
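To make the two routes concrete, here is a minimal Java sketch of the same replacement done through the String convenience methods and through Pattern/Matcher directly. The class name, method names, and sample input are made up for illustration. The backslashes are doubled only because Java string literals require it; a CFML string literal would not need that.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class StringVsPattern {
    // Explicit route: compile once and reuse the Pattern for many inputs.
    private static final Pattern DIGITS = Pattern.compile("\\d+");

    // Convenience route: String.replaceAll compiles the regex on every call.
    static String viaString(String input) {
        return input.replaceAll("\\d+", "#");
    }

    // Same replacement through the Pattern/Matcher classes.
    static String viaPattern(String input) {
        Matcher m = DIGITS.matcher(input);
        return m.replaceAll("#");
    }

    public static void main(String[] args) {
        String s = "room 101, floor 3";
        System.out.println(viaString(s));                // room #, floor #
        System.out.println(s.replaceFirst("\\d+", "#")); // room #, floor 3
        System.out.println(viaPattern(s));               // room #, floor #
    }
}
```

The Pattern route mostly pays off when the same expression runs against many strings, or when you need the Matcher's extra methods.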

Why might you care?  Lookaround.  Say you want to replace every 'o' that isn't next to another 'o' in the string "oh, I love cookies" with a zero.  Here's how you'd do it:

"oh, I love cookies".replaceAll("(?<!o)o(?!o)", "0") // 0h, I l0ve cookies

What the hell is that mess, you ask?  A negative lookbehind, (?<!o), and a negative lookahead, (?!o).  Both are zero-width: they check the surrounding text without consuming it, so only the 'o' in the middle is actually matched and replaced.  The lookbehind says "only if you're not preceded by an 'o'", and the lookahead says "only if you're not followed by an 'o'."  This is a contrived example, but it illustrates the point.

What's the downside?  You can't do case translation with \u, \l, \U, \L and \E like you can with the CFML-native engine.  That's a really handy feature, and hopefully it'll make it into core Java at some point, but for right now it's not there.  Other than that, no real downside.
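If you do need case translation on the Java side, one possible workaround is to compute each replacement in code with Matcher.appendReplacement. The sketch below upper-cases the first letter of each lowercase word, which is roughly the job \u does in a CFML replacement string. The class and method names are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class UpperCaseFirst {
    // Upper-case the first letter of every word that starts with a-z.
    static String capitalizeWords(String input) {
        Matcher m = Pattern.compile("\\b([a-z])").matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // quoteReplacement keeps any '$' or '\' in the computed text literal.
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1).toUpperCase()));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitalizeWords("oh, I love cookies")); // Oh, I Love Cookies
    }
}
```

It's more typing than a one-line replacement string, but it works for any transformation you can express in code.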

The actual stumbling block I ran into was for an April Fool's day joke: a filter that would replace content in HTML documents, but leave "important" stuff alone.  The filter takes a list of replacements to make on the content, and prefixes each regular expression with this string:

(?![^<]*</(?i)(?:textarea|script|style)(?i)\W)(?<=(?:>|^|\\G)[^<]*?)(?<!&(?:[a-zA-Z][a-zA-Z0-9]{0,25}|#[0-9]{0,25}))

before running it against the content.  That uses a negative lookahead, a positive lookbehind, and a negative lookbehind to ensure that the expression only replaces stuff that isn't nested within TEXTAREA, SCRIPT, or STYLE tags, isn't part of any HTML tag, and isn't part of an HTML entity.  It also uses three non-capturing groups, the ones opened with (?:, to ensure that prefixing the expression with this extra stuff doesn't screw up backreference indexing.
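The backreference point is easy to see with a much smaller prefix. The sketch below uses two hypothetical stand-in prefixes (not the one above) that both require the match to start at a lowercase letter. Only the capturing version shifts the numbering of the groups in the user's expression.

```java
import java.util.regex.Pattern;

class PrefixGroups {
    // Hypothetical stand-in prefixes: same check, but only the first adds a group.
    static final String CAPTURING_PREFIX = "(?=([a-z]))";
    static final String NON_CAPTURING_PREFIX = "(?=(?:[a-z]))";

    // A user-supplied rule that swaps the two halves of "left-right".
    static String swap(String prefix, String input) {
        return input.replaceAll(prefix + "(\\w+)-(\\w+)", "$2-$1");
    }

    // Number of capturing groups a regex defines.
    static int groups(String regex) {
        return Pattern.compile(regex).matcher("").groupCount();
    }

    public static void main(String[] args) {
        System.out.println(groups(CAPTURING_PREFIX));                 // 1
        System.out.println(groups(NON_CAPTURING_PREFIX));             // 0
        System.out.println(swap(NON_CAPTURING_PREFIX, "left-right")); // right-left
        System.out.println(swap(CAPTURING_PREFIX, "left-right"));     // left-l
    }
}
```

With the capturing prefix, $1 refers to the prefix's own group and the user's groups slide to $2 and $3, so the rule silently does the wrong thing.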

With this filter in place, you can supply some arbitrary replacements for the content and have them made on the fly, without actually breaking anything.  For example, if you wanted to replace all 'e' with '3', 'o' with '0', and 'barney' with 'The Supreme Commander', you'd configure it like this:

/e/3/i
/o/0/i
/barney/The Supreme Commander/i

Without the protection from the above prefix, your HEAD tag would become an H3AD tag, and your BODY tag would become B0DY.  Not ideal.  But with the protection, only content gets replaced, not any of the markup, so everything will still render correctly.
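As a rough illustration of the idea, here is a sketch that parses rules in the /pattern/replacement/flags form and applies them with a protective prefix. Everything in it is an assumption made for the example: the class and method names, the parsing, and especially the prefix, which is a much simpler stand-in that only skips text sitting inside a tag. It does not handle TEXTAREA, SCRIPT, or STYLE content or HTML entities the way the full prefix is meant to.

```java
import java.util.regex.Pattern;

class ContentFilter {
    // Simplified stand-in prefix: refuse to match if the next angle bracket
    // ahead is a '>', which means the current position is inside a tag.
    static final String PREFIX = "(?![^<>]*>)";

    // Apply one rule of the form "/pattern/replacement/flags".
    // Limitation of this sketch: the pattern and replacement cannot contain '/'.
    static String applyRule(String html, String rule) {
        String[] parts = rule.split("/", -1); // parts[0] is the empty text before the first '/'
        String regex = parts[1];
        String replacement = parts[2];
        boolean ignoreCase = parts.length > 3 && parts[3].contains("i");
        int flags = ignoreCase ? Pattern.CASE_INSENSITIVE : 0;
        return Pattern.compile(PREFIX + regex, flags).matcher(html).replaceAll(replacement);
    }

    // Apply rules one after another, each seeing the previous rule's output.
    static String applyRules(String html, String... rules) {
        for (String rule : rules) {
            html = applyRule(html, rule);
        }
        return html;
    }

    public static void main(String[] args) {
        String html = "<body class=\"home\">Hello to barney</body>";
        System.out.println(applyRules(html,
                "/barney/The Supreme Commander/i", "/e/3/i", "/o/0/i"));
        // <body class="home">H3ll0 t0 Th3 Supr3m3 C0mmand3r</body>
    }
}
```

Because this sketch applies rules sequentially, the barney rule is listed first here; otherwise the 'e' rule would already have turned the name into "barn3y" before the barney rule got a look at it.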

Completely pointless?  You bet.  An interesting experiment?  That too.  In the end, I shelled out to Groovy (via CFGroovy) and used the Pattern class directly along with some other Java APIs to get my work done.  Still a technique to keep in the toolbox, though.

2 responses to “Using Java Regular Expressions In CFML”

  1. John Allen

    That is very very very sharp.

  2. Chris Carey

    Hey I just stumbled across this and your experiment is *exactly* what I've been wrestling with!

    I'm just trying to understand how you've set this up with the prefix. Are you just appending the prefix to the beginning of your subsequent regex? I can't make that work. With the prefix it doesn't replace anything. Without the prefix it replaces inside the tag attributes (which is what I'm trying to avoid).

    Here's my attempt:

    prefix = "(?![^<]*</(?i)(?:textarea|script|style)(?i)\W)(?<=(?:>|^|\\G)[^<]*?)(?<!&(?:[a-zA-Z][a-zA-Z0-9]{0,25}|##[0-9]{0,25}))";

    str = "some html code";

    regex = "e";
    replaceWith = "3";

    result = str.replace(prefix & regex, replaceWith);

    Any advice?