Efficient Caching With mod_rewrite

Caching with mod_rewrite?  What?  I'll admit it's a slightly misleading title; the cache is actually a disk cache, but mod_rewrite is where the magic happens.  Bear with me for a moment…

Most content on the web is fairly static.  Some of it changes every few minutes, some changes every few hours, some changes a few times a month, and the vast majority of it changes approximately never.  However, a large percentage of it is generated dynamically, every request.  Maybe it's news articles, maybe it's thumbnails for images/pdfs/videos, maybe it's RSS feeds, but identical content is dynamically generated over and over again.  Huge waste of resources.

On the flip side, you can use pre-generation to build stuff ahead of time so you can serve everything statically.  However, that can be ridiculously expensive as well.  For example, my blog has several hundred (if not thousands of) distinct feeds available on it: the main one (listing posts), one per category (posts), one per author (posts), the main comment feed (listing comments), and one per post (comments).  Each of those is available in RSS 2.0, Atom 0.3, and RSS 0.92 formats.  Pregenerating those all the time is silly, because the vast majority of them will never be accessed, let alone frequently.

Ideally, we'd be able to generate these resources dynamically, on demand, but then keep the output around to serve back statically for subsequent requests.  This saves us the expense of pregenerating lots of stuff that will never be accessed, but gives us the speed of static access after the first request.

Duh, Barney, what's your point?

My point is that while this is, conceptually, the obvious solution, it's also ridiculously trivial to implement.  It'll take longer to read this post than to set it up.  As such, there's no excuse for being resource constrained on non-user-specific resources, even though this seems to be a really common complaint.

Here's a more concrete example.  Say I host photo galleries, allowing people to upload their full-size images, and I provide several views of the galleries with appropriate thumbnails.  Those pages are littered with things like this:

<img src="/gen_tn.cfm?id=12345&width=100&height=100" />

This is great, because I can create arbitrarily sized thumbnails without having to go back and regenerate them for all existing photos.  That's handy when I create a new layout and realize I want 125×125 thumbnails instead of 100×100, and then want to use 250×250 for the 'featured' section.  But I'm generating the thumbnails dynamically every request, which is a waste.  And adding caching in gen_tn.cfm is the wrong answer.  : )

First, let's change the URLs in the pages to look like this:

<img src="/tn/p12345-100x100.jpg" />

Same information as before, just packaged differently.  Then I'll use the following RewriteRule to (internally) turn it back into the original request to gen_tn.cfm (effectively a no-op):

RewriteRule  ^/tn/p([0-9]+)-([0-9]+)x([0-9]+)\.jpg$  /gen_tn.cfm?id=$1&width=$2&height=$3  [PT,L]

Lipstick on a pig, you might say, and you'd be almost right.  We now have normal-looking URLs for our thumbnails (lipstick), but they're still dynamically generated every request (on a pig).  This abstraction, however, is incredibly powerful.  Let's add a RewriteCond in front of that rule real quick:

RewriteCond  %{REQUEST_FILENAME}                     !-s
RewriteRule  ^/tn/p([0-9]+)-([0-9]+)x([0-9]+)\.jpg$  /gen_tn.cfm?id=$1&width=$2&height=$3  [PT,L]

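That says to only do the RewriteRule if the requested file doesn't exist or is zero length ('-s' says a regular file with non-zero length, the '!' negates it).  One caveat: in server or virtual-host config, %{REQUEST_FILENAME} hasn't been mapped to a filesystem path yet when the condition runs, so there you'd test %{DOCUMENT_ROOT}%{REQUEST_URI} instead.  Next step is to create the 'tn' directory in your web root and ensure it's writable by your application server.  You can probably see where I'm going with this…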
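To make the decision explicit, here's a rough Python model of what that RewriteCond/RewriteRule pair does for each request.  It's only an illustration of the logic, not how Apache implements it; the route function and its return values are made up for the example.  Note that mod_rewrite matches the rule's pattern first and only then evaluates the conditions, so the disk check happens only for thumbnail URLs.

```python
import os
import re

# Same pattern as the RewriteRule (anchors are implied by fullmatch below).
THUMB_RE = re.compile(r"/tn/p([0-9]+)-([0-9]+)x([0-9]+)\.jpg")


def route(doc_root, url_path):
    """Return ("static", path) or ("dynamic", rewritten_url) for a request."""
    m = THUMB_RE.fullmatch(url_path)
    if m is None:
        # Pattern didn't match: the rule is skipped, no disk check at all.
        return ("static", url_path)
    file_path = os.path.join(doc_root, url_path.lstrip("/"))
    # Equivalent of '-s': a regular file with non-zero length.
    if os.path.isfile(file_path) and os.path.getsize(file_path) > 0:
        return ("static", url_path)
    # Equivalent of '!-s' succeeding: hand the request to the generator.
    photo_id, width, height = m.groups()
    return ("dynamic", f"/gen_tn.cfm?id={photo_id}&width={width}&height={height}")
```

A missing or zero-length file routes to gen_tn.cfm; anything else is served straight from disk.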

The final step is to tweak gen_tn.cfm slightly.  Currently, it creates the thumbnail and serves it back to the client.  We need to change it so that before serving it back, it writes it to disk in that new 'tn' directory, using the appropriate filename.  Once that's done, send it to the client as usual.  The next time the thumbnail is requested, Apache will hit the RewriteRule, but the RewriteCond will not match (because the file exists and has length).  As such, it won't be rewritten to gen_tn.cfm, and will instead be served statically, directly from disk, bypassing the application server completely.
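The gen_tn.cfm change depends on your application server, so here's just the shape of it as a minimal Python sketch rather than real CFML.  The serve_thumbnail name and the render callback are hypothetical stand-ins for whatever already produces the JPEG bytes.  The temp-file-then-rename step goes a little beyond what's described above; it's there on the assumption that you don't want Apache's '-s' check to see a half-written file.

```python
import os
import tempfile
from pathlib import Path


def serve_thumbnail(cache_dir, photo_id, width, height, render):
    """Generate a thumbnail, save it under the URL's filename, return the bytes."""
    data = render(photo_id, width, height)  # hypothetical: returns JPEG bytes
    target = Path(cache_dir) / f"p{photo_id}-{width}x{height}.jpg"
    # Write to a temp file in the same directory, then rename into place,
    # so a concurrent request never sees a partially written thumbnail.
    fd, tmp_name = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
        # mkstemp creates owner-only files; make it readable by the web server.
        os.chmod(tmp_name, 0o644)
        os.replace(tmp_name, target)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return data  # send this to the client as usual
```

The caller still sends the returned bytes to the client, exactly as before; the only new behavior is the file left behind in 'tn'.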

With those couple of simple changes, you suddenly have a ridiculously effective caching mechanism in place.

What about changes to the source, though?  You realize one of your photos (#12345) was miscropped, so you fix it and upload a new version, but you want your thumbnails to be regenerated too.  Fortunately, flushing the cache is as simple as deleting all files in 'tn' that match '*p12345*.jpg'.

Same thing goes for deletions.  If you decide you just want to remove photo #12345 completely and want to remove the thumbnails too, run the same deletion of '*p12345*.jpg' from the 'tn' directory.  Or if you stop using 100 pixel thumbnails (like when I switched to 125×125 a few paragraphs ago), you can just delete '*100x100*.jpg'.
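Those flushes can be scripted in a few lines.  Here's a sketch in Python, assuming the 'p{id}-{width}x{height}.jpg' naming from above; the function names are made up.  One deliberate difference: it uses slightly tighter patterns than the wildcards quoted in the text, because '*p12345*.jpg' would also catch photo 123456.

```python
from pathlib import Path


def flush_photo(cache_dir, photo_id):
    """Delete every cached thumbnail for one photo; return how many were removed."""
    removed = 0
    # 'p12345-*.jpg' rather than '*p12345*.jpg', so photo 123456 is left alone.
    for path in Path(cache_dir).glob(f"p{photo_id}-*.jpg"):
        path.unlink()
        removed += 1
    return removed


def flush_size(cache_dir, width, height):
    """Delete every cached thumbnail of one size; return how many were removed."""
    removed = 0
    # The leading '-' keeps '100x100' from matching inside '1100x100'.
    for path in Path(cache_dir).glob(f"*-{width}x{height}.jpg"):
        path.unlink()
        removed += 1
    return removed
```

Either one can run from the upload handler or a cron job; the next request for a deleted thumbnail simply regenerates it.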

Because you're using the filenames as an index of sorts, it means you have to name your files carefully.  The filename needs to contain not only everything to uniquely specify the file (photo ID, width, and height in this example), but also everything that you might want to use for clearing the cache.  For example, if you need the ability to clear based on gallery ID you'd need to change the URL to '/tn/g123-p12345-125x125.jpg' or something.  In this case the gallery ID isn't needed for unique specification, only for flush selection.
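As a sketch of that idea, here's how the gallery-prefixed name might be built, parsed, and used for a per-gallery flush in Python.  The 'g{gallery}-p{photo}-{width}x{height}.jpg' layout follows the example URL above; the helper names are hypothetical.

```python
import re
from pathlib import Path

NAME_RE = re.compile(r"g([0-9]+)-p([0-9]+)-([0-9]+)x([0-9]+)\.jpg")


def thumb_name(gallery_id, photo_id, width, height):
    """Build the cache filename; the gallery ID is there only for flushing."""
    return f"g{gallery_id}-p{photo_id}-{width}x{height}.jpg"


def parse_thumb_name(name):
    """Return (gallery_id, photo_id, width, height), or None if it doesn't fit."""
    m = NAME_RE.fullmatch(name)
    return tuple(int(g) for g in m.groups()) if m else None


def flush_gallery(cache_dir, gallery_id):
    """Delete every cached thumbnail in one gallery; return how many were removed."""
    # 'g123-p*.jpg' matches gallery 123 but not gallery 1234.
    paths = list(Path(cache_dir).glob(f"g{gallery_id}-p*.jpg"))
    for path in paths:
        path.unlink()
    return len(paths)
```

The RewriteRule pattern would need a matching extra group for the gallery ID, even though gen_tn.cfm never has to look at it.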

The net of this is that you can hit that sweet spot: avoiding any extra work generating resources that aren't accessed, and never generating the same resource more than once.  Obviously the first request to a resource has to wait for generation, so this technique isn't suitable for all use cases, but it covers a huge swath of them.  It's especially well suited to situations where you have a large number of resources with relatively light usage across them, and/or need the ability to change the derived resources' specifications (e.g. new thumbnail dimensions or new XML feed formats).

As you'd imagine, PotD (NSFW, OMM) uses this technique extensively for several classes of thumbnails as well as RSS feeds.  It also does some pre-generation where the first-request delay is unacceptable.  I also used this to great effect at my previous employer's for front-end caching of CMS-generated HTML pages.  We handled hundreds of millions of pages per day on a pair of single-P4 servers with 1GB of RAM each, with an average cache life of between two and four hours.

One significant gotcha is that you only get full-request caching with this technique.  I.e. you can't cache portions of a request's response, because it's either fully dynamic (the first request) or fully static (subsequent requests).  For example, most blogs have a "remember me" feature so you don't have to enter your information each time you want to comment, which makes part of an otherwise cacheable page user-specific.  In order to beat this, you need some sort of two-phase generation where the cache happens between the phases, and that means you have to have your application running "above" the cache.  Ajax can be used as the second phase, but that's a disaster waiting to happen, if you ask me.

9 responses to “Efficient Caching With mod_rewrite”

  1. Jamie Krug

    Hey, Barney, this is really interesting stuff — keep it coming!

    In fact, I was working on some mod_rewrite stuff just this morning, which is slightly related. I was using the !-s in a RewriteCond for a slightly different purpose, but I became concerned that this could actually worsen performance in my case. I assume "RewriteCond %{REQUEST_FILENAME} !-s" must hit the file system every time, since it's checking for the existence of a file on disk and its size, so there must be some penalty there, right?

    In your case, I'd imagine it's probably worthwhile, because that's certainly faster than regenerating an image every time.  However, that condition check occurs on *every* other request (unless you can use other RewriteCond/RewriteRule logic to "short-circuit" when the !-s check is not required).

    Here's my situation: I'm simply using a rewrite rule for a CMS so I can avoid /index.cfm/alias-stuff/here/ type URLs and rewrite something like /alias-stuff/here/ instead. However, there are other *.cfm URLs on the site that I do not want to rewrite. So, here was my first pass, which works great:

    RewriteCond  %{DOCUMENT_ROOT}/%{REQUEST_FILENAME}  !-d
    RewriteCond  %{DOCUMENT_ROOT}/%{REQUEST_FILENAME}  !-s
    RewriteRule  ^(.*)$  /index.cfm%{REQUEST_URI}  [NE,QSA]
    

    When I became paranoid about the disk hits, I changed it to the following, which has one RewriteCond that will be met as long as there is no period (.) in the REQUEST_URI (because my CMS-aliased URLs will never contain a period, but any URL with a file extension will):

    RewriteCond  %{REQUEST_URI}  !^(.*\..+)
    RewriteRule  ^(.*)$  /index.cfm%{REQUEST_URI}  [NE,QSA]
    

    This seems to work great for my current situation, but isn't quite as universally reusable as the first option. My thinking is that the one condition with a regex evaluation will perform better than 2 disk hits, which would occur on every request, even the many requests for images, CSS, JS, etc. Am I way too paranoid, or does this make sense to you?

    One other option I considered was a condition that ensured specific extensions were not in the URL, but it seemed unruly and more difficult to maintain:

    RewriteCond  %{REQUEST_URI}  !^(.*\.(avi|bmp|cfc|cfm|cfml|cfr|css|csv|doc|gif|htm|html|jpg|js|mov|mp3|mpeg|php|pdf|png|ppt|swf|txt|xls))
    

    Finally (sorry this is such a long-winded comment!)… You may have noticed that I used this:

    RewriteCond  %{DOCUMENT_ROOT}/%{REQUEST_FILENAME}  !-s
    

    …instead of this:

    RewriteCond  %{REQUEST_FILENAME}  !-s
    

    Everything I've read, including your example, suggests that %{REQUEST_FILENAME} would represent the full physical path, but I had to include "%{DOCUMENT_ROOT}/" in front to get it to work. Do you know of another Apache configuration that would cause this issue? FWIW, I'm doing all of my rewrite rules inside my vhosts, not .htaccess.

    Thanks!

  2. Jamie Krug

    Barney,

    Dude, I apologize — my first comment is literally half the length of your post! Here's a quick follow up…

    Along the lines of my prior comment — I'm assuming Apache is smart enough to stop testing chained RewriteCond if one fails (since adjacent RewriteCond lines are chained by implicit AND's), so in your example I think you could save those microseconds on the disk hit that I'm worried about ;-) You could add this first RewriteCond prior to what you show in the example… I think?

    RewriteCond  %{REQUEST_URI}  ^/tn/p([0-9]+)-([0-9]+)x([0-9]+)\.jpg$
    RewriteCond  %{REQUEST_FILENAME}  !-s
    RewriteRule  ^/tn/p([0-9]+)-([0-9]+)x([0-9]+)\.jpg$  /gen_tn.cfm?id=$1&width=$2&height=$3  [PT,L]
    
  3. Jamie Krug

    Wow, Apache is smart :) That's great info, and the WordPress MU sample is extremely helpful. Thanks!

  4. Henry Ho

    hmm… does this work with URL Rewrite on IIS7?

  5. Matthias Luther

    Hi,

    I'm trying to write a script with Rewrite to set the caching of images to a month and to compress the JS scripts. It is for a static HTML page.

    The problem is I have no idea how to manage that! Is it possible in a .htaccess?

    Greetings & Thanks,
    Matthias

  6. Matthias

    The Script that I tried is:

    RewriteRule /

    [redirect=permanent,cachelifetime="1 month"]

    But it doesn't work!!!!