Syncing Files on the Cloud

A couple weeks ago, I gave my "Replace Your Iron With A Cloud" talk from cf.objective to the CFMeetUp.  If you didn't catch it, you can view the recording on that page.  Several people both during and after the presentation had questions about syncing files between servers.  This doesn't really have anything to do with hosting in the cloud in particular, as you'll have the same problem with multiple physical servers, but the cost effectiveness of cloud computing is often the means by which smaller applications move from a single (physical) server to multiple (cloud) servers.

Due to those questions, I wanted to run through a few points on the syncing issue here, rather than continuing to answer individual emails/tweets/etc.  I'm going to talk about a hypothetical blogging application, as blogs are simple and well understood, and at least for this discussion, illustrate all the interesting bits and pieces.

Before we get started, if you're not using version control, stop reading, go set it up (either yourself or through one of the myriad free/cheap hosting providers), and check your code in. I like Subversion. Git seems to be the current hotness. Stay away from CVS and Visual SourceSafe. Then you can come back and read the rest of this.

First and foremost, every piece of data has a single canonical home.  That is, without a doubt, the most important thing to keep in mind.  So what is the canonical home for these data?  It depends on the type, and as I see it, there are four main ones (with their canonical home):

  1. server configuration: version control (unless you're on Windows/IIS/etc., then you have to be more creative)
  2. source code and assets: version control
  3. user data (e.g., blog posts): your database cluster (you should think of it as a cluster, even if there is only one node)
  4. user assets (e.g., uploaded photos): in the database or on a dedicated hosting service (your own, or third party)

Most of those places aren't very useful for running your application.  So you invariably copy your data around, caching it in various locations for various purposes.  What you have to remember is that every copy of any piece of data outside the canonical home is exactly that: a copy.  If you change it, it doesn't mean anything until the change is propagated back to the canonical home.  Think about your source code.  You're happily hacking away on your local development machine and build some awesome new feature.  That's well and good, but it doesn't "count" until it's been checked into version control.  Your local box is a copy of the data, not the real data.  That same mindset applies for all data, regardless of type.

Let's say our blogging app runs on three application servers atop a two-node database cluster.  We'll assume the cluster is already doing its job of keeping data synced between the nodes, handling failover if there is a problem, etc.  The first thing is server config.  How do we ensure all three application servers are configured identically?  We need to push a copy of the configuration from the canonical home to each server.  Then we know they're identical.  When we need to update configuration, we have to do the same thing.  You make and test the changes somewhere (hopefully on a test site, but perhaps on one of the production boxes), push those modifications back to the canonical home, and then update each server's copy of the data.  Do NOT, under any circumstances, log into each box and manually make the same configuration change on each one.  It'll work, as long as you're perfect, and nobody is perfect.

Now that we know our application servers are all configured the same, the next task is to ensure the code is the same across all of them.  You could (and might want to) consider your code to be part of your server configuration, but for right now let's assume you have servers which will be configured once and run multiple sequential versions of the application.  The best way to get your code from version control to your servers is via a single button click.  This isn't necessarily practical for all situations, but it is an ideal.  The approach I typically take is two button clicks: one to build a release from the current development branch and put it into a special area of the version control repository, and the second to take a release and deploy it to one or more of the production servers.  Note that the "build" process might be nothing more than copying files if you have a simple app, but more likely will at least involve aggregating and compressing JS/CSS resources and possibly compiling code.
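To make the "build" click concrete, here is a minimal sketch in Python of the simplest case: copy the app into a versioned release directory and aggregate its JS into one file.  The directory layout (a top-level "js" folder), the bundle name, and the function name are all made up for illustration, and a real build would likely also compress the bundle and possibly compile code.

```python
import shutil
from pathlib import Path


def build_release(source_dir: Path, releases_dir: Path, version: str) -> Path:
    """Copy the app into releases_dir/version and bundle its JS files.

    The "js" folder and "app.bundle.js" name are hypothetical conventions.
    """
    release_dir = releases_dir / version
    if release_dir.exists():
        # A release is never rebuilt in place; choose a new version instead.
        raise FileExistsError(f"release {version} already exists")

    # Start from an exact copy of the checked-out source tree.
    shutil.copytree(source_dir, release_dir)

    js_dir = release_dir / "js"
    if js_dir.is_dir():
        # Concatenate the individual scripts (in name order) into one bundle,
        # then remove the originals from the release copy.
        sources = sorted(js_dir.glob("*.js"))
        bundle = "\n".join(p.read_text(encoding="utf-8") for p in sources)
        for p in sources:
            p.unlink()
        (js_dir / "app.bundle.js").write_text(bundle, encoding="utf-8")

    return release_dir
```

The resulting directory is what would get checked into the release area of the repository; the second click only ever deploys something that was built this way.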

As for actually getting the code to your servers, my tool of choice is rsync.  It does differential pushes, so if you have a 400MB app (like we do at Mentor) you can still push out releases quickly since most of that won't change every time.  Rsync runs on all major platforms, and doesn't have issues crossing platform boundaries.  It's there by default on pretty much all modern *nixes, and it's simple to set up on Windows with one of various prebundled packages.
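As a rough sketch of the "deploy" click, the Python below builds one rsync command per application server and runs them in turn.  The host names, paths, and function names are placeholders, and it assumes rsync can reach each host over SSH with keys already in place.  The flags are standard rsync ones: -a (archive mode), -z (compress in transit), and --delete (remove remote files that are no longer in the release).

```python
import subprocess


def rsync_command(release_dir: str, host: str, dest_dir: str) -> list[str]:
    """Build the rsync invocation that mirrors release_dir onto host:dest_dir."""
    # A trailing slash on the source means "the contents of this directory".
    source = release_dir.rstrip("/") + "/"
    return ["rsync", "-az", "--delete", source, f"{host}:{dest_dir}"]


def deploy(release_dir: str, hosts: list[str], dest_dir: str,
           dry_run: bool = False) -> list[list[str]]:
    """Push the same release to every host; return the commands used."""
    commands = [rsync_command(release_dir, host, dest_dir) for host in hosts]
    if not dry_run:
        for command in commands:
            # check=True stops the rollout at the first host that fails.
            subprocess.run(command, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical servers and paths; dry_run only prints what would run.
    for command in deploy("releases/1.0.0", ["app1", "app2", "app3"],
                          "/var/www/blog/", dry_run=True):
        print(" ".join(command))
```

Because every server receives the same release directory from the same command, there is no opportunity for one box to drift from the others.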

Alright, now we can be confident that our app servers are all synchronized for the developer data portions of our full data set.  That's the easy part.  The user data is a bit trickier.  For the blog posts, comments, user accounts, etc. the solution is simple.  We have a single database cluster that all of our app servers talk to in realtime for reads and writes.  The database cluster will take care of consistency and availability through whatever magic is needed and supplied by the database vendor.  My typical configuration, just for reference, is two or more MySQL servers running circular (master-master) replication.  Whatever your platform is, you can pretty much get the equivalent.

If you have huge data volume and/or large scaling capacity, this approach probably isn't ideal.  You'll need to look at a distributed database of some sort, either sharding a relational database, or using a more purpose-specific database (e.g., Apache Cassandra) which is designed around a distributed store.  I'm not talking about these use cases.  If you have this kind of problem, hire someone to help solve it.  At that scale there is no such thing as a general purpose solution.  General tools, yes, but the specific use of them is bound tightly to your application's needs.

Now the worst of the four: user assets.  Why are these bad?  Because they're typically accessed directly.  For example, if a user uploads a photo to display in their blog post, the web browser wants to request that image after it gets the page with the post in it.  Which means with our three app servers, when we put the post into the database with the image reference, we need to get the image itself onto the other two application servers.  This is the wrong approach.  Remember the canonical home?  That can't be three places, it has to be one.

The most obvious might be to just stick the image into a BLOB field in the database.  That certainly works, it's tried and true, and it doesn't require any new infrastructure.  However, it means that every request for an image has to go through your application code to make a SQL query to get the bits and then stream it back to the browser.  That can be a big performance issue, especially if you have large files, as while the bits are streaming out, you're consuming both a database connection and an app server thread.  If it takes the browser 20 seconds to pull down a huge image and you're burning a thread and a connection for that entire duration, you're going to run into scalability issues very quickly.  However, if you have low concurrency, this is a perfectly viable approach.

A much better solution, especially for assets which need to be served back to clients directly, is to set up a static file hosting "cluster" analogous to the database cluster.  That means one or more nodes which will internally take care of data synchronization amongst them.  This may seem like we're back to the same "sync the file across the app servers" problem, but it's not.  The difference is that the application now has a single place to stick something (the cluster) so it doesn't care about replicating it for durability/scalability.  We can separate that concern out of the application and into a dedicated system.  In effect, this is just another database system, except it's managing filesystem assets instead of tables of data, but we don't care too much. It's the separation that is important.

This sort of model also allows multiple applications to interface with the same assets far more easily than if the assets are stored within an application's database.  Say this blogging application is really successful and you want to spin off a photo sharing site.  It'll be a separate application, but you want to let users easily leverage their blog photos on the photo sharing side, and be able to easily insert photos from their galleries into their blog posts.  By having the images stored outside of the blogging application, it becomes far easier to reuse them.

And this brings us to my favorite of the cloud services from Amazon: Simple Storage Service or S3.  S3 is exactly what I've outlined in the two previous paragraphs.  It's incredibly inexpensive, ridiculously scalable, and has interface libraries available for pretty much every major language you could want.  It's simple to use, supports very fine grained security (should you need to have private or semi-private assets), and just generally solves a whole bunch of nasty problems.
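For a feel of what "a single place to stick something" looks like from the application's side, here is a small Python sketch.  It assumes the client object is a boto3 S3 client (for example, one created with boto3.client("s3")), whose put_object call takes Bucket, Key, Body, and ContentType.  The bucket name, the "uploads/" prefix, and the idea of naming the object after a hash of its contents are illustrative assumptions, not anything S3 requires.

```python
import hashlib


def store_upload(s3_client, bucket: str, data: bytes, content_type: str) -> str:
    """Store an uploaded file in the asset store and return its key.

    s3_client is assumed to behave like a boto3 S3 client.
    """
    # Hypothetical naming scheme: the key is derived from the content, so
    # the same bytes always land at the same key.
    key = "uploads/" + hashlib.sha256(data).hexdigest()
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=data,
        ContentType=content_type,
    )
    return key
```

The application would save the returned key alongside the blog post in the database.  None of the three app servers ever holds the file itself, so there is nothing to sync between them.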

So to wrap this all up, I'm going to run through the flows:

First, when we want a new server, we configure it from our repository.  If we need to change the configuration, we change it in the repository and push it out to all the servers.

Next, when we want to change the application, we make our mods and commit them to the repository, and then use an automated process (which, I might add, should be configured from version controlled configuration) to build a release and push releases out to our application servers.

When a user is using our application, we'll take any data they give us and place it in a single location for use by all application servers.  In the case of simple data (numbers, strings, etc.) it'll go into some kind of database.  For binary data (images, zips, etc.) it'll go into an asset store.  In either case, all accesses and modifications to the data by the application are directed at a single repository, so there is no need to do synchronization.

Oof.  That's a lot.

The last thing is to consider caching.  For this particular example, let's assume we're going to stick our blog uploads into the database.  We don't have the resources/ability/time to adopt something like S3, and the limitations on using the database for BLOB storage are acceptable.  So how can we optimize?  Well, the easiest thing is to make BLOBs immutable (so if you want to upload a new version of something, you'll get a new BLOB rather than update the existing BLOB, and the application will track lineage), and then you can cache them wherever you want without having to worry about change.  For example, you have your images at URLs of the style /files/?id=123.  When a request comes in, the application server looks on its filesystem for the image to stream back.  If it's not there, it'll query the database to get it, and then both write it to the filesystem and stream it out.  On the next request, the file will already be there, so the database won't be queried.
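Here is a minimal sketch of that lookup in Python, using sqlite3 purely as a stand-in for whatever database you actually run.  The table name (blobs), its columns (id, data), and the cache layout (one file per id) are assumptions made for the example.

```python
import sqlite3
from pathlib import Path
from typing import Optional


def fetch_blob(db: sqlite3.Connection, cache_dir: Path, blob_id: int) -> Optional[bytes]:
    """Return the bytes for blob_id, preferring the local filesystem copy."""
    cached = cache_dir / str(blob_id)
    if cached.is_file():
        # Cache hit: BLOBs are immutable, so the local copy cannot be stale.
        return cached.read_bytes()

    # Cache miss: go back to the canonical home.
    row = db.execute("SELECT data FROM blobs WHERE id = ?", (blob_id,)).fetchone()
    if row is None:
        return None

    # Write to a temporary name and rename it into place, so a reader
    # never sees a half-written file under the final name.
    cache_dir.mkdir(parents=True, exist_ok=True)
    tmp = cached.with_suffix(".tmp")
    tmp.write_bytes(row[0])
    tmp.replace(cached)
    return row[0]
```

Note that the cached file is disposable.  Delete the whole cache directory and the only cost is one extra query per file, because the database is still the canonical home.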

We can do better than that, though, because we really don't want the application server involved on those cached requests.  By using something like Apache HTTPD's mod_rewrite, we can have URLs like /files/123.jpg.  When a request comes in and the file exists, Apache will stream it back without involving our application at all. If the file doesn't exist, Apache will rewrite it to something like /files/?id=123 which will do the same thing as before: query from the database, save to the filesystem for future requests, and stream the file back to the browser.
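The decision Apache makes in that setup can be sketched in a few lines of Python.  To be clear, this is not mod_rewrite configuration, just an illustration of the same logic.  The function name, the document root argument, and the ".jpg"-only pattern are assumptions for the example.

```python
import re
from pathlib import Path

# Matches URLs of the style /files/123.jpg and captures the numeric id.
_FILE_URL = re.compile(r"^/files/([0-9]+)\.jpg$")


def route(url_path: str, doc_root: Path) -> tuple[str, str]:
    """Decide who serves a request: the web server itself or the application.

    Returns ("static", filesystem_path) or ("app", url_to_hand_to_the_app).
    """
    match = _FILE_URL.match(url_path)
    if match is None:
        # Not a file URL at all; the application handles it unchanged.
        return ("app", url_path)

    candidate = doc_root / url_path.lstrip("/")
    if candidate.is_file():
        # Already cached on disk: serve it without touching the application.
        return ("static", str(candidate))

    # Not cached yet: rewrite to the application URL that will query the
    # database, save the file for future requests, and stream it back.
    return ("app", f"/files/?id={match.group(1)}")
```

For this to work, the application has to save the fetched file at exactly the path the static URL maps to (here, files/123.jpg under the document root).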

Why did I stress immutability before?  Because this mechanism, while very simple, has no way of detecting if a locally cached file has been updated in the database.  There are various ways to get around this issue, and I'm not going to go into details.  What's important is that user data, just like your source code, doesn't have to be used directly from its canonical home – it can absolutely be copied around for whatever reason (usually performance) as long as you still have a single canonical home for it.

This has now morphed into an incredibly long post.  I probably ought to have broken it up into several pieces.  Perhaps next time.  :)
