A Final Note

The past years, months and weeks have held a huge amount of change for me.  Much of that change had immediate effects, but some of it has been more subtle.  Three major effects are manifesting at present: a new career, a new job, and a new blog.  All of them arrive at the year boundary, which is, for some reason, the usual time for major change in my life.

First, I am embarking on a new career, that of a software craftsman.  I have been a CFML developer for over a decade, and while it has earned me a nice income, it hasn't been terribly fulfilling.  CFML itself is part of the problem, but the places where CFML is widely accepted and adopted are a bigger portion.  I seem to be quite proficient at using CFML to build web applications  above spec and under budget, but I've come to realize over the years that my objectives are within the craft of software development, not in making piles of money.

Facilitating this shift of focus is a change of jobs.  I've worked at Mentor Graphics for just over five years, and the team I have been a part of has grown by leaps and bounds.  We have morphed from a group of CommonSpot hackers into a team responsible for a pile of CFML, Python, and PHP applications, along with all the infrastructure to run them scalably and reliably.  It's worth mentioning that our team is a chunk of Mentor's corporate marketing department, and our IT department comes to us for expert opinions.  I have given my notice, and my last day at Mentor will be January 6th, 2012.

On the 9th I will be starting a new job with a company providing healthcare information management and analytics software.  The company (ARMUS) was recently acquired by Burr Pilger Mayer (the largest accounting firm in California), though it still runs as a largely autonomous division within.  The team is small, the projects are fast, the impact is huge, and the business puts the focus on internal quality, not just external deliverables.  Plus it's centered around huge masses of data and the extraction of interesting information, which is something I'm quite fond of.  I don't think it's a coincidence that Kim is an epidemiologist, doing the job ARMUS's software is intended to facilitate.

Finally, I'm moving my online home to programmerluddite.com instead of barneyb.com.  You can read more about the reasons over there.  I'll still be writing about software development and whatever else suits my fancy, but starting afresh.  This will be the final post I make on this blog, excepting a couple updates to a couple open source projects in the coming month.

This sort of "leave everything behind" transition is very uncharacteristic of my nature.  However, in this case I feel that a transition would be exactly the opposite of what is needed.  Yes, that means you'll need to update your feed readers, but no content is changing homes.  The downloads, post links, and comment subscriptions here on BarneyBlog will remain exactly as before, as will all the other bits and pieces on barneyb.com.

Hope to see you all over at my new home, though I will probably be rather quiet for a bit as the rest of this welling change stabilizes.

The Importance of Development Environments

My first job as a software developer was in 1998, while I was an undergrad at the University of Arizona in Tucson, AZ.  The project was simple: build a JavaScript application to front a website that helped people incorporate a new business.  This was long before "ajax" became the new hotness, but I'd built full windowing environments in JS (supporting both NS4 and IE4) back in the mid '90s so this was old hat for me.  The client had a single environment.  Production.

This was not ideal.

My next job was after I'd dropped out of school, up in Portland, OR.  It was a small web development agency and we had a dedicated development server in our office which we worked on, and then separate production servers colocated offsite.  This was much better, but still meant we (the developers) stepped on each other's toes all the time.

Getting better…

Next came working for a university on an internal project, still in Portland, OR.  I was the sole coder, and while I had my full development environment on my local machine, the most striking difference wasn't that it was local (and therefore fast and easy to screw with), but that I didn't have to compete with other devs' changes.  This was a big boon, and my productivity demonstrated it.  But still no version control, if you can believe it.  I still don't understand why that isn't forced on students in CS 101 before the first code project.

Almost there…

My first "real" job was up in Bellingham, WA with a company which sold communications management software.  The whole app was ColdFusion and sold with a SaaS model (about the same time as Salesforce.com was conceived).  There we not only had a production cluster, we had a staging environment, a dev (bleeding edge) environment, and per-developer environments on our local workstations.  It was here that I learned about version control, saw that it was a Good Thing(tm), and put on my tantrum face and forced everything to halt until we had it implemented.  We used CVS, and then upgrade to SVN just after it went 1.0 (after watching it get closer and closer with bated breath).

We've got it!

My next job was back in good old Portland, OR.  I had to create SVN infrastructure, but no worries: I knew what we needed, especially with six developers.  I also had to create local-workstation development environments, but again, a completely worthwhile investment.  The trick was learning to leverage all of this, which brings me to the point.

Every developer should be able to blindly and without regard for any consequences thrash their working codebase to see if random idea X works.

It has taken me a decade to really appreciate how important this simple idea is.

If you can't do this with your current development environment, you should invest some time in making it possible.   It means you need to be able to break any and every piece of code without affecting anyone else.  Including yourself if you need to fix some urgent bug all of a sudden.  It means you should be able to use version control if your idea takes 75 commits to test out, without affecting anyone else.  Including yourself as before.  It means that if your idea works, you shouldn't have to struggle to share your solution and get it into the next nightly.  Even if you're the only developer.

So what do you need?  You need (at least):

  1. version control which supports concurrent editing, branching and merging
  2. a completely isolated instance of your software project with its own working directory
  3. the ability to create n copies of #2
  4. a habit of small atomic commits (commit early, commit often)
  5. the confidence in your setup to trust that no matter how bad you screw things up, you can always get back to any previous state whether it was five minutes or five months ago

The first point is a no-brainer, but is vitally important as it is the only way to get the remaining four items.  Points two and three are different sides of the same thing, as #3 is really just #2 where you have two personalities (normal dev and crazy hacker dude).

Point four is a learned skill.  This isn't a piece of technology you adopt or something you do, it's a habit you consciously work to build because you want to be a better developer.  It's hard.  Everyone knows it's far easier to just take a pile of changes you've made and commit them with a message like "fixed stuff".  Suppress that urge.  Don't give in to the easy road.  Spend the time to fix one thing, ensure it works, ensure it's atomic, and commit it.  Then move on to the next thing.  If you're working on eight things at once, first go talk to your boss because they're clearly insane, and then think about point three and do each task in a separate working copy (perhaps even against a separate branch).  This might seem like pure hassle and no benefit, but it's the only way to make the investment you've made with points one to three actually pay off.  Version control repositories with non-atomic commits are certainly better than no version control at all, but the non-atomicity severely hamstrings their primary objective: the ability to act as a time machine.

Which brings me to the final point.  Confidence is a learned skill.  It comes partially from trying and succeeding, but it mostly comes from trying and failing and learning that your precautions keep you safe.  It's important to pick up random idea X and give it a whirl.  Really important.  And the only way to go about it is with the confidence that you can recover from anything.  Anything.  If you don't have that confidence, you're going to go about it in a half-assed manner in case something unexpected comes up.  It's human nature; don't feel embarrassed.  What you have to realize is that because we are software developers, we have a distinct advantage.  We have a time machine.  We can go back in time, to any point in time, and start over fresh.  The lack of physical deliverables gives us this advantage, and we'd be fools not to take full advantage of it.

If you can't do that today, I strongly encourage you to think about ways you can get closer.  Don't worry about trying to get there in one step, just get closer.  Do that repeatedly and you'll get there, and every step closer has tangible benefit to future development.  It is worth every ounce of energy it takes to get there.  It's also important to look at new projects with the same mindset; setting yourself (and your fellow developers) up for this from the get-go is simple if you have the forethought to do it.

I have confidence in my development environment.  It's not perfect, but I have no qualms about ripping the heart out of a 300,000 line subsystem and seeing if I can put it back together.  That confidence means I can act.  I don't have to worry about consequences until I can reasonably weigh their ROI.  No more "what ifs".  Interestingly, every Star Wars nerd's favorite line is exactly wrong when applied to software: it's not "do or do not, there is no try" but rather "try, there is no do or do not".

.NET/Silverlight Developer Needed

The State of Oregon Public Health Department needs a rockstar .NET and Silverlight developer to customize a health and hospital information application.  The position is a six-month contract, and is on-site at the State offices in Portland.  There isn't an official job posting up yet (should be later this week), but if you want to send a resume I'll pass it along to the right people.  This looks like a really cool project, both from a technical perspective and from the impact it'll have on people's lives.  If not for the fact I already have a full-time job, I'd be all over this (including having to learn Silverlight in my spare time so I'd qualify).

Adobe, Mobile Flash Player, JavaScript, etc.

Before you skip over this because you know me as a Flash hater, give me two seconds.  That's not what this post is about.  It's about a larger issue.  It's about how awesome it is to be a web developer these days.

Ten years ago, being a web developer sucked.  Deployment was easy (rsync to production and done), but the tooling available to us was dismal.  And I mean crying-naked-in-a-snowstorm dismal.  Browsers were inconsistent, their built-in programming environment (read JavaScript) was reasonably functional but horridly slow, and hardware wasn't beefy enough to deal with scripting languages for hard-core number crunching.

But we had the Flash Player.  Flash provided an environment that beat all three of those problems, and beat them soundly.  It was consistent across browsers and operating systems, used a language similar to what we were used to (both JavaScript and ActionScript are ECMAScript implementations), and it let the developer compile the script into something a little lower level to run in a dedicated VM on the user's machine, which meant it was faster.  Of course, Flash is an animation toolkit, but we figured out how to bastardize it with a single-frame movie containing an include-and-stop script so we could build applications entirely with script (like we wanted), but leverage the Flash player to actually run them.  Not to mention the rich support for visual stuff.  All this led to the concept of the RIA (Rich Internet Application): something similar to what we had on the desktop, but deployed to the web with all the benefits (and some drawbacks) that has.

Then browsers got their act together.  We started seeing a unifying focus on application development with the browser as the environment.  People got serious about fast JavaScript runtimes.  Standards were written (e.g., CSS2/3, Canvas) and largely adhered to.  And hardware got faster.  JavaScript application frameworks (EXT, YUI, GWT, etc.) showed up to leverage all that, and now it was possible to build RIAs using standards supported by a wide array of vendors.

In order to compete with this, Adobe released Flex, which is an application development framework for deploying to Flash Player.  It was horridly expensive, difficult to work with, had all kinds of implementation problems, but was better than what you got with JavaScript.  For a while.  Unfortunately a single software company, even one of the largest in the world, couldn't hope to compete with the widespread interest and momentum around browser-native RIA development.  Flex died as a web application framework pretty much before it was released.  Which isn't to say it wasn't used (it was and still is), but the browser RIA juggernaut crushed it like a bug.

Then we saw the huge surge in mobile devices.  It started with smartphones and now includes tablets and e-readers of various form factors.  Fortunately for both consumers and manufacturers, the web wasn't new, so they were able to jump right on top of all the standards and browser capabilities which had been created for the desktop.  A huge market segment opened up for web application developers and browser-native was reasonable right out of the box.

Adobe again tried to compete by creating Mobile Flash Player, but the benefits of Flash Player are small on new mobile devices, especially considering that so much of a mobile device experience is through web-connected native applications, not traditional web applications.  And here we have Adobe's smartest move yet around Flash Player: killing it for the mobile market in favor of restructuring the ecosystem around using it as a development environment for native applications.

Unfortunately, I don't think it's going to matter because Flash is still really slow and heavy and it's not really that much better to develop with than the truly native dev kits.  Yes, it offers the promise of cross-platform deploy, but just as Java demonstrated 10-15 years ago, that isn't much of a selling point for a single application.  People expect not just native execution, they expect native idioms, which means you have to develop for a specific platform, even if you're using a cross-platform toolkit.

So where does that leave us?  As consumers it leaves us in a great spot: there are lots of ways developers can deliver engaging applications to us, on all our devices.  As developers it leaves us in a weird spot: we're stuck with being either web or native developers, with Flash now trying to occupy a sort of middle ground (develop web-style, deploy native).  On the desktop I think Flash (via AIR) holds some promise, but ultimately I don't think it'll last.  The platform just isn't compelling enough to justify the dedication it requires to use it.  Java was designed for very much the same purpose as AIR, and virtually no one uses it that way anymore; Java moved almost entirely server side, something which Flash isn't likely to do.

Most interesting is the PhoneGap/Titanium/etc. movement, which is very much paralleling the browser resurgence I talked about earlier.  Huge communities of people are working to take all the skills we have as web application developers and give us a build process to take web-ish apps and compile them into native applications, in much the same way Adobe has used AIR to compile Flash into native apps.  However, I think Flash is going to lose in exactly the same way, and for exactly the same reasons, as it lost in the browser.

Bottom line, if you use the web or a web-connected device (read: everybody) the world is going to be glorious in a couple years and only get better.  If you're a developer trying to work in that space, you need to learn browser technologies.  It's the way of the future.  Flash had its run, but it's been on the way out for a long while.  It'll stick around, just like COBOL and Fortran have, but without an alternative path like the one Java ended up taking, it isn't going to stay relevant to mainstream developers.

CFSCRIPT-based Query of Queries Gotcha

I'm hardly the first to blog about this (see here or here), but if you're using the new-in-CF9 Query object to execute a query of queries (QofQ), you'll run into scope issues.  Specifically, the CFQUERY tag executes in the context it is written, but the Query object is a CFC and so follows all the normal CFC scoping rules.  For example:

q = queryNew('id,name');
new Query(dbtype='query', sql='select * from q').execute().getResult();

This code will throw an exception saying "Table named q was not found in memory".  Why?  Because Query is a CFC, and inside a CFC method invocation, you can't see the var-scoped variables from the calling method (nor the variables scope of the calling template).  This is standard CFC behaviour, but is quite different than using the CFQUERY tag.  The workaround is to do this:

q = queryNew('id,name');
qry = new Query(dbtype='query', sql='select * from q');
qry.setAttributes(q=q);
qry.execute().getResult();

That places the 'q' variable into the variables scope of the Query instance, so it can be found by the CFQUERY tag inside the execute method.  Note that this is NOT thread safe!!  If you take this approach, you must create a new Query instance for every thread of execution – you can't pool them.  Normally this isn't a big deal, but that setAttributes call is equivalent to assigning to the variables scope, and it carries all the lurking problems that entails.

If you like chaining, you can do this (which is functionally equivalent, including the thread safety concerns):

q = queryNew('id,name');
new Query(dbtype='query', sql='select * from q', q=q).execute().getResult();

Here we're just passing the attribute in via the init method, instead of a separate setAttributes call.  Like I said, it's no different in terms of functionality, but it's a lot nicer (I think) in terms of readability and conciseness.
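If you want to bake that create-per-call discipline into your codebase, one option is to hide the Query object behind a small helper so a brand new instance gets built on every invocation.  This is just a sketch of mine, not anything built into CF, and it assumes callers agree to write their SQL against a made-up table name of 'src':

function queryOfQueries(source, sql) {
  // a fresh Query instance on every call, so nothing is shared between threads;
  // the source query is passed via the init method and exposed to the SQL
  // under the (arbitrary) name 'src'
  return new Query(dbtype='query', sql=arguments.sql, src=arguments.source).execute().getResult();
}

A caller would then write something like queryOfQueries(q, 'select * from src') and never touch a Query instance directly, which makes it much harder to accidentally share one across requests.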

In my opinion, this is sort of broken.  ColdFusion is all about convenient, server-managed hacks that trade away strict encapsulation.  Think about your use of CFQUERY: you just use it, and CF takes care of getting the right database handles and such for you.  It just works.  It's totally unencapsulated, but it just works.  And it's glorious.  The Query CFC has the same characteristics, except for QofQ.  Not that I expect it to change, but I'm of the opinion that the Query object ought to work like CFQUERY in this regard.

Migration Complete!

This morning I cut barneyb.com and all its associated properties over from my old CentOS 5 box at cari.net to a new Amazon Linux "box" in Amazon Web Services' us-east-1 region.  Migration was pretty painless.  I followed the "replace hardware with cloud resources" approach that I advocate and have spoken about in various places.  The process looks like this:

  1. launch a virgin EC2 instance (I used the console and based it on ami-7f418316).
  2. create a data volume and attach it to the instance.
  3. allocate an Elastic IP and associate it with the instance.
  4. set up an A record for the Elastic IP.
  5. build a setup script which will configure the instance as needed.  I feel it's important to use a script for this so that if your instance dies for some reason you can create a new one without too much fuss.  It's not strictly necessary, but part of the cloud mantra is "don't repair, replace" because new resources are so inexpensive.  Don't forget to store it on your volume, not the root drive or an ephemeral store.  Here's one useful snippet for modifying /etc/sudoers that took me a little digging to figure out:
    bash -c "chmod 660 /etc/sudoers;sed -i -e 's/^\# \(%wheel.*NOPASSWD.*\)/\1/' /etc/sudoers;chmod 440 /etc/sudoers"
  6. rsync all the various data files from the current server to the new one (everything goes on the volume; symlink – via your setup script – where necessary).  Again, use a script.
  7. once you're happy that your scripts work, kill your instance,
  8. launch a new virgin EC2 instance,
  9. attach your data volume,
  10. associate your Elastic IP,
  11. run your setup script,
  12. if anything didn't turn out the way you wanted, fix it, and go back to step 8.
  13. shut down all the state-mutating daemons on the old box.
  14. shut down all the daemons on the new instance.
  15. set up a downtime message in Apache on the old box.  I used these directives:
    RewriteEngine  On
    RewriteRule    ^/.+/.*    /index.html    [R]
    DocumentRoot   /var/www/downtime
  16. run the rsync script.
  17. turn on all the daemons on your new instance.
  18. add /etc/hosts records to the old box and update DNS with the Elastic IP.
  19. change Apache on the old box to proxy to the new instance (so people will get the new site without having to wait for DNS to flush).
    ProxyPreserveHost   On
    ProxyPass           /   http://www.barneyb.com/
    ProxyPassReverse    /   http://www.barneyb.com/

    These directives are why you need the rules in /etc/hosts, otherwise you'll be in an endless proxy loop.  You'll need to tweak them slightly for your SSL vhost.  The ProxyPreserveHost directive is important so that the new instance still gets the original Host header, allowing it to serve from the proper virtual host.  This lets you proxy all your traffic with a single directive and still have it split by host on the new box.

The net result was a nearly painless transition.  There was a bit of downtime during the rsync copy (I had to sync about 4GB of data), but only a few minutes.  Once the new box was populated and ready to go, the proxy rules allowed everyone to keep using the sites, even before DNS was fully propagated.  Now, a few hours later, the only traffic still going to my old box is from Baiduspider/2.0; +http://www.baidu.com/search/spider.html, whatever that is.  Hopefully it'll update its DNS cache like a well-behaved spider should, but it's clearly not honoring my TTLs.  Hmph.

Steps 1-12 (the setup) took me about 4 hours to do for my box.  Just for reference, I host a couple Magnolia-backed sites, about 10 WordPress sites (including this one), a WordPressMU site, and a whole pile of CFML apps (all running within a single Railo).  I also host MySQL on the same box which everything uses for storage.  Steps 13-19 took about an hour, most of that being waiting for the rsync and then running through all the DNS changes (about 20 domains with between 1 and 10 records each).

And now I have extra RAM.  Which is a good thing.  I'm sure a few little bits and pieces will turn up broken over the next few days, but I'm quite happy with both the process and the result.

JIRA Subtask Manager

If you use subtasks in JIRA, you've likely had issues with trying to manage them.  It's very difficult to get a comprehensive view of subtasks for a ticket all in one screen.  Using a little tweak to the JIRA configuration and a small Greasemonkey script, I think I've made the process much easier.

First, you need to enable the 'description' field in the subtask list.  Open up your '/WEB-INF/classes/jira-application.properties' file and find the 'jira.table.cols.subtasks' property.  Add 'description' to it (I put it after 'summary').  Similarly, find the 'jira.subtask.quickcreateform.fields' property and add 'description' to it also.  Then save and restart JIRA.

Second, load up this userscript (make sure to fix the server to match your JIRA installation) into Greasemonkey.  It does a few things, but primarily it moves the description into its own row and changes the summary to link to the edit form instead of the browse page.

Also note that if you want to reorder the subtasks more efficiently, you can very easily tweak the URLs for the up/down links to move more than one step at a time.  If you copy the URL of one of the links you'll see two numbers in it.  The first number is the current position of the issue and the second is the desired position.  They'll always be one apart, but you can change the second number to any valid position to jump the subtask to that position in the list.  Hardly an ideal interface, but it's faster than moving one step at a time if you have to move more than a couple spots.  Eventually I'll probably improve that in the userscript to some degree, but I'm not exactly sure how I want it to work, so I'm leaving it as-is for now.

Update 2011-09-16: I've made a couple additional tweaks to the script and removed the inline version.  Completed subtasks are now greyed out and descriptions hidden so they're not as intrusive.

Syncing Files on the Cloud

A couple weeks ago, I gave my "Replace Your Iron With A Cloud" talk from cf.objective to the CFMeetUp.  If you didn't catch it, you can view the recording on that page.  Several people both during and after the presentation had questions about syncing files between servers.  This doesn't really have anything to do with hosting in the cloud in particular, as you'll have the same problem with multiple physical servers, but the cost effectiveness of cloud computing is often the means by which smaller applications move from a single (physical) server to multiple (cloud) servers.

Due to those questions, I wanted to run through a few points on the syncing issue here, rather than continuing to answer individual emails/tweets/etc.  I'm going to talk about a hypothetical blogging application, as blogs are simple and well understood, and at least for this discussion, illustrate all the interesting bits and pieces.

Before we get started, if you're not using version control, stop reading, go set it up (either yourself or through one of the myriad free/cheap hosting providers), and check your code in. I like Subversion. Git seems to be the current hotness. Stay away from CVS and Visual SourceSafe. Then you can come back and read the rest of this.

First and foremost, every piece of data has a single canonical home.  That is, without a doubt, the most important thing to keep in mind.  So what is the canonical home for these data?  It depends on the type, and as I see it, there are four main types (each with its canonical home):

  1. server configuration: version control (unless you're Windows/IIS/etc., then you have to be more creative)
  2. source code and assets: version control
  3. user data (e.g., blog posts): your database cluster (you should think of it as a cluster, even if there is only one node)
  4. user assets (e.g., uploaded photos): in the database or on a dedicated hosting service (your own, or third party)

Most of those places aren't very useful for running your application.  So you invariably copy your data around, caching it in various locations for various purposes.  What you have to remember is that every copy of any piece of data outside the canonical home is exactly that: a copy.  If you change it, it doesn't mean anything until the change is propagated back to the canonical home.  Think about your source code.  You're happily hacking away on your local development machine and build some awesome new feature.  That's well and good, but it doesn't "count" until it's been checked into version control.  Your local box is a copy of the data, not the real data.  That same mindset applies for all data, regardless of type.

Let's say our blogging app runs on three application servers atop a two-node database cluster.  We'll assume the cluster is already doing its job of keeping data synced between the nodes, handling failover if there is a problem, etc.  The first thing is server config.  How do we ensure all three application servers are configured identically?  We need to push a copy of the configuration from the canonical home to each server.  Then we know they're identical.  When we need to update configuration, we have to do the same thing.  You make and test the changes somewhere (hopefully on a test site, but perhaps on one of the production boxes), push those modifications back to the canonical home, and then update each server's copy of the data.  Do NOT, under any circumstances, log into each box and manually make the same configuration change on each one.  It'll work, as long as you're perfect, and nobody is perfect.

Now that we know our application servers are all configured the same, the next task is to ensure the code is the same across all of them.  You could (and might want to) consider your code to be part of your server configuration, but for right now let's assume you have servers which will be configured once and then run multiple sequential versions of the application.  The best way to get your code from version control to your servers is via a single button click.  This isn't necessarily practical for all situations, but it is an ideal.  The approach I typically take is two button clicks: one to build a release from the current development branch and put it into a special area of the version control repository, and the second to take a release and deploy it to one or more of the production servers.  Note that the "build" process might be nothing more than copying files if you have a simple app, but more likely will at least involve aggregating and compressing JS/CSS resources and possibly compiling code.

As for actually getting the code to your servers, my tool of choice is rsync.  It does differential pushes, so if you have a 400MB app (like we do at Mentor) you can still push out releases quickly since most of that won't change every time.  Rsync runs on all major platforms, and doesn't have issues crossing platform boundaries.  It's there by default on pretty much all modern *nixes, and it's simple to set up on Windows with one of various prebundled packages.

Alright, now we can be confident that our app servers are all synchronized for the developer data portions of our full data set.  That's the easy part.  The user data is a bit trickier.  For the blog posts, comments, user accounts, etc. the solution is simple.  We have a single database cluster that all of our app servers talk to in realtime for reads and writes.  The database cluster will take care of consistency and availability through whatever magic is needed and supplied by the database vendor.  My typical configuration, just for reference, is two or more MySQL servers running circular (master-master) replication.  Whatever your platform is, you can pretty much get the equivalent.

If you have huge data volume and/or large scaling capacity, this approach probably isn't ideal.  You'll need to look at a distributed database of some sort, either sharding a relational database, or use a more purpose-specific database (e.g., Apache Cassandra) which is designed around a distributed store.  I'm not talking about these use cases.  If you have this kind of problem, hire someone to help solve it.  At that scale there is no such thing as a general purpose solution.  General tools, yes, but the specific use of them is bound tightly to your application's needs.

Now the worst of the four: user assets.  Why are these bad?  Because they're typically accessed directly.  For example, if a user uploads a photo to display in their blog post, the web browser wants to request that image after it gets the page with the post in it.  Which means with our three app servers, when we put the post into the database with the image reference, we need to get the image itself onto the other two application servers.  This is the wrong approach.  Remember the canonical home?  That can't be three places, it has to be one.

The most obvious approach might be to just stick the image into a BLOB field in the database.  That certainly works, it's tried and true, and it doesn't require any new infrastructure.  However, it means that every request for an image has to go through your application code to make a SQL query to get the bits and then stream them back to the browser.  That can be a big performance issue, especially if you have large files, as while the bits are streaming out, you're consuming both a database connection and an app server thread.  If it takes the browser 20 seconds to pull down a huge image and you're burning a thread and a connection for that entire duration, you're going to run into scalability issues very quickly.  However, if you have low concurrency, this is a perfectly viable approach.

A much better solution, especially for assets which need to be served back to clients directly, is to set up a static file hosting "cluster" analogous to the database cluster.  That means one or more nodes which will internally take care of data synchronization amongst them.  This may seem like we're back to the same "sync the file across the app servers" problem, but it's not.  The difference is that the application now has a single place to stick something (the cluster) so it doesn't care about replicating it for durability/scalability.  We can separate that concern out of the application and into a dedicated system.  In effect, this is just another database system, except it's managing filesystem assets instead of tables of data, but we don't care too much. It's the separation that is important.

This sort of model also allows multiple applications to interface with the same assets far more easily than if the assets are stored within an application's database.  Say this blogging application is really successful and you want to spin off a photo sharing site.  It'll be a separate application, but you want to let users easily leverage their blog photos on the photo sharing side, and be able to easily insert photos from their galleries into their blog posts.  By having the images stored outside of the blogging application, it becomes far easier to reuse them.

And this brings us to my favorite of the cloud services from Amazon: Simple Storage Service or S3.  S3 is exactly what I've outlined in the two previous paragraphs.  It's incredibly inexpensive, ridiculously scalable, and has interface libraries available for pretty much every major language you could want.  It's simple to use, supports very fine grained security (should you need to have private or semi-private assets), and just generally solves a whole bunch of nasty problems.

So to wrap this all up, I'm going to run through the flows:

First, when we want a new server, we configure it from our repository.  If we need to change the configuration, we change it in the repository and push it out to all the servers.

Next, when we want to change the application, we make our mods and commit them to the repository, and then use an automated process (which, I might add, should be configured from version controlled configuration) to build a release and push releases out to our application servers.

When a user is using our application, we'll take any data they give us and place it in a single location for use by all application servers.  In the case of simple data (numbers, strings, etc.) it'll go into some kind of database.  For binary data (images, zips, etc.) it'll go into an asset store.  In either case, all accesses and modifications to the data by the application are directed at a single repository, so there is no need to do synchronization.

Oof.  That's a lot.

The last thing is to consider caching.  For this particular example, let's assume we're going to stick our blog uploads into the database.  We don't have the resources/ability/time to adopt something like S3, and the limitations on using the database for BLOB storage are acceptable.  So how can we optimize?  Well, the easiest thing is to make BLOBs immutable (so if you want to upload a new version of something, you'll get a new BLOB rather than update the existing BLOB, and the application will track lineage), and then you can cache them wherever you want without having to worry about change.  For example, you have your images at URLs of the style /files/?id=123.  When a request comes in, the application server looks on its filesystem for the image to stream back.  If it's not there, it'll query the database to get it, and then both write it to the filesystem and stream it out.  On the next request, the file will already be there, so the database won't be queried.

We can do better than that, though, because we really don't want the application server involved on those cached requests.  By using something like Apache HTTPD's mod_rewrite, we can have URLs like /files/123.jpg.  When a request comes in and the file exists, Apache will stream it back without involving our application at all. If the file doesn't exist, Apache will rewrite it to something like /files/?id=123 which will do the same thing as before: query from the database, save to the filesystem for future requests, and stream the file back to the browser.
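To make that concrete, here's a minimal CFML sketch of what that fallback handler might look like.  The specifics are all assumptions for illustration: a datasource named 'blog', an 'uploads' table holding the image BLOBs, JPEG-only uploads, and a web-accessible /files directory that Apache checks before rewriting the request to this template:

<!--- /files/index.cfm: only reached when /files/123.jpg doesn't exist yet,
      because Apache serves the file directly once it's on disk --->
<cfparam name="url.id" type="integer" />
<cfset cachePath = expandPath("/files/#url.id#.jpg") />
<cfif NOT fileExists(cachePath)>
  <!--- cache miss: pull the immutable BLOB out of the database --->
  <cfquery name="img" datasource="blog">
    select bits
    from uploads
    where id = <cfqueryparam value="#url.id#" cfsqltype="cf_sql_integer" />
  </cfquery>
  <!--- write it where Apache can serve it on every future request --->
  <cfset fileWrite(cachePath, img.bits) />
</cfif>
<!--- stream the file back to the browser --->
<cfcontent type="image/jpeg" file="#cachePath#" />

The first request for each image warms the cache as a side effect; every request after that is handled by Apache alone.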

Why did I stress immutability before?  Because this mechanism, while very simple, has no way of detecting if a locally cached file has been updated in the database.  There are various ways to get around this issue, and I'm not going to go into details.  What's important is that user data, just like your source code, doesn't have to be used directly from its canonical home – it can absolutely be copied around for whatever reason (usually performance) as long as you still have a single canonical home for it.

This has now morphed into an incredibly long post.  I probably ought to have broken it up into several pieces.  Perhaps next time.  :)

Boggle Boards

In case anyone wants to know, here are the specs for Boggle – both for Original Boggle and for Big Boggle – in a handy machine-readable format. The format is line oriented with each line representing a single die, and the sides of the dice delimited by spaces. Note that there is a side with 'Qu' on it, so you must allow for a multi-letter side in your parser.

original_boggle.txt

A A C I O T
A B I L T Y
A B J M O Qu
A C D E M P
A C E L R S
A D E N V Z
A H M O R S
B F I O R X
D E N O S W
D K N O T U
E E F H I Y
E G I N T V
E G K L U Y
E H I N P S
E L P S T U
G I L R U W
big_boggle.txt

A A A F R S
A A E E E E
A A F I R S
A D E N N N
A E E E E M
A E E G M U
A E G M N N
A F I R S Y
B J K Qu X Z
C C E N S T
C E I I L T
C E I L P T
C E I P S T
D D H N O T
D H H L O R
D H L N O R
D H L N O R
E I I I T T
E M O T T T
E N S S S U
F I P R S Y
G O R R V W
I P R R R Y
N O O T U W
O O O T T U

Usage

Here is a simple CFML function which accepts a board definition and returns an array representing a "roll" of the grid:

<cffunction name="roll" output="false" returntype="array">

   <cfargument name="board" type="string" required="true" />
   <cfset var result = [] />
   <cfloop list="#board#" index="die" delimiters="#chr(10)#">

      <cfset die = listToArray(die, ' ') />
      <cfset arrayAppend(result, die[randRange(1, arrayLen(die))]) />
   </cfloop>
   <cfset createObject('java', 'java.util.Collections').shuffle(result) />
   <cfreturn result />
</cffunction>
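For example, assuming the original board above is saved as original_boggle.txt next to the template (with Unix line endings, so the chr(10) delimiter matches), rolling it is just:

<cfset board = fileRead(expandPath('original_boggle.txt')) />
<cfdump var="#roll(board)#" />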

And the result of executing it on the original board:

[N, T, W, A, B, Qu, B, D, L, M, D, H, D, L, G, V]

Here's a Groovy Closure which does the same thing:

{
   it = it.tokenize('\n')
      .collect{ it.tokenize(' ') }
      .collect{ it[new Random().nextInt(it.size)] }
   Collections.shuffle(it) // icky!
   it
}

And the result of executing it on the big board:

[N, E, G, N, N, C, S, A, F, E, U, N, H, I, P, H, C, T, T, O, I, O, Qu, T, T]

NB: The source for this post can be found at http://www.barneyb.com/boggle/.  Any updates I may have will go there.

Visiting Recursion

Recursion is a very powerful technique, but when processing certain types of data structures you can end up with problems.  And not the "compiler error" kind of problems, the bad semantic kind.  Fortunately, at least one class of them is pretty easy to solve with visit tracking, though it's a subtle solution.  But I'm going to build up to that from bare-bones recursion first.

As you know, a recursive algorithm is one which splits the problem into two parts: the base case and the recursive case.  The base case is a fixed solution to a simple form of the problem the algorithm is designed to solve.  The recursive case, on the other hand, is a dynamic solution which is defined as "a little work plus the solution to a simpler form of the same problem".  That simpler form of the problem is solved by reinvoking the recursive algorithm, and eventually reaches the base case (which breaks the cycle).  Consider computing factorials, which is the de facto standard for discussing recursion:

function factorial(n) {
  if (n < 0)
    throw 'NegativeArgumentException'
  return n == 0 || n == 1 ? 1 : n * factorial(n - 1)
}

Here the conditional is used to pick between the base case and the recursive case.  As you can see, the recursive case is defined by a small bit of work (multiplying by 'n') and then solving a simple form of the problem (the factorial of n – 1).  The most important thing to note is that the recursive case is guaranteed to eventually recurse down to the base case (or raise an exception).  That means infinite recursion is impossible, which is good.  Now onward to the interesting stuff…

Consider this data structure:

barney = {
  "name": "Barney",
  "sex": "male",
  "dob": "1980-06-10",
  "children": [
    {
      "name": "Lindsay",
      "sex": "female",
      "dob": "2004-01-09",
      "children": []
    },
    {
      "name": "Emery",
      "sex": "male",
      "dob": "2005-08-12",
      "children": []
    }
  ]
}

This is a hierarchical data structure (i.e., a tree), where there is a single root node (the structure named "Barney") and then some number of child nodes.  We can see here that there are only two levels ("Barney" and "Barney's children"), but you can imagine here in 40 years that there will very likely be at least one more level, probably two more.  The point is that we don't know how deep the structure is, but we do know that it isn't infinitely deep.  It has to end somewhere.

How could I find the average age of everyone?  Well, I'd use recursion to do it, but if you think about what we need to do, you'll quickly see there's a problem.  Computing an average has to be done after all the aggregation is complete – you can't average subaverages and have it come out right unless you also weight the subaverages.  So we'll need to either sum the total age and divide by the number of people or compute subaverages and track the number of people "within" them.  I'm going to take the first approach.  Here are my functions:

function averageAgeInSeconds(family) {
  var result = {
    totalAge: 0,
    count: 0
  }
  averageAgeParts(family, result)
  return Math.round(result.totalAge / result.count)
}
function averageAgeParts(family, result) {
  for (var i = 0, l = family.length; i < l; i++) {
    result.totalAge += getAgeInSeconds(family[i])
    result.count += 1
    averageAgeParts(family[i].children, result)
  }
}
function getAgeInSeconds(person) {
  return Math.floor((new Date().valueOf() - Date.parse(person.dob)) / 1000)
}

As you can see, I'm creating a 'result' structure for storing my aggregates, then using the 'averageAgeParts' helper function, which is where the recursion happens.  After the aggregates are done, I'm doing the division in the main function to get the average.  Note that I don't have an explicit base case anywhere.  The reason is that it's impossible for the data structure to be infinitely deep; I'm relying on that fact to act as my base case and let the recursion bottom out.  In more direct terms, at some point the loop inside averageAgeParts will be traversing an empty array (e.g., Lindsay's children), which means the recursive call will not be invoked (the base case).

Just for reference, the average age is about 14.7 years.

The important thing to note here is that my main function isn't directly recursive.  Instead it delegates to a recursive helper method and supplies some additional context (the 'result' variable) for it to operate on while walking the tree.  This extra stuff is critical for solving a lot of problems with recursion.  I'm not going to show the implementation, but consider how you'd change this so you could request the average age of a family, but constrain it to only females (be careful to ensure you count female descendants of males).  How about if you wanted to allow counting a certain number of generations (regardless of the total tree depth)?

Now on to the whole point: visitation.  The data structure I've shown to this point is a tree, as we discussed.  But it doesn't represent the real world: a child has two parents, not one.  That's no longer hierarchical, so we can't stick with a tree, we need to generalize into a graph.  (Just to be clear, trees are graphs, but with the extra constraint of hierarchy.)  So how might our structure look now?  First of all, we can't represent it with a literal; we'll need to define a tree structure first and then add some extra linkages to "break" the tree nature and revert it to just a graph.  Using the 'barney' object from above, here's how our graph might look.

boisverts = [
  barney,
  {
    "name": "Heather",
    "sex": "female",
    "dob": "1980-02-12",
    "children": Array.concat(barney.children)
  }
]

The last line creates a 'children' array for Heather which contains the same contents as Barney's 'children' array (but is a separate array).  Now we can access Emery as either boisverts[0].children[1] or as boisverts[1].children[1] and it's the same object.  That's important: it's what makes this a graph instead of a tree.  So now what happens when we run our averageAgeInSeconds function on 'boisverts'?  It'll run just dandy, but while the correct answer is 18.9 years, the result will be 14.8 years.  The reason is that it'll count Emery and Lindsay twice (once as Barney's children and again as Heather's children).

What we need is some way to keep track of what we've already processed (visited) so we can avoid processing stuff multiple times, and we can do that with a new subkey in our existing 'result' structure:

function averageAgeInSeconds(family) {
  var result = {
    totalAge: 0,
    count: 0,
    visited: []
  }
  averageAgeParts(family, result)
  return Math.round(result.totalAge / result.count)
}
function averageAgeParts(family, result) {
  for (var i = 0, l = family.length; i < l; i++) {
    if (visit(family[i], result)) {
      result.totalAge += getAgeInSeconds(family[i])
      result.count += 1
      averageAgeParts(family[i].children, result)
    }
  }
}
function visit(o, result) {
  for (var i = 0, l = result.visited.length; i < l; i++) {
    if (result.visited[i] === o) {
      return false
    }
  }
  result.visited.push(o)
  return true
}

What I've done is create a 'visit' function to keep track of which objects have been visited.  Now 'averageAgeParts' uses 'visit' to check if it should visit (process) an object.  I've implemented 'visit' with an array because I wanted illustrative clarity, not performance.  In the real world you'd use a hashtable instead of an array, giving you O(n) performance overall instead of O(n^2).

Now when we run this we'll get 18.9 years, as expected, because even though Emery and Lindsay will be iterated over twice, they'll only be processed the first time.  More specifically, the 'visit' function will return false on the second pass so the body of the loop will be skipped and their values won't be added to the aggregates the second time.

This sort of problem crops up all the time in computer science, and while I've shown it here with recursion, it's not necessarily tied to recursive algorithms.  One of the ones which is quite common, and fortunately "below the radar" for a lot of people, is garbage collection in modern virtual machines.  Another place is with serializing objects, either for permanent storage or transport across the wire.  The specific solutions are different, but they all end up using some sort of visit tracking on the graph nodes or edges.

NB: This post was prompted by a discussion on the Taffy users mailing list, but as it has myriad other implications, I've discussed the topic in a more general way.

NB: All code is JavaScript, and should run in any reasonable environment.  I did all my testing in Firefox 3.5, but I'd expect this to work back in Netscape 4.7 and IE 4.

http://groups.google.com/group/taffy-users