<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>BarneyBlog &#187; potd</title>
	<atom:link href="http://www.barneyb.com/barneyblog/category/potd/feed/" rel="self" type="application/rss+xml" />
	<link>https://www.barneyb.com/barneyblog</link>
	<description>Thoughts, rants, and even some code from the mind of Barney Boisvert.</description>
	<lastBuildDate>Mon, 02 Mar 2020 13:20:35 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Migration Complete!</title>
		<link>https://www.barneyb.com/barneyblog/2011/09/28/migration-complete/</link>
		<comments>https://www.barneyb.com/barneyblog/2011/09/28/migration-complete/#comments</comments>
		<pubDate>Wed, 28 Sep 2011 22:56:19 +0000</pubDate>
		<dc:creator>barneyb</dc:creator>
				<category><![CDATA[amazon]]></category>
		<category><![CDATA[meta]]></category>
		<category><![CDATA[personal]]></category>
		<category><![CDATA[potd]]></category>

		<guid isPermaLink="false">http://www.barneyb.com/barneyblog/?p=1738</guid>
		<description><![CDATA[This morning I cut barneyb.com and all its associated properties over from my old CentOS 5 box at cari.net to a new Amazon Linux "box" in Amazon Web Services' us-east-1 region. Migration was pretty painless. I followed the "replace hardware with cloud resources" approach that I advocate and have spoken on at various places. The [...]]]></description>
			<content:encoded><![CDATA[<p>This morning I cut <code>barneyb.com</code> and all its associated properties over from my old CentOS 5 box at cari.net to a new Amazon Linux "box" in Amazon Web Services' <code>us-east-1</code> region. Migration was pretty painless. I followed the "replace hardware with cloud resources" approach that I advocate and have spoken on at various places. The process looks like this:</p>
<ol>
<li>launch a virgin EC2 instance (I used the console and based it on <code>ami-7f418316</code>).</li>
<li>create a data volume and attach it to the instance.</li>
<li>allocate an Elastic IP and associate it with the instance.</li>
<li>set up an A record for the Elastic IP.</li>
<li>build a setup script which will configure the instance as needed. I feel it's important to use a script for this so that if your instance dies for some reason you can create a new one without too much fuss. It's not strictly necessary, but part of the cloud mantra is "don't repair, replace" because new resources are so inexpensive. Don't forget to store it on your volume, not the root drive or an ephemeral store. Here's one useful snippet for modifying /etc/sudoers that took me a little digging to figure out:
<pre>bash -c "chmod 660 /etc/sudoers;sed -i -e 's/^\# \(%wheel.*NOPASSWD.*\)/\1/' /etc/sudoers;chmod 440 /etc/sudoers"</pre>
</li>
<li>rsync all the various data files from the current server to the new one (everything goes on the volume; symlink &#8211; via your setup script &#8211; where necessary). Again, use a script.</li>
<li>once you're happy that your scripts work, kill your instance.</li>
<li>launch a new virgin EC2 instance.</li>
<li>attach your data volume.</li>
<li>associate your Elastic IP.</li>
<li>run your setup script.</li>
<li>if anything didn't turn out the way you wanted, fix it, and go back to step 8.</li>
<li>shut down all the state-mutating daemons on the old box.</li>
<li>shut down all the daemons on the new instance.</li>
<li>set up a downtime message in Apache on the old box. I used these directives:
<pre>RewriteEngine  On
RewriteRule    ^/.+/.*    /index.html    [R]
DocumentRoot   /var/www/downtime</pre>
</li>
<li>run the rsync script.</li>
<li>turn on all the daemons on your new instance.</li>
<li>add <code>/etc/hosts</code> records to the old box and update DNS with the Elastic IP.</li>
<li>change Apache on the old box to proxy to the new instance (so people will get the new site without having to wait for DNS to flush).
<pre>ProxyPreserveHost   On
ProxyPass           /   http://www.barneyb.com/
ProxyPassReverse    /   http://www.barneyb.com/</pre>
<p>These directives are why you need the rules in <code>/etc/hosts</code>; otherwise you'll be in an endless proxy loop. You'll need to tweak them slightly for your SSL vhost. The ProxyPreserveHost directive is important so that the new instance still gets the original Host header, allowing it to serve from the proper virtual host. This lets you proxy all your traffic with a single directive and still have it split by host on the new box.</p></li>
</ol>
<p>The net result was a nearly painless transition. There was a bit of downtime during the rsync copy (I had to sync about 4GB of data), but only a few minutes. Once the new box was populated and ready to go, the proxy rules allowed everyone to keep using the sites, even before DNS was fully propagated. Now, a few hours later, the only traffic still going to my old box is from <code>Baiduspider/2.0; +http://www.baidu.com/search/spider.html</code>, whatever that is. Hopefully it'll update its DNS cache like a well-behaved spider should, though it's clearly not honoring my TTLs. Hmph.</p>
<p>Steps 1-12 (the setup) took me about 4 hours to do for my box. Just for reference, I host a couple Magnolia-backed sites, about 10 WordPress sites (including this one), a WordPressMU site, and a whole pile of CFML apps (all running within a single Railo). I also host MySQL on the same box which everything uses for storage. Steps 13-19 took about an hour, most of that being waiting for the rsync and then running through all the DNS changes (about 20 domains with between 1 and 10 records each).</p>
<p>And now I have extra RAM. Which is a good thing. I'm sure a few little bits and pieces will turn up broken over the next few days, but I'm quite happy with both the process and the result.</p>
]]></content:encoded>
			<wfw:commentRss>https://www.barneyb.com/barneyblog/2011/09/28/migration-complete/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Pic of the Day Minicards</title>
		<link>https://www.barneyb.com/barneyblog/2010/03/19/pic-of-the-day-minicards/</link>
		<comments>https://www.barneyb.com/barneyblog/2010/03/19/pic-of-the-day-minicards/#comments</comments>
		<pubDate>Sat, 20 Mar 2010 04:18:00 +0000</pubDate>
		<dc:creator>barneyb</dc:creator>
				<category><![CDATA[potd]]></category>

		<guid isPermaLink="false">http://www.barneyb.com/barneyblog/?p=1391</guid>
		<description><![CDATA[I always get lots of questions about my Pic of the Day minicards, so I made a page on the PotD site with info about them both in general and the individual runs. As always, NSFW.
The minicards are the only actual "marketing" I do for PotD aside from occasionally link to it so it can [...]]]></description>
			<content:encoded><![CDATA[<p>I always get lots of questions about my Pic of the Day minicards, so I made a page on the PotD site with <a href="https://ssl.barneyb.com/srank/splash/cards.html">info about them</a> both in general and the individual runs. As always, NSFW.</p>
<p>The minicards are the only actual "marketing" I do for PotD aside from occasionally linking to it so it can get indexed by search engines. And it hardly counts as marketing, though after distributing some I do typically see an uptick in subscriptions. The real objective is to provide a slick and tangible item to start conversation, and the cards shine at that.</p>
<p>What's been very interesting to me is that no one hands them back. You'd think that if you handed someone a piece of cardstock with a naked picture on it, some people would refuse. But that's not been my experience at all. Some people have an obvious aversion to the concept of pornography (though whether it's a facade or not is a different question), but naked or not, the cards always prompt a "what is Pic of the Day?", not an "I don't want this." And not one person has ever terminated the conversation because of the nudity.</p>
<p>Yes, my sample is not representative of the general population. My friends and associates are certainly a younger and more liberal (or at least open-eyed) segment. And I don't mean "liberal" as in political bent, but rather in a broader sense. Despite this, it's still an interesting outcome. The project &#8211; which is coming up on its sixth birthday &#8211; has proven a nearly unending source of interesting bits and pieces.</p>
]]></content:encoded>
			<wfw:commentRss>https://www.barneyb.com/barneyblog/2010/03/19/pic-of-the-day-minicards/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Moving Pic of the Day Foiled Again</title>
		<link>https://www.barneyb.com/barneyblog/2010/03/10/moving-pic-of-the-day-foiled-again/</link>
		<comments>https://www.barneyb.com/barneyblog/2010/03/10/moving-pic-of-the-day-foiled-again/#comments</comments>
		<pubDate>Thu, 11 Mar 2010 06:03:10 +0000</pubDate>
		<dc:creator>barneyb</dc:creator>
				<category><![CDATA[coldfusion]]></category>
		<category><![CDATA[meta]]></category>
		<category><![CDATA[potd]]></category>
		<category><![CDATA[railo]]></category>

		<guid isPermaLink="false">http://www.barneyb.com/barneyblog/?p=1365</guid>
		<description><![CDATA[A while back I made an attempt to move Pic of the Day (NSFW) off of ColdFusion 8 and onto Railo 3. I can't afford a license of CF9, so my only upgrade path is through a free alternative. Unless someone has an extra four grand they want to give me&#8230;.
Last time I was foiled [...]]]></description>
			<content:encoded><![CDATA[<p>A while back I made an attempt to move <a href="http://potd.barneyb.com/how.html">Pic of the Day</a> (NSFW) off of <a href="http://www.adobe.com/go/coldfusion">ColdFusion</a> 8 and onto <a href="http://www.getrailo.org/">Railo</a> 3. I can't afford a license of CF9, so my only upgrade path is through a free alternative. Unless someone has an extra four grand they want to give me&#8230;.</p>
<p>Last time I was foiled by CFHTTP adding a spurious Content-Type header on GET requests, which breaks secure downloads from S3 (which is where I host all the content). I reported the bug and it got fixed, but I hadn't had time to revisit the migration process, so there it sat. Until this evening, that is.</p>
<p>I'm glad to say that the issue with GET requests has been completely resolved. The bleeding edge is also a lot smoother than the last time I pulled down a new version, so props to those guys. Setting up a migration test environment actually proved pretty straightforward, even with all the crazy Apache and OS integration PotD leverages.</p>
<p>As expected, there were errors on the first page load, but nothing some trickery with mappings and rewriting a couple of query-of-queries couldn't fix. After that, everything just worked. Thumbnail generation, S3 access, emailing, everything. Except that it wasn't everything. Turns out that exactly the same problem I had with GET requests before has now manifested itself with DELETE requests. So I'm again stuck.</p>
<p>The way PotD is implemented, images are spidered and pushed immediately onto S3. Then they go through the filter pipeline, and many (most?) of them are deleted. So being able to remove stuff from S3 is a pretty core feature; otherwise I'd have piles and piles of orphaned files up there, and that just costs me money for no reason. Sadly, this makes Railo a no-go again, and leaves me with CF8 for a while longer.</p>
<p>I've actually got a lot of stuff in the works surrounding my personal sites and projects, but the CF to Railo conversion is one of the larger ones, as well as the one with the largest potential impact on server resources (which I'm continually constrained by). The move from JRun to Tomcat was a huge help, but I could definitely use more, and Railo gives all appearances of being able to give it to me. I also have some major WordPress infrastructure changes, a whole rebranding of this (my blog), and a few other corollary improvements.</p>
<p>The overarching goal is to simplify my URL space so I don't have as much interleaving between separate applications. www.barneyb.com's URL space, for example, houses three different blogs, two static sites, and a pile of little CFML apps. ssl.barneyb.com houses SVN, Trac, PotD, and several other CFML apps. It's a mess, but that'll be a lot better, regardless of what happens with the CFML engine stuff.</p>
]]></content:encoded>
			<wfw:commentRss>https://www.barneyb.com/barneyblog/2010/03/10/moving-pic-of-the-day-foiled-again/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Scaling Averages By Count</title>
		<link>https://www.barneyb.com/barneyblog/2010/03/02/scaling-averages-by-count/</link>
		<comments>https://www.barneyb.com/barneyblog/2010/03/02/scaling-averages-by-count/#comments</comments>
		<pubDate>Wed, 03 Mar 2010 03:49:30 +0000</pubDate>
		<dc:creator>barneyb</dc:creator>
				<category><![CDATA[development]]></category>
		<category><![CDATA[potd]]></category>

		<guid isPermaLink="false">http://www.barneyb.com/barneyblog/?p=1290</guid>
		<description><![CDATA[One of the problems with statistics is that they work really well when you have perfect data (and therefore don't really need to do statistics), but start falling apart when the real world rears its ugly head and gives you data that isn't all smooth. Consider a very specific case: you have items that people [...]]]></description>
			<content:encoded><![CDATA[<p>One of the problems with statistics is that they work really well when you have perfect data (and therefore don't really need to do statistics), but start falling apart when the real world rears its ugly head and gives you data that isn't all smooth. Consider a very specific case: you have items that people can rate and then you want to pull out the "favorite" items based on those ratings. As a more concrete example, say you're Netflix and based on a person's movie ratings (from 1-5 stars), you want to identify their favorite actors (piggybacking on the assumption that movies they like probably have actors they like).</p>
<p>This is a simple answer to derive: just average the ratings of every movie the actor was in, and whichever actors have the highest average are the favorites. Here it is expressed in SQL:</p>
<pre>select actor.name, avg(rating.stars) as avgRating
from actor
  inner join movie_actor on movie_actor.actorId = actor.id
  inner join movie on movie_actor.movieId = movie.id
  inner join rating on movie.id = rating.movieId
where rating.subscriberId = ? -- the ID of the subscriber whose favorite actors you want
group by actor.name
order by avgRating desc
</pre>
<p>The problem is that &#8211; as an example &#8211; Tom Hanks was in both Sleepless in Seattle and Saving Private Ryan. Clearly those two movies appeal to different audiences, and it seems very reasonable that someone who saw both would like one far more than the other, regardless of whether or not they like Tom Hanks. The next problem is that if they've only seen one of those movies, the ratings are going to paint an unfair picture of Tom Hanks' appeal. So how can we solve this?</p>
<p>The short answer is that we can't. In order to solve it, we'd have to synthesize the missing data points, which isn't possible for obvious reasons. However, we can make a guess based on other datapoints that we do have. In particular, we know the average rating across all of a user's rated movies, so we can bias "small" actor samples towards that overall average. This will help blunt the dramatic effect of outliers in small samples when there aren't enough other datapoints to balance them out.</p>
<p>In other words, instead of this: <img src='http://s.wordpress.com/latex.php?latex=%5Coverline%7Br%7D_%7Bactor%7D%5C%20%3D%5C%20avg%28rating_%7Bmovie_%7Bactor%7D%7D%29&#038;bg=T&#038;fg=000000&#038;s=1' alt='\overline{r}_{actor}\ =\ avg(rating_{movie_{actor}})' title='\overline{r}_{actor}\ =\ avg(rating_{movie_{actor}})' class='latex block' /></p>
<p>we can do something like this: <img src='http://s.wordpress.com/latex.php?latex=n%20%3D%20count%28rating_%7Bmovie_%7Bactor%7D%7D%29&#038;bg=T&#038;fg=000000&#038;s=1' alt='n = count(rating_{movie_{actor}})' title='n = count(rating_{movie_{actor}})' class='latex block' /> <img src='http://s.wordpress.com/latex.php?latex=%5Coverline%7Br%5E%5Cprime%7D_%7Bactor%7D%5C%20%3D%5C%20%5Cbar%7Br%7D_%7Bactor%7D%5C%20-%5C%20%5Cfrac%7B%28%5Cbar%7Br%7D_%7Bactor%7D%5C%20-%5C%20%5Cbar%7Br%7D_%7Boverall%7D%29%7D%7B1.15%5En%7D&#038;bg=T&#038;fg=000000&#038;s=2' alt='\overline{r^\prime}_{actor}\ =\ \bar{r}_{actor}\ -\ \frac{(\bar{r}_{actor}\ -\ \bar{r}_{overall})}{1.15^n}' title='\overline{r^\prime}_{actor}\ =\ \bar{r}_{actor}\ -\ \frac{(\bar{r}_{actor}\ -\ \bar{r}_{overall})}{1.15^n}' class='latex block' /></p>
<p>This simply takes the normal average from above and "scoots" it towards the overall average. The denominator is a constant picked by me (more later) raised to a power equal to the number of samples we have. This way, as the number of samples goes up, the magnitude of the correction falls rapidly. Here's a chart illustrating this (the x axis is a log scale):</p>
<p style="text-align: center;"><img class="aligncenter" title="Correction By Sample Count" src="http://chart.apis.google.com/chart?cht=lc&amp;chs=500x275&amp;chd=t:.8696,.7561,.5718,.3269,.1069,.0114,.000013&amp;chds=0,1&amp;chxt=x,y,x&amp;chxr=1,0,1,0.1&amp;chxl=0:|1|2|4|8|16|32|64|2:|samples&amp;chxp=2,50&amp;chtt=Correction+By+Sample+Count+(1.15+factor)&amp;chg=16.66,20,1,4" alt="" width="500" height="275" /></p>
<p>With only one sample, the per-actor average will be scooted 87% of the way towards the overall average. With four samples the correction will be only 57%, and by the time you get 32 samples there will be only a 1% shift. Note that those percentages are of the distance to the overall average, not any absolute value change. So if a one-sample actor happens to be only 0.5 stars away from the overall average, the net correction will be 0.435. However, if a different one-sample actor is 1.5 stars away from the overall average, the net correction will be 1.305.</p>
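<p>The formula above is small enough to sketch directly. Here it is in Java (my own illustrative sketch with the post's 1.15 factor; this is not code from PotD, and the names are mine):</p>

```java
// Sketch of the count-scaled average: shift a small-sample average
// toward the overall average, with the shift decaying as 1 / 1.15^n.
public class ScaledAverage {
    static final double FACTOR = 1.15; // the factor derived in the post

    // avg:     the actor/model's plain average rating
    // overall: the rater's average rating across everything
    // n:       number of ratings backing `avg`
    static double corrected(double avg, double overall, int n) {
        return avg - (avg - overall) / Math.pow(FACTOR, n);
    }

    public static void main(String[] args) {
        // one sample, 0.5 stars above overall: ~87% of the gap removed
        System.out.println(corrected(4.0, 3.5, 1));  // ~3.565
        // 32 samples: only ~1% of the gap removed
        System.out.println(corrected(4.0, 3.5, 32)); // ~3.994
    }
}
```

<p>With one sample the 0.5-star gap shrinks by about 87%, matching the chart; by 32 samples the correction is negligible.</p>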
<p>Of course, I'm not Netflix, so my data was from PotD, but the concept is identical. The "1.15" factor was derived from testing on the PotD dataset, and demonstrated an appropriate falloff as the sample size increased. Here's a sample of the data, showing both uncorrected and corrected average ratings, along with pre- and post-correction rankings:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Samples</th>
<th>Average</th>
<th>Corr. Average</th>
<th>Rank</th>
<th>Corr. Rank</th>
</tr>
</thead>
<tbody>
<tr>
<td>#566</td>
<td>22</td>
<td>4.1818</td>
<td>4.1310</td>
<td>46</td>
<td>1</td>
</tr>
<tr>
<td>#375</td>
<td>12</td>
<td>4.1667</td>
<td>3.9640</td>
<td>47</td>
<td>2</td>
</tr>
<tr>
<td>#404</td>
<td>13</td>
<td>4.0000</td>
<td>3.8509</td>
<td>81</td>
<td>3</td>
</tr>
<tr>
<td>#1044</td>
<td>7</td>
<td>4.2857</td>
<td>3.8334</td>
<td>44</td>
<td>4</td>
</tr>
<tr>
<td>#564</td>
<td>5</td>
<td>4.4000</td>
<td>3.7450</td>
<td>42</td>
<td>5</td>
</tr>
<tr>
<td>#33</td>
<td>32</td>
<td>3.7500</td>
<td>3.7424</td>
<td>176</td>
<td>6</td>
</tr>
<tr>
<td>#954</td>
<td>4</td>
<td>4.5000</td>
<td>3.6895</td>
<td>40</td>
<td>7</td>
</tr>
<tr>
<td>#733</td>
<td>4</td>
<td>4.5000</td>
<td>3.6895</td>
<td>39</td>
<td>8</td>
</tr>
<tr>
<td>#330</td>
<td>7</td>
<td>4.0000</td>
<td>3.6551</td>
<td>74</td>
<td>9</td>
</tr>
<tr>
<td>#293</td>
<td>5</td>
<td>4.2000</td>
<td>3.6444</td>
<td>45</td>
<td>10</td>
</tr>
</tbody>
</table>
<p>In particular, model #33 sees a huge jump upward because of the number of samples. You can't see it here, but the top 37 models using the simple average are all models with a single sample (a 5-star rating), which is obviously not a real indicator. Their corrected average is 3.3391, so not far off the leaderboard, but appreciably lower than those who have consistently received high ratings.</p>
<p>For different-sized sets (both overall, and in the expected number of ratings per actor/model) the factor will need to be adjusted. It must remain strictly greater than one, and is theoretically unbounded on the other end, though there is obviously a practical limit.</p>
<p>Is this a good correction? Hard to say. It seems to work reasonably well with my PotD dataset (both as a whole, and segmented various ways), and it makes reasonable logical sense too. The point really is that if you don't care about correctness, you can do some interesting fudging of your data to help it be useful in ways that it couldn't otherwise be.</p>
]]></content:encoded>
			<wfw:commentRss>https://www.barneyb.com/barneyblog/2010/03/02/scaling-averages-by-count/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Rebuilding Pic of the Day</title>
		<link>https://www.barneyb.com/barneyblog/2009/10/21/rebuilding-pic-of-the-day/</link>
		<comments>https://www.barneyb.com/barneyblog/2009/10/21/rebuilding-pic-of-the-day/#comments</comments>
		<pubDate>Wed, 21 Oct 2009 17:46:46 +0000</pubDate>
		<dc:creator>barneyb</dc:creator>
				<category><![CDATA[potd]]></category>

		<guid isPermaLink="false">http://www.barneyb.com/barneyblog/?p=1120</guid>
		<description><![CDATA[I need some help, thoughts, recommendations as I undertake this, but first some background&#8230;
As I do every 15-18 months, I've decided that it's time to rebuild Pic of the Day. I've never actually done it; the codebase is still the same one I started 5-6 years ago and have edited (often daily) since then. But [...]]]></description>
			<content:encoded><![CDATA[<p>I need some help, thoughts, recommendations as I undertake this, but first some background&#8230;</p>
<p>As I do every 15-18 months, I've decided that it's time to rebuild Pic of the Day. I've never actually done it; the codebase is still the same one I started 5-6 years ago and have edited (often daily) since then. But the amount of cruft is becoming more and more problematic, and while I could do a hard-core refactoring and trimming down of the app, I don't see a compelling benefit to doing it that way versus a ground-up rewrite, and I'm confident the latter will actually be quite a bit faster.</p>
<p>In the past I've created partial re-implementations with pure CFML, Spring/Hibernate, Grails, and CFML/Groovy hybrids. In every case, one of the objectives was a gradual migration, where the two versions either shared a database, or did incremental data copies from old to new, so the app could be ported in stages.</p>
<p>I've decided I really don't want to do that. Obviously I need to move data from old to new, but I'm happy with just doing the pic/recipient/rank tuples and the associated entities, and starting from scratch with the other bits (the spider state, the image pool, historical records, etc.).</p>
<p>My question for all of you is really about the technology stack. As I mentioned above, I've tried several. Time-to-market would be maximized with a CFML-centric solution, because that's what I have the best infrastructure and tooling for, but that's not a significant driver. PotD is a hobby; it's how I entertain myself for hours every night after the kids are in bed. I do have resource constraints on my server, particularly RAM, so that is a consideration, but other than that I'm pretty much open for anything.</p>
<p>If you were undertaking this project, what would you use and why? If you don't supply the why, I'm deleting your comment. :)</p>
]]></content:encoded>
			<wfw:commentRss>https://www.barneyb.com/barneyblog/2009/10/21/rebuilding-pic-of-the-day/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Edit Distances Bug</title>
		<link>https://www.barneyb.com/barneyblog/2009/09/25/edit-distances-bug/</link>
		<comments>https://www.barneyb.com/barneyblog/2009/09/25/edit-distances-bug/#comments</comments>
		<pubDate>Sat, 26 Sep 2009 05:56:22 +0000</pubDate>
		<dc:creator>barneyb</dc:creator>
				<category><![CDATA[groovy]]></category>
		<category><![CDATA[potd]]></category>

		<guid isPermaLink="false">http://www.barneyb.com/barneyblog/?p=1102</guid>
		<description><![CDATA[This evening I found a bug in one of the optimizations that I made to the edit distance function. I've corrected the code in the original post, and made a note of the change there as well. Just wanted to mention it in a second post so anyone who read via RSS will be aware [...]]]></description>
			<content:encoded><![CDATA[<p>This evening I found a bug in one of the optimizations that I made to the edit distance function. I've corrected the code in the <a href="http://www.barneyb.com/barneyblog/2009/09/24/edit-distances-and-spiders/">original post</a>, and made a note of the change there as well. Just wanted to mention it in a second post so anyone who read via RSS will be aware of it (since they won't necessarily go back and look at the original).</p>
]]></content:encoded>
			<wfw:commentRss>https://www.barneyb.com/barneyblog/2009/09/25/edit-distances-bug/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Edit Distances and Spiders</title>
		<link>https://www.barneyb.com/barneyblog/2009/09/24/edit-distances-and-spiders/</link>
		<comments>https://www.barneyb.com/barneyblog/2009/09/24/edit-distances-and-spiders/#comments</comments>
		<pubDate>Thu, 24 Sep 2009 07:26:23 +0000</pubDate>
		<dc:creator>barneyb</dc:creator>
				<category><![CDATA[groovy]]></category>
		<category><![CDATA[potd]]></category>

		<guid isPermaLink="false">http://www.barneyb.com/barneyblog/?p=1096</guid>
		<description><![CDATA[An edit or string distance is the "distance" between two strings in terms of editing operations. For example, to get from "cat" to "dog" requires three operations (replace 'c' with 'd', replace 'a' with 'o', and finally replace 't' with 'g'), thus the edit or string distance between "cat" and "dog" is three. Aside from [...]]]></description>
			<content:encoded><![CDATA[<p>An edit or string distance is the "distance" between two strings in terms of editing operations. For example, to get from "cat" to "dog" requires three operations (replace 'c' with 'd', replace 'a' with 'o', and finally replace 't' with 'g'), thus the edit or string distance between "cat" and "dog" is three. Aside from replace, there are also the insert and delete operations, so the distance between "cowbell" and "crowbar" is four (insert 'r', replace 'e' with 'a', replace 'l' with 'r', delete 'l'). This particular sort of edit distance is called the Levenshtein distance.</p>
<p>Here is an implementation of a function in Groovy that does the computation (based on the pseudocode at <a href="http://en.wikipedia.org/wiki/Levenshtein_distance">http://en.wikipedia.org/wiki/Levenshtein_distance</a>):</p>
<pre>def editDistance(s, t) {
  int m = s.length()
  int n = t.length()
  int[][] d = new int[m + 1][n + 1]
  for (i in 0..m) {
    d[i][0] = i
  }
  for (j in 0..n) {
    d[0][j] = j
  }
  for (j in 1..n) {
    for (i in 1..m) {
      d[i][j] = (
        s[i - 1] == t[j - 1]
        ? d[i - 1][j - 1] // same character
        : Math.min(
            Math.min(
              d[i - 1][j] + 1, // delete
              d[i][j - 1] + 1 // insert
            ),
            d[i - 1][j - 1] + 1 // substitute
          )
      )
    }
  }
  d[m][n]
}</pre>
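<p>If Groovy isn't your thing, the same table-filling algorithm transliterates directly to plain Java (my transliteration, not code from the post), and agrees with the distances worked out above:</p>

```java
// Plain-Java version of the Groovy function above: the classic
// Levenshtein dynamic-programming table.
public class Levenshtein {
    static int editDistance(String s, String t) {
        int m = s.length(), n = t.length();
        int[][] d = new int[m + 1][n + 1];
        for (int i = 0; i <= m; i++) d[i][0] = i; // i deletes to reach ""
        for (int j = 0; j <= n; j++) d[0][j] = j; // j inserts from ""
        for (int j = 1; j <= n; j++) {
            for (int i = 1; i <= m; i++) {
                d[i][j] = s.charAt(i - 1) == t.charAt(j - 1)
                    ? d[i - 1][j - 1]               // same character
                    : 1 + Math.min(d[i - 1][j - 1], // substitute
                          Math.min(d[i - 1][j],     // delete
                                   d[i][j - 1]));   // insert
            }
        }
        return d[m][n];
    }

    public static void main(String[] args) {
        System.out.println(editDistance("cat", "dog"));         // 3
        System.out.println(editDistance("cowbell", "crowbar")); // 4
    }
}
```
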
<p>That might not seem very useful, but consider the problem of grouping strings together. This works especially well for URLs, which are hierarchical in nature, and therefore typically differ in only small ways from other similar URLs (at least as far as the site in question's internal concept of organization is concerned). As a concrete example, you wouldn't expect "http://www.google.com/search" and "http://mail.google.com/a/barneyb.com" to be very similar pages, because their URLs are quite different. However, you'd probably expect "http://mail.google.com/a/barneyb.com" and "http://mail.google.com/a/example.com" to be similar.</p>
<p>This came up as part of my never-ending quest to optimize Pic of the Day's behaviour, specifically the spidering aspect. Consider a photo gallery page on some arbitrary site. The core of it is 10-20 links to pages that show full-size images, but that is surrounded by (and possibly interleaved with) navigation, links to related content, advertisements, etc. So the task is to get those core links and none of the other garbage. Also keep in mind that the algorithm has to work on arbitrary pages from arbitrary sites, so you can't rely on any sort of contextual markup.</p>
<p>Fortunately, this is a simple task with a string distance-based algorithm. Consider this list of URLs (the 'href' values for all 'a' tags on some gallery page, sorted alphabetically, and trimmed for illustrative purposes):</p>
<pre>http://www.hq69.com
http://www.hq69.com/cgi-bin/te/o.cgi?g=home
<span style="color: #008000;">http://www.hq69.com/galleries/andi_sins_body_paint/index.php</span>
<span style="color: #0000ff;">http://www.hq69.com/galleries/beatrix_secrets/beatrix_secrets_001.php
http://www.hq69.com/galleries/beatrix_secrets/beatrix_secrets_002.php
http://www.hq69.com/galleries/beatrix_secrets/beatrix_secrets_003.php
http://www.hq69.com/galleries/beatrix_secrets/beatrix_secrets_004.php</span>
<span style="color: #008000;">http://www.hq69.com/galleries/juliya_studio_nude/index.php
http://www.hq69.com/galleries/khloe_loves_ibiza/index.php
http://www.hq69.com/galleries/lola_spray/index.php</span></pre>
<p>You can quite easily see via visual scan that the URLs we want are lines 4-7 (in <span style="color: #0000ff;">blue</span>), and while you might not realize it, you compared the relative difference between all the URLs and decided those four were sufficiently similar to be considered the "target" URLs. The first part is an edit distance of some sort, and the latter part is based on some sort of relative threshold for acceptance. More subtle is that lines 3, 8, 9, and 10 (in <span style="color: #008000;">green</span>) are also quite similar, but not nearly as similar as the first group.</p>
<p>Of course, using the string distance function to arrive at this relationship isn't a direct path. One approach is to compare every string to every other, and then build a graph of similarity to deduce the clusters. Unfortunately, this is prohibitively expensive for even moderately sized sets. It's also really complicated to implement. ;)</p>
<p>Much easier is to build the clusters directly. Create an (empty) collection of clusters, and then loop over the URLs. For each URL, iterate through the clusters until you find one it is sufficiently similar to, and add it to that cluster. If you don't find a suitable cluster, create a new cluster with the URL as the sole member. Here's some code that does just that:</p>
<pre>clusters = []
threshold = 0.5
urls.each { url -&gt;
  for (c in clusters) {
    // compare against the cluster's first member
    if (1.0 * editDistance(url, c[0]) / Math.max(url.length(), c[0].length()) &lt;= threshold) {
      c.add(url)
      return // ends this iteration of the 'each' closure
    }
  }
  clusters.add([url])
}</pre>
<p>You'll end up with an array of arrays, with each inner array being a cluster of similar URLs matching the clusters outlined above. The 'threshold' variable determines how close the strings must be in order to be considered cluster members. In this case I'm using a threshold of 0.5, which means that the edit distance must be no more than half the max length of the two strings. I.e., at least half the characters have to match. This is a ridiculously loose threshold, but I needed it in order for the second cluster to materialize. In practice you'd want a threshold of 0.05 to 0.1 I'd say, though I haven't spent much time tuning.</p>
<p>This algorithm is reasonably fast and greatly reduces the number of distance computations required to build the clusters.  However, it's still pretty slow.  Fortunately, there are a few heavy-handed optimizations to make.</p>
<p><span style="color: #0000ff;">First</span> and simplest, URLs are typically most different at the right end (i.e. the protocol and domain are usually constant), and since an identical prefix doesn't change the distance, stripping it from both strings can greatly reduce the amount of computation required without impacting the accuracy.</p>
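<p>That first optimization is simple enough to sketch on its own; matching leading characters cost nothing, so only the differing tails need the O(m&times;n) work (plain Java, hypothetical URLs):</p>

```java
// Strip the shared prefix of two strings before computing edit distance.
// Identical leading characters contribute zero to the distance, so the
// DP only needs to run over the differing tails.
public class PrefixStrip {
    static String[] stripCommonPrefix(String s, String t) {
        int i = 0;
        int limit = Math.min(s.length(), t.length());
        while (i < limit && s.charAt(i) == t.charAt(i)) {
            i++;
        }
        return new String[] { s.substring(i), t.substring(i) };
    }

    public static void main(String[] args) {
        String[] tails = stripCommonPrefix(
            "http://example.com/a/1.jpg",
            "http://example.com/b/2.jpg");
        // only "a/1.jpg" vs "b/2.jpg" still needs the full computation
        System.out.println(tails[0] + " " + tails[1]);
    }
}
```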
<p><span style="color: #ff0000;">Second</span>, since the distance cannot be less than the difference in length between the two strings, we can check that lower bound against the threshold up front and avoid the distance computation entirely.</p>
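<p>The reasoning behind that bound: each character of extra length must cost at least one insert or delete, so |len(s) &#8722; len(t)| can never exceed the true distance.  A quick pre-check sketch (plain Java, illustrative names):</p>

```java
// Cheap rejection test: |len(s) - len(t)| is a lower bound on the edit
// distance, so if even that bound fails the normalized threshold, the
// full DP computation can be skipped entirely.
public class LengthBound {
    static boolean canPossiblyMatch(String s, String t, double threshold) {
        int maxLen = Math.max(s.length(), t.length());
        return Math.abs(s.length() - t.length()) <= threshold * maxLen;
    }

    public static void main(String[] args) {
        // 9 vs 21 chars: length gap of 12 already blows a 0.1 threshold
        System.out.println(canPossiblyMatch("short.jpg", "a-much-longer-url.jpg", 0.1)); // false
        // equal lengths always pass this check; the DP still decides
        System.out.println(canPossiblyMatch("1.jpg", "2.jpg", 0.1)); // true
    }
}
```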
<p><span style="color: #008000;">Third</span>, we can push the threshold check partially into the editDistance() function so that it will abort as soon as a sufficient distance is found without having to check the rest of the strings.</p>
<p><span style="color: #800080;">Fourth</span> and finally, keeping the clusters sorted by size (largest first) ensures that we'll get the most matches with the fewest cluster seeks, which reduces the number of comparisons that need to be made.  For equal-sized clusters, putting the one with shorter URLs first further increases the chance that the "difference in length" check (optimization two) will trigger, saving even more comparisons.</p>
<p>Here's the complete code with these optimizations in place (for optimization two, the threshold check moved into a separate method):</p>
<p><strong>Update 2009/09/25:</strong> I found a bug in the short-circuiting evaluation mechanism (optimization three), and have corrected the code below.  Fixing this issue required doing the diagonal optimization I mentioned at the end of the post.  It is highlighted in <span style="color: #99cc00;">green</span>.  It limits the building of the 'd' matrix to only the diagonal stripe that it is possible to traverse within the bounds of the provided threshold.</p>
<pre>def editDistance(s, t<span style="color: #008000;">, threshold = Integer.MAX_VALUE</span>) {
  <span style="color: #0000ff;">// an identical prefix contributes nothing to the distance, so strip it
  for (i in 0..&lt;Math.min(s.length(), t.length())) {
    if (s[i] != t[i]) {
      s = s.substring(i)
      t = t.substring(i)
      break;
    }
  }</span>
  int m = s.length()
  int n = t.length()
  if (m == 0) return n
  if (n == 0) return m
  <span style="color: #008000;">// the distance can never exceed the longer length, so cap the threshold,
  // and use a sentinel just past it for cells outside the diagonal stripe
  int limit = (int) Math.min((double) threshold, (double) Math.max(m, n))
  int INF = limit + 2</span>
  int[][] d = new int[m + 1][n + 1]
  <span style="color: #008000;">for (row in d) {
    java.util.Arrays.fill(row, INF)
  }</span>
  for (i in 0..m) {
    d[i][0] = i
  }
  for (j in 0..n) {
    d[0][j] = j
  }
  for (j in 1..n) {
    <span style="color: #008000;">// only build the diagonal stripe that can stay within the threshold
    int lo = Math.max(1, j - limit - 1)
    int hi = Math.min(m, j + limit + 1)
    if (lo &gt; hi) {
      return INF // lengths differ by more than the threshold
    }
    int colMin = INF
    for (i in lo..hi) {</span>
      d[i][j] = (
        s[i - 1] == t[j - 1]
        ? d[i - 1][j - 1] // same character
        : Math.min(
            Math.min(
              d[i - 1][j] + 1, // delete
              d[i][j - 1] + 1 // insert
            ),
            d[i - 1][j - 1] + 1 // substitute
          )
      )
      <span style="color: #008000;">colMin = Math.min(colMin, d[i][j])
    }
    // abort only once the entire stripe exceeds the threshold; a single
    // over-threshold cell can still be bypassed by a cheaper path
    if (colMin &gt; limit) {
      return colMin
    }</span>
  }
  d[m][n]
}
def doStringsMatch(s, t, threshold) {
  if (s == t) {
    return true;
  } else if (s == "" || t == "") {
    return false;
  }
  def maxLen = Math.max(s.length(), t.length())
  <span style="color: #ff0000;">if (Math.abs(s.length() - t.length()) / maxLen &gt; threshold) {
    return false
  }</span>
  1.0 * editDistance(s, t<span style="color: #008000;">, threshold * maxLen</span>) / maxLen &lt;= threshold
}
clusters = []
threshold = 0.1
<span style="color: #800080;">clusterComparator = { o1, o2 -&gt;
  def n = o2.size().compareTo(o1.size())
  if (n != 0) {
    return n
  }
  o1[0].length().compareTo(o2[0].length())
} as Comparator</span>
urls.each { url -&gt;
  <span style="color: #800080;">clusters.sort(clusterComparator)</span>
  for (cluster in clusters) {
    if (doStringsMatch(url, cluster[0], threshold)) {
      cluster.add(url)
      return
    }
  }
  clusters.add([url])
}</pre>
]]></content:encoded>
			<wfw:commentRss>https://www.barneyb.com/barneyblog/2009/09/24/edit-distances-and-spiders/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>PotD For The World</title>
		<link>https://www.barneyb.com/barneyblog/2009/05/16/potd-for-the-world/</link>
		<comments>https://www.barneyb.com/barneyblog/2009/05/16/potd-for-the-world/#comments</comments>
		<pubDate>Sat, 16 May 2009 15:20:02 +0000</pubDate>
		<dc:creator>barneyb</dc:creator>
				<category><![CDATA[personal]]></category>
		<category><![CDATA[potd]]></category>

		<guid isPermaLink="false">http://www.barneyb.com/barneyblog/?p=957</guid>
		<description><![CDATA[Last night was sort of the release of Pic of the Day (not safe for work, or my mom) into the wild.  The project is a couple months shy of five years old, and while I've talked about it obliquely all over the place, I've never really publicized it directly.  I'd made the assumption that [...]]]></description>
			<content:encoded><![CDATA[<p>Last night was sort of the release of <a href="http://potd.barneyb.com/">Pic of the Day</a> (not safe for work, or my mom) into the wild.  The project is a couple months shy of five years old, and while I've talked about it obliquely all over the place, I've never really publicized it directly.  I'd made the assumption that it was just a shared secret because it comes up in conversation over beers quite frequently, and I do actually refer to it by name occasionally, but I was quite wrong.  I brought my little MOO "business cards" to distribute, and I was amazed at the response.</p>
<p>For some background, PotD spiders the internet looking for pictures, downloads them, and then sends them to people, one pic per day.  Subscribers can rate each picture, right from their email, and the system learns what they want and tries to send them more of it.  It started as a joke; a friend and I thought it'd be funny to spam a third friend's inbox with dirty pictures, just for the hell of it.  Five years later, I have a fairly robust prioritization engine that I've had great fun building.</p>
<p>Obviously handing out business cards with dirty pictures on the back is going to be a conversation starter, but it was really interesting to talk about all the minutiae of the application.  I got a number of very interesting suggestions to add to my queue of things to think about.  The most common comment/question was about monetization.  I'm sure I could make a killing if I wanted to, but PotD is a hobby.  It's something I do for fun, in my free time.  The primary reason I do any sort of publicizing is because the data analysis only becomes relevant as the subscriber population grows, and that's really the fun part.  And since it's free (and the interface is largely asynchronous &#8211; via email), I don't have to worry about ensuring it's completely stable, error free, and available all the time.  That makes hacking a lot more fun because, let's face it, stability, error handling, and availability aren't typically the "fun" part of application development.</p>
<p>What was perhaps more interesting was the amount of stuff I've done that has PotD as its sole impetus (or at least primary impetus).  If you go look at my <a href="http://www.barneyb.com/barneyblog/projects/">projects page</a>, seven of the eleven projects were created purely for PotD, most notably <a href="http://www.barneyb.com/barneyblog/projects/transaction-advice/">TransactionAdvice</a>, <a href="http://www.barneyb.com/barneyblog/projects/schema-tool/">SchemaTool</a>, <a href="http://www.barneyb.com/barneyblog/projects/fb3-lite/">FB3Lite</a>, and <a href="http://www.barneyb.com/barneyblog/projects/amazon-s3-cfc/">Amazon S3 integration</a>.  The other three are <a href="http://www.barneyb.com/barneyblog/projects/combobox/">ComboBox</a>, <a href="http://www.barneyb.com/barneyblog/projects/flexchart/">FlexChart</a>, and <a href="http://www.barneyb.com/barneyblog/projects/jquery-checkbox-range-selection/">jQuery Checkbox Range Selection</a>.  <a href="http://www.barneyb.com/barneyblog/projects/cfgroovy/">CFGroovy</a> had its impetus in PotD as well, though the <a href="http://www.hibernate.org/">Hibernate</a> aspects quickly grew (out of proportion, in hindsight), and I've not actually used it in PotD beyond a couple trivial spots.  Beyond the actual projects, everything I do with SVG, <a href="http://xmlgraphics.apache.org/batik/">Batik</a> and <a href="http://www.cs.waikato.ac.nz/ml/weka/">Weka</a> (a data mining package), plus a lot of <a href="http://code.google.com/apis/chart/">Google Charts</a> stuff and most of those damned query performance issues are all PotD.</p>
<p>Over the years, the application has lived on four different servers, including one that was accessible only via an asynchronous proxy written in PHP.  Yeah, really.  It's pure Adobe ColdFusion, starting on CF7, but now on CF8.0.1.  FB3Lite is the front controller, <a href="http://www.coldspringframework.org/">ColdSpring</a> is used for all the DI/AOP needs, and the codebase is largely procedural even though the majority of it is packaged as CFCs.  Excluding third-party code, there are 129 CFM files (9,760 lines), 53 CFCs (15,794 lines), and 12 JS files (2,402 lines).  The database has about 650K data records, along with another 1.4M records that are "non-data", if you will (log tables, lookup tables, etc.).  None of these numbers are particularly sizable, and the application itself is far from the largest I've worked on, but I'd say it's the most complex because of how many different pieces there are, the variety of jobs they do, and the level of automation in the various data flows.</p>
<p>So welcome to the world, Pic of the Day.</p>
]]></content:encoded>
			<wfw:commentRss>https://www.barneyb.com/barneyblog/2009/05/16/potd-for-the-world/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
