Edit Distances and Spiders

An edit or string distance is the "distance" between two strings in terms of editing operations.  For example, to get from "cat" to "dog" requires three operations (replace 'c' with 'd', replace 'a' with 'o', and finally replace 't' with 'g'), thus the edit or string distance between "cat" and "dog" is three.  Aside from replace, there are also the insert and delete operations, so the distance between "cowbell" and "crowbar" is four (insert 'r', replace 'e' with 'a', replace 'l' with 'r', delete the final 'l').  This particular sort of edit distance is called the Levenshtein distance.

Here is a Groovy implementation of a function that does the computation (based on the pseudocode at http://en.wikipedia.org/wiki/Levenshtein_distance):

def editDistance(s, t) {
  int m = s.length()
  int n = t.length()
  int[][] d = new int[m + 1][n + 1]
  for (i in 0..m) {
    d[i][0] = i
  }
  for (j in 0..n) {
    d[0][j] = j
  }
  for (j in 1..n) {
    for (i in 1..m) {
      d[i][j] = (
        s[i - 1] == t[j - 1]
        ? d[i - 1][j - 1] // same character
        : Math.min(
            Math.min(
              d[i - 1][j] + 1, // delete
              d[i][j - 1] + 1 // insert
            ),
            d[i - 1][j - 1] + 1 // substitute
          )
      )
    }
  }
  d[m][n]
}
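
As a quick sanity check, the distances worked out by hand above should fall right out of the function (these asserts are mine, not from the original pseudocode):

assert editDistance("cat", "dog") == 3         // three substitutions
assert editDistance("cowbell", "crowbar") == 4 // one insert, two substitutions, one delete
assert editDistance("same", "same") == 0       // identical strings cost nothing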

That might not seem very useful, but consider the problem of grouping strings together.  This works especially well for URLs, which are hierarchical in nature, and therefore typically differ in only small ways from other similar URLs (at least as far as the site in question's internal concept of organization is concerned).  As a concrete example, you wouldn't expect "http://www.google.com/search" and "http://mail.google.com/a/barneyb.com" to be very similar pages, because their URLs are quite different.  However, you'd probably expect "http://mail.google.com/a/barneyb.com" and "http://mail.google.com/a/example.com" to be similar.
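
To put rough numbers on that intuition, a throwaway similarity helper (just for illustration here, not part of the spider) can normalize the edit distance by the longer string's length:

// hypothetical helper: 1.0 means identical, 0.0 means nothing in common
def similarity = { a, b ->
  1.0 - editDistance(a, b) / Math.max(a.length(), b.length())
}
// the first pair should score noticeably lower than the second
println similarity("http://www.google.com/search", "http://mail.google.com/a/barneyb.com")
println similarity("http://mail.google.com/a/barneyb.com", "http://mail.google.com/a/example.com")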

This came up as part of my never-ending quest to optimize Pic of the Day's behaviour, specifically the spidering aspect.  Consider a photo gallery page on some arbitrary site.  The core of it is 10-20 links to pages that show full-size images, but that is surrounded by (and possibly interleaved with) navigation, links to related content, advertisements, etc.  So the task is to get those core links and none of the other garbage.  Also keep in mind that the algorithm has to work on arbitrary pages from arbitrary sites, so you can't rely on any sort of contextual markup.

Fortunately, this is a simple task with a string distance-based algorithm.  Consider this list of URLs (the 'href' values for all 'a' tags on some gallery page, sorted alphabetically, and trimmed for illustrative purposes):

 1. http://www.hq69.com
 2. http://www.hq69.com/cgi-bin/te/o.cgi?g=home
 3. http://www.hq69.com/galleries/andi_sins_body_paint/index.php
 4. http://www.hq69.com/galleries/beatrix_secrets/beatrix_secrets_001.php
 5. http://www.hq69.com/galleries/beatrix_secrets/beatrix_secrets_002.php
 6. http://www.hq69.com/galleries/beatrix_secrets/beatrix_secrets_003.php
 7. http://www.hq69.com/galleries/beatrix_secrets/beatrix_secrets_004.php
 8. http://www.hq69.com/galleries/juliya_studio_nude/index.php
 9. http://www.hq69.com/galleries/khloe_loves_ibiza/index.php
10. http://www.hq69.com/galleries/lola_spray/index.php

You can quite easily see via visual scan that the URLs we want are lines 4-7, and while you might not realize it, you compared the relative difference between all the URLs and decided those four were sufficiently similar to be considered the "target" URLs.  The first part is an edit distance of some sort, and the latter part is some sort of relative threshold for acceptance.  More subtle is that lines 3, 8, 9, and 10 are also quite similar to each other, but not nearly as similar as the first group.

Of course, using the string distance function to arrive at this relationship isn't a direct path.  One approach is to compare every string to every other string, and then build a graph of similarity to deduce the clusters.  Unfortunately, that requires O(n^2) distance computations, which is prohibitively expensive for even moderately sized sets.  It's also really complicated to implement.  ;)

Much easier is to build the clusters directly.  Create an (empty) collection of clusters, and then loop over the URLs.  For each URL, iterate through the clusters until you find one it is sufficiently similar to, and add it to that cluster.  If you don't find a suitable cluster, create a new cluster with the URL as its sole member.  Here's some code that does just that:

clusters = []
threshold = 0.5
urls.each { url ->
  for (cluster in clusters) {
    // compare against the cluster's first member, normalizing the distance
    // by the length of the longer string
    if (1.0 * editDistance(url, cluster[0]) / Math.max(url.length(), cluster[0].length()) <= threshold) {
      cluster.add(url)
      return // returns from the closure, i.e. moves on to the next url
    }
  }
  clusters.add([url]) // nothing matched; seed a new cluster
}
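
Run over the ten URLs above (assuming the list is bound to a 'urls' variable), dumping the result should show the groupings described earlier:

clusters.each { cluster ->
  println "--- cluster of ${cluster.size()} ---"
  cluster.each { println "  $it" }
}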

You'll end up with an array of arrays, with each inner array being a cluster of similar URLs matching the clusters outlined above.  The 'threshold' variable determines how close the strings must be in order to be considered cluster members.  In this case I'm using a threshold of 0.5, which means that the edit distance must be no more than half the max length of the two strings.  I.e., at least half the characters have to match.  For example, for two URLs of lengths 70 and 61, a threshold of 0.5 permits an edit distance of up to 35, while 0.1 permits only 7.  0.5 is a ridiculously loose threshold, but I needed it in order for the second cluster to materialize.  In practice you'd want a threshold of 0.05 to 0.1 I'd say, though I haven't spent much time tuning.

This algorithm is reasonably fast and greatly reduces the number of distance computations required to build the clusters.  However, it's still pretty slow.  Fortunately, there are a few heavy-handed optimizations to make.

First and simplest, URLs are typically most different at the right end (i.e. the protocol and domain are usually constant), and since identical leading substrings don't change the distance computation, stripping an identical prefix from the strings can greatly reduce the amount of checking required without impacting the accuracy.
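
That is, a shared prefix contributes nothing to the distance, a property that should hold for the function above:

// stripping the shared prefix leaves the distance unchanged
assert editDistance("http://www.hq69.com/a", "http://www.hq69.com/b") ==
       editDistance("a", "b")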

Second, since the distance cannot be less than the difference in length between the two strings (every extra character costs at least one insert or delete), we can check that against the threshold up front and avoid doing any part of the distance computation.
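
For example, since "cow" is a prefix of "cowbell", their distance should be exactly the four-character difference in lengths, and no pair of strings can do better than that bound:

// four deletions, and no cheaper path exists
assert editDistance("cowbell", "cow") == 4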

Third, we can push the threshold check partially into the editDistance() function so that it will abort as soon as the distance is known to exceed the threshold, without having to check the rest of the strings.

Fourth and finally, keeping the clusters sorted by size (largest first) ensures that we'll get the most matches with the fewest cluster seeks, which reduces the number of comparisons that need to be made.  For equal-sized clusters, putting the one with shorter URLs first will further increase the chance that the "difference in length" check (optimization two) will trigger, saving even more comparisons.

Here's the complete code with these optimizations in place (implementing optimization two pushed the threshold check out into a separate doStringsMatch() method):

Update 2009/09/25: I found a bug in the short-circuiting evaluation mechanism (optimization three), and have corrected the code below.  Fixing this issue required doing the diagonal optimization I mentioned at the end of the post: the building of the 'd' matrix is now limited to the diagonal stripe that can possibly be traversed within the bounds of the provided threshold.

def editDistance(s, t, threshold = Integer.MAX_VALUE) {
  // optimization one: a shared prefix cannot affect the distance, so strip it
  int p = 0
  int shorter = Math.min(s.length(), t.length())
  while (p < shorter && s[p] == t[p]) {
    p++
  }
  s = s.substring(p)
  t = t.substring(p)
  int m = s.length()
  int n = t.length()
  if (m == 0 || n == 0) {
    return Math.max(m, n) // nothing left but inserts or deletes
  }
  int limit = (int) Math.min(Math.max(m, n), threshold)
  int[][] d = new int[m + 1][n + 1]
  // seed every cell over the threshold so cells outside the diagonal
  // stripe can never contribute a falsely small value
  for (row in d) {
    Arrays.fill(row, limit + 1)
  }
  for (i in 0..m) {
    d[i][0] = i
  }
  for (j in 0..n) {
    d[0][j] = j
  }
  for (j in 1..n) {
    // only build the diagonal stripe that can stay within the threshold
    int lo = Math.max(1, j - limit)
    int hi = Math.min(m, j + limit)
    int columnMin = Integer.MAX_VALUE
    for (i in lo..hi) {
      d[i][j] = (
        s[i - 1] == t[j - 1]
        ? d[i - 1][j - 1] // same character
        : Math.min(
            Math.min(
              d[i - 1][j] + 1, // delete
              d[i][j - 1] + 1 // insert
            ),
            d[i - 1][j - 1] + 1 // substitute
          )
      )
      columnMin = Math.min(columnMin, d[i][j])
    }
    // optimization three: once the entire stripe exceeds the threshold,
    // the final distance must too, so bail out
    if (columnMin > threshold) {
      return threshold * 2 // falsely inflate to avoid floating point issues
    }
  }
  d[m][n]
}
def doStringsMatch(s, t, threshold) {
  if (s == t) {
    return true
  } else if (s == "" || t == "") {
    return false
  }
  def maxLen = Math.max(s.length(), t.length())
  // optimization two: the distance is at least the difference in lengths,
  // so reject on that alone whenever possible
  if (Math.abs(s.length() - t.length()) / maxLen > threshold) {
    return false
  }
  1.0 * editDistance(s, t, threshold * maxLen) / maxLen <= threshold
}
clusters = []
threshold = 0.1
// optimization four: biggest clusters first; among equal sizes, shorter URLs first
clusterComparator = { o1, o2 ->
  def n = o2.size().compareTo(o1.size())
  if (n != 0) {
    return n
  }
  o1[0].length().compareTo(o2[0].length())
} as Comparator
urls.each { url ->
  clusters.sort(clusterComparator)
  for (cluster in clusters) {
    if (doStringsMatch(url, cluster[0], threshold)) {
      cluster.add(url)
      return // on to the next url
    }
  }
  clusters.add([url])
}
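
To exercise the whole thing, 'urls' just needs to be bound before the clustering loop runs; loading it from a file of scraped hrefs (the hrefs.txt name and its one-URL-per-line format are purely hypothetical here) works fine:

// before the clustering loop: load and sort the scraped hrefs
urls = new File("hrefs.txt").readLines().findAll { it }.sort()

// after the clustering loop: summarize what fell out
clusters.each { cluster ->
  println "${cluster.size()} URL(s), e.g. ${cluster[0]}"
}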
