Visiting Recursion

Recursion is a very powerful technique, but when processing certain types of data structures you can end up with problems.  And not the "compiler error" kind of problems, the bad semantic kind.  Fortunately, at least one class of them is pretty easy to solve with visit tracking, though it's a subtle solution.  But I'm going to build up to that from bare-bones recursion first.

As you know, a recursive algorithm is one which splits the problem into two parts: the base case and the recursive case.  The base case is a fixed solution to a simple form of the problem the algorithm is designed to solve.  The recursive case, on the other hand, is a dynamic solution which is defined as "a little work plus the solution to a simpler form of the same problem".  That simpler form of the problem is solved by reinvoking the recursive algorithm, and eventually reaches the base case (which breaks the cycle).  Consider computing factorials, which is the de facto standard for discussing recursion:

function factorial(n) {
  if (n < 0)
    throw 'NegativeArgumentException'
  return n == 0 || n == 1 ? 1 : n * factorial(n - 1)
}

Here the conditional is used to pick between the base case and the recursive case.  As you can see, the recursive case is defined by a small bit of work (multiplying by 'n') and then solving a simple form of the problem (the factorial of n – 1).  The most important thing to note is that the recursive case is guaranteed to eventually recurse down to the base case (or raise an exception).  That means infinite recursion is impossible, which is good.  Now onward to the interesting stuff…

Consider this data structure:

barney = {
  "name": "Barney",
  "sex": "male",
  "dob": "1980-06-10",
  "children": [
    {
      "name": "Lindsay",
      "sex": "female",
      "dob": "2004-01-09",
      "children": []
    },
    {
      "name": "Emery",
      "sex": "male",
      "dob": "2005-08-12",
      "children": []
    }
  ]
}

This is a hierarchical data structure (i.e., a tree), where there is a single root node (the structure named "Barney") and then some number of child nodes.  We can see here that there are only two levels ("Barney" and "Barney's children"), but you can imagine here in 40 years that there will very likely be at least one more level, probably two more.  The point is that we don't know how deep the structure is, but we do know that it isn't infinitely deep.  It has to end somewhere.

How could I find the average age of everyone?  Well, I'd use recursion to do it, but if you think about what we need to do, you'll quickly see there's a problem.  Computing an average has to be done after all the aggregation is complete – you can't average subaverages and have it come out right unless you also weight the subaverages.  So we'll need to either sum the total age and divide by the number of people or compute subaverages and track the number of people "within" them.  I'm going to take the first approach.  Here are my functions:

function averageAgeInSeconds(family) {
  var result = {
    totalAge: 0,
    count: 0
  }
  averageAgeParts(family, result)
  return Math.round(result.totalAge / result.count)
}
function averageAgeParts(family, result) {
  for (var i = 0, l = family.length; i < l; i++) {
    result.totalAge += getAgeInSeconds(family[i])
    result.count += 1
    averageAgeParts(family[i].children, result)
  }
}
function getAgeInSeconds(person) {
  return Math.floor((new Date().valueOf() - Date.parse(person.dob)) / 1000)
}

As you can see, I'm creating a 'result' structure for storing my aggregates, then using the 'averageAgeParts' helper function which is where the recursion happens.  After the aggregates are done, I'm doing the division in the main function to get the average.  Note that I don't have an explict base case anywhere.  The reason is that it's impossible for the data structure to be infinitely deep; I'm relying on that fact to act as my base case and let the recursion bottom out.  In more direct terms, at some point the loop inside averageAgeParts will be traversing an empty array (e.g., Lindsay's children), which means the recursive call will not be invoked (the base case).

Just for reference, the average age is about 14.7 years.

The important thing to note here is that my main function isn't directly recursive.  Instead it delegates to a recursive helper method and supplies some additional context (the 'result' variable) for it to operate on while wakling the tree.  This extra stuff is critical for solving a lot of problems with recursion.  I'm not going to show the implementation, but consider how you'd change this so you could request the average age of a family, but constrain it to only females (be careful to ensure you count female descendents of males).  How about if you wanted to allow counting a certain number of generations (regardless of the total tree depth)?

Now on to the whole point: visitation.  The data structure I've shown to this point is a tree, as we discussed.  But it doesn't represent the real world: a child has two parents, not one.  That's no longer hierarchical, so we can't stick with a tree, we need to generalize into a graph.  (Just to be clear, trees are graphs, but with the extra constraint of hierarchy.)  So how might our structure look now?  First of all, we can't represent it with a literal; we'll need to define a tree structure first and then add some extra linkages to "break" the tree nature and revert it to just a graph.  Using the 'barney' object from above, here's how our graph might look.

boisverts = [
  barney,
  {
    "name": "Heather",
    "sex": "female",
    "dob": "1980-02-12",
    "children": Array.concat(barney.children)
  }
]

The last line creates a 'children' array for Heather which contains the same contents as Barney's 'children' array (but is a separate array).  Now we can access Emery as either boisverts[0].children[1] or as boisverts[1].children[1] and it's the same object.  That's important: it's what makes this a graph instead of a tree.  So now what happens when we run our averageAgeInSeconds function on 'boisverts'?  It'll run just dandy, but while the correct answer is 18.9 years, the result will be 14.8 years.  The reason is that it'll count Emery and Lindsay twice (once as Barney's children and again as Heather's children).

What we need is some way to keep track of what we've already processed (visited) so we can avoid processing stuff multiple times, and we can do that with a new subkey in our existing 'result' structure:

function averageAgeInSeconds(family) {
  var result = {
    totalAge: 0,
    count: 0,
    visited: []
  }
  averageAgeParts(family, result)
  return Math.round(result.totalAge / result.count)
}
function averageAgeParts(family, result) {
  for (var i = 0, l = family.length; i < l; i++) {
    if (visit(family[i], result)) {
      result.totalAge += getAgeInSeconds(family[i])
      result.count += 1
      averageAgeParts(family[i].children, result)
    }
  }
}
function visit(o, result) {
  for (var i = 0, l = result.visited.length; i < l; i++) {
    if (result.visited[i] === o) {
      return false
    }
  }
  result.visited.push(o)
  return true
}

What I've done is create a 'visit' function to keep track of which objects have been visited.  Now 'averageAgeParts' uses 'visit' to check if it should visit (process) an object.  I've implemented 'visit' with an array as I wanted illustrative clarity not performance.  In the real world you'd use a hashtable instead of an array so you'll have O(n) peformance instead of O(n2).

Now when we run this we'll get 18.9 years, as expected, because even though Emery and Lindsay will be iterated over twice, they'll only be processed the first time.  More specifically, the 'visit' function will return false on the second pass so the body of the loop will be skipped and their values won't be added to the aggregates the second time.

This sort of problem crops up all the time in computer science, and while I've shown it here with recursion, it's not necessarily tied to recursive algorithms.  One of the ones which is quite common, and fortunately "below the radar" for a lot of people, is garbage collection in modern virtual machines.  Another place is with serializing objects, either for permanent storage or transport across the wire.  The specific solutions are different, but they all end up using some sort of visit tracking on the graph nodes or edges.

NB: This post was prompted by a discussion on the Taffy users mailing list, but as it has myriad other implications, I've discussed the topic in a more general way.

NB: All code is JavaScript, and should run in any reasonable environment.  I did all my testing in Firefox 3.5, but I'd expect this to work back in Netscape 4.7 and IE 4.

http://groups.google.com/group/taffy-users

2 responses to “Visiting Recursion”

  1. Adam Tuttle

    Thanks for the post Barney, this does very clearly illustrate what you were talking about on the mailing list. :)

  2. Glyn Jackson

    Thanks for a in-depth post! As Adam said , clearly illustrates perfectly. Would be cool in the future if possible, to show this with the mapto as an example. Maybe as part of the Taffy examples. Helped me understand a lot and I'm sure it will benefit others too. Thanks again.