Data Mining With Weka

I have a large application with rank-based prioritization of assets as a major component. Users rank the assets on a one-to-five scale, and then that rank data is used to select other assets of interest for the user. If you've seen Amazon's "Recommended for you" or Netflix's recommended titles, you get the general idea.

The app was originally built back in 2004, and used a complex (and cumbersome, and slow) metadata-based algorithm. Each asset has a set of metadata facets specified. At prioritization time, an overall rank is computed for each facet for the given user, based on the ranks of the assets that carry that facet. Unranked assets that have the high-ranked facets and not the low-ranked facets are given a high prioritization. If you've used Pandora, it's the same general idea, though I used far fewer facets. Overall, this algorithm worked quite well. I've tuned it over the years, but it's architecturally unchanged from the initial version.
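
For the curious, here's roughly the shape of it. This is a simplified, hypothetical sketch rather than the production code (the Asset type and the field names are made up for illustration):

import java.util.*;

// Hypothetical sketch of the old facet-based prioritization (illustrative only).
// For one user: score each facet as the average rank of that user's ranked
// assets carrying the facet, then rank unranked assets by the average score
// of their facets.
class Asset {
    long id;
    Set<String> facets;
}

class FacetPrioritizer {
    // userRanks: assetId -> the 1-5 rank this user gave that asset
    static List<Asset> prioritize(List<Asset> allAssets, Map<Long, Integer> userRanks) {
        Map<String, Double> totals = new HashMap<>();
        Map<String, Integer> counts = new HashMap<>();
        for (Asset a : allAssets) {
            Integer rank = userRanks.get(a.id);
            if (rank == null) continue;                    // unranked: no signal yet
            for (String facet : a.facets) {
                totals.merge(facet, (double) rank, Double::sum);
                counts.merge(facet, 1, Integer::sum);
            }
        }
        Map<String, Double> facetScore = new HashMap<>();
        totals.forEach((f, t) -> facetScore.put(f, t / counts.get(f)));

        // Unranked assets inherit the average score of their facets; high-ranked
        // facets pull them up, low-ranked facets pull them down.
        List<Asset> unranked = new ArrayList<>();
        Map<Long, Double> assetScore = new HashMap<>();
        for (Asset a : allAssets) {
            if (userRanks.containsKey(a.id)) continue;
            double sum = 0;
            int n = 0;
            for (String facet : a.facets) {
                Double s = facetScore.get(facet);
                if (s != null) { sum += s; n++; }
            }
            assetScore.put(a.id, n == 0 ? 3.0 : sum / n);  // 3 = neutral
            unranked.add(a);
        }
        unranked.sort((x, y) -> Double.compare(assetScore.get(y.id), assetScore.get(x.id)));
        return unranked;
    }
}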

However, this approach has one huge problem (aside from the complexity): it requires metadata. That metadata has to be populated by someone, and it's a thankless job. I tried a few different ways to make it easier for users to contribute, but never really hit on anything that worked well, so I ended up spending a bit of time every once in a while tagging stuff. As the asset and user counts increase, the workload only goes up, so it's not a scalable solution.

Which brings me to the topic of the post: data mining with Weka.

Data mining is basically digging through a crapton of low-level data to find higher-level information. Weka is a piece of software, written in Java, that provides an array of machine learning tools, many of which can be used for data mining.

In my particular case, I wanted to remove the metadata dependency of the prioritization algorithm, and rely strictly on rank data. It took a while to really wrap my head around what I wanted to do and what the data path actually looked like, but once I figured it out it was incredibly simple to implement.

In a nutshell, I create a relation (i.e. table, spreadsheet, grid) with rows representing assets and columns representing users. The intersection of each row/column (i.e. a cell) is the rank from that user for that asset. Obviously not every user has ranked every asset, but Weka happily deals with missing data (expressed as a question mark). Here's a partial data set (12 each of assets and users), in Weka's ARFF format:

@relation 'asset-ranks'
@attribute assetId string
@attribute u2 numeric
@attribute u5 numeric
@attribute u6 numeric
@attribute u7 numeric
@attribute u8 numeric
@attribute u9 numeric
@attribute u10 numeric
@attribute u12 numeric
@attribute u13 numeric
@attribute u18 numeric
@attribute u20 numeric
@attribute u21 numeric
@data
48,1,?,?,2,?,?,?,?,?,?,?,?
50,?,?,?,3,?,?,?,2,?,?,?,?
52,1,3,4,2,?,?,4,?,?,?,?,?
70,4,3,5,5,?,2,3,3,1,4,?,?
73,2,3,1,5,?,2,?,5,1,5,?,?
91,3,?,5,2,?,?,?,?,?,?,?,?
165,1,2,4,5,1,?,?,3,1,4,1,?
196,4,2,4,3,5,3,?,?,?,?,?,?
234,3,5,4,2,4,4,4,4,3,5,?,5
235,?,5,5,1,?,2,?,?,?,?,?,?
259,?,?,5,4,?,?,?,?,1,?,?,?
261,3,4,5,4,5,4,?,3,?,?,?,?
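
Generating that file is just a query and a loop. Here's a rough sketch of what the export could look like (hypothetical types and names, not my actual export code):

import java.io.PrintWriter;
import java.util.*;

// Hypothetical sketch: dump an asset-by-user rank grid to ARFF.
// 'ranks' maps assetId -> (userId -> 1-5 rank); 'userIds' lists every user
// that appears anywhere, in a stable order. Missing ranks are written as '?'.
class ArffExport {
    static void write(Map<Long, Map<Long, Integer>> ranks,
                      List<Long> userIds, PrintWriter out) {
        out.println("@relation 'asset-ranks'");
        out.println("@attribute assetId string");
        for (Long u : userIds) {
            out.println("@attribute u" + u + " numeric");
        }
        out.println("@data");
        for (Map.Entry<Long, Map<Long, Integer>> row : ranks.entrySet()) {
            StringBuilder line = new StringBuilder(row.getKey().toString());
            for (Long u : userIds) {
                Integer rank = row.getValue().get(u);
                line.append(',').append(rank == null ? "?" : rank.toString());
            }
            out.println(line);
        }
    }
}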

Running that data through Weka's clustering engine breaks all the assets into clusters averaging 50 assets (my choice) in size, and appends a cluster identifier to each row in the data file. Here's the command line I use:

# $srcFile is the data above; $clusterCount is ceil(rows / 50); $destFile gets
# the same data back, with a 'cluster' attribute added
java -classpath weka.jar \
  weka.filters.unsupervised.attribute.AddCluster \
  -i $srcFile \
  -I 1 \
  -W "weka.clusterers.SimpleKMeans -N $clusterCount" \
  >& $destFile
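
The same thing can be done from Java rather than the command line. Something along these lines should work (an untested sketch; the file names are placeholders, and DataSource/ArffSaver assume a reasonably recent Weka):

import java.io.File;

import weka.clusterers.SimpleKMeans;
import weka.core.Instances;
import weka.core.converters.ArffSaver;
import weka.core.converters.ConverterUtils.DataSource;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.AddCluster;

public class ClusterAssets {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("asset-ranks.arff");      // the data above

        SimpleKMeans kMeans = new SimpleKMeans();
        kMeans.setNumClusters((int) Math.ceil(data.numInstances() / 50.0));

        AddCluster addCluster = new AddCluster();
        addCluster.setClusterer(kMeans);
        addCluster.setIgnoredAttributeIndices("1");                // don't cluster on assetId
        addCluster.setInputFormat(data);

        Instances clustered = Filter.useFilter(data, addCluster);  // adds the 'cluster' attribute

        ArffSaver saver = new ArffSaver();
        saver.setInstances(clustered);
        saver.setFile(new File("asset-clusters.arff"));
        saver.writeBatch();
    }
}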

The clusters represent groups of assets that the ranks indicate are related. The assumption is that for a given user, all assets in a given cluster will be ranked similarly, and the data bears that out. How exactly Weka is doing that, I'm not sure – voodoo may be at play.

Anyway, I read the result into the database, setting up asset-cluster relationships, and then can prioritize the clusters based on their average rank by each user. Unranked assets from the highest-priority cluster should be the assets the user is most interested in.
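
The prioritization step itself is little more than a group-by. Here's a rough sketch of the idea (hypothetical names; in the real app this is queries against the asset-cluster and rank tables):

import java.util.*;

// Sketch: given one user's ranks (assetId -> 1-5) and the cluster assignments
// (assetId -> clusterId), rank clusters by the user's average rank within them,
// then surface the unranked assets from the best cluster.
class ClusterPrioritizer {
    static List<Long> recommend(Map<Long, Integer> userRanks,
                                Map<Long, Integer> assetCluster) {
        Map<Integer, Double> totals = new HashMap<>();
        Map<Integer, Integer> counts = new HashMap<>();
        userRanks.forEach((assetId, rank) -> {
            Integer cluster = assetCluster.get(assetId);
            if (cluster == null) return;                  // asset wasn't clustered
            totals.merge(cluster, (double) rank, Double::sum);
            counts.merge(cluster, 1, Integer::sum);
        });
        if (totals.isEmpty()) return Collections.emptyList();

        // Highest average rank wins.
        int bestCluster = totals.entrySet().stream()
            .max(Comparator.comparingDouble((Map.Entry<Integer, Double> e) ->
                e.getValue() / counts.get(e.getKey())))
            .get().getKey();

        // Unranked members of the top cluster are the recommendations.
        List<Long> picks = new ArrayList<>();
        assetCluster.forEach((assetId, cluster) -> {
            if (cluster == bestCluster && !userRanks.containsKey(assetId)) {
                picks.add(assetId);
            }
        });
        return picks;
    }
}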

This approach is not only much simpler, it's enormously faster, and it uses someone else's code (which is always a good thing). However, it's not without a significant problem of its own: it can only prioritize ranked assets. I've addressed this by mixing in an occasional random unranked asset to seed the pool. Time will tell if that approach works well or not; it's hard to estimate without any data.
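
The seeding itself is trivial. Something along these lines (a sketch; the probability is a made-up knob, not a number I've settled on):

import java.util.List;
import java.util.Random;

// Sketch: occasionally swap a recommendation for a completely unranked asset
// that the clustering can't reach yet, so it has a chance to collect ranks.
class Seeder {
    private static final double SEED_PROBABILITY = 0.1;   // made-up knob
    private static final Random RANDOM = new Random();

    static void seed(List<Long> recommendations, List<Long> coldAssets) {
        for (int i = 0; i < recommendations.size() && !coldAssets.isEmpty(); i++) {
            if (RANDOM.nextDouble() < SEED_PROBABILITY) {
                recommendations.set(i, coldAssets.remove(RANDOM.nextInt(coldAssets.size())));
            }
        }
    }
}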

In my trials, the two algorithms generally gave similar results. Not identical, of course, but similar. What's interesting is that the old algorithm computes an estimated rank for each unranked asset, while the new one just finds a collection of similar assets that the user has indicated an interest in (by ranking some members of the collection). I'll probably look at some predictive stuff to add on top of the clustering to do actual per-asset rank predictions, but for now, it seems unneeded.

I'll be using Weka on some other projects, no question there. Like so much else, the hard part is figuring out how to express the question you want answered. Not technically so much as conceptually. Once you have that, implementation is straightforward.

4 responses to “Data Mining With Weka”

  1. Jim

    Hi Barney,

    How did your test of seeding unranked items go? Did it skew the results at all?

    Can you post some of your findings?

    Thanks,

    –j

  2. Weka Mining Update: It Works! at BarneyBlog

    [...] in April, I posted about how I was using Weka to do some asset prioritization.  The gist of it was that users would rank assets on a 1-5 scale, and then the system would [...]

  3. Dan

    The clustering algorithm Weka uses is defined in the command you run; in the case above it is k-means clustering.
    http://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm

    The Expectation Maximisation (EM) clustering is "better", and as a plus you don't have to guess/provide the number of clusters, as it can figure this out using cross-validation (it internally runs multiple times and picks the number of clusters that resulted in the highest expectation).