Monthly Archive for April, 2008

FlexChart Updates

The past month or so has seen quite a few improvements and bug fixes to FlexChart, though I haven't blogged about any of them.  Most notably, there was a weird NPE that manifested itself when loading a Pie chart via FlashVars.  For some unknown reason, Flex/Flash didn't give any indication the error was occurring; it just silently terminated the active call stack and continued on its merry way.  This left the app in a quasi-broken state that would prevent certain future calls from working, but allow others to execute without issue.  I still have no explanation as to why the error silently terminated, but I've since seen the same behaviour inside FDS, so it's not charting specific.

The ability to style charts has been extended a bit, though it's still not as highly polished as I could wish.  For example, supplying a stroke weight for a line series causes the stroke color to default to black, instead of the automatically assigned color (orange, green, blue, …).  In the reverse case, if you supply custom colors on a Pie chart, they render correctly, but the legend (if one is used) uses the default colors (orange, green, blue, …).  Gradient fills are now available as well.

I've also improved handling of empty charts.  The stock custom tag requires a descriptor, but if you don't have data at page render time, you usually end up providing "<chart />" as the descriptor.  The engine now detects this case (whether on the initial load or passed in later), and is a bit more intelligent about ensuring it clears its stage.  Previously you could end up with an empty CartesianChart in some cases.

Finally, I made a number of improvements to performance and handling of data values.  This was mostly accomplished by explicitly converting the XML nodes into real objects for the chart to render, rather than using the XML directly.  There were some implicit type conversions that didn't happen consistently out of XML nodes, but work fine out of generic objects.
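
To make that concrete, here's a rough sketch of the idea in Python rather than ActionScript. The `<point>` element and its `label`/`value` attributes are made up for illustration and aren't FlexChart's actual descriptor format; the point is only that each node is converted once, explicitly, into a plain object with real types.

```python
import xml.etree.ElementTree as ET

# Hypothetical descriptor; FlexChart's real format may differ.
DESCRIPTOR = """
<chart>
  <point label="Q1" value="10" />
  <point label="Q2" value="12.5" />
  <point label="Q3" value="" />
</chart>
"""


def to_number(text):
    """Explicitly convert an attribute string to a float, or None if blank/invalid."""
    try:
        return float(text)
    except (TypeError, ValueError):
        return None


def points_from_xml(xml_text):
    """Turn each <point> node into a plain dict with real typed values."""
    root = ET.fromstring(xml_text)
    return [
        {"label": node.get("label", ""), "value": to_number(node.get("value"))}
        for node in root.iter("point")
    ]


points = points_from_xml(DESCRIPTOR)
print(points)
```

The renderer then only ever sees numbers or None, so there's no per-access guessing about whether "12.5" is a string or a number.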

New SSL Certificate

I installed a new "real" SSL certificate today, instead of the self-signed ones I've been using for the past few years.  It's the cheap model, but it's recognized by both IE and Firefox, which is an improvement.  Unfortunately, Eclipse doesn't recognize the CA, so you still get prompts there if you connect to my SVN.

Data Mining With Weka

I have a large application that has, as a major component, rank-based prioritization of assets. Users rank the assets on a one-to-five scale, and then that rank data is used to select other assets of interest for the user. If you've seen Amazon's "Recommended for you" or Netflix's recommended titles, you get the general idea.

The app was originally built back in 2004, and used a complex (and cumbersome, and slow) metadata-based algorithm. Each asset has a set of metadata facets specified. At prioritization time, an overall rank is computed for each facet for the given user, based on the rank of assets with the different facets. Unranked assets that have the high-ranked facets and not the low-ranked facets are given a high prioritization. If you've used Pandora, it's the same general idea, though I used far fewer facets. Overall, this algorithm worked quite well. I've tuned it over the years, but it's architecturally unchanged from the initial version.
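
A heavily simplified sketch of that facet-based idea, in Python, might look like the following. The sample assets, facets, and the plain averaging are all assumptions for illustration; the real algorithm was considerably more involved.

```python
from collections import defaultdict

# Hypothetical sample data: asset -> facets, and one user's 1-5 ranks.
ASSET_FACETS = {
    "a1": {"jazz", "vocal"},
    "a2": {"jazz", "instrumental"},
    "a3": {"metal", "vocal"},
    "a4": {"jazz"},
    "a5": {"metal"},
}
USER_RANKS = {"a1": 5, "a2": 4, "a3": 1}


def facet_scores(asset_facets, ranks):
    """Average the user's ranks across every ranked asset carrying each facet."""
    totals = defaultdict(list)
    for asset, rank in ranks.items():
        for facet in asset_facets[asset]:
            totals[facet].append(rank)
    return {facet: sum(rs) / len(rs) for facet, rs in totals.items()}


def prioritize(asset_facets, ranks):
    """Score each unranked asset by the mean score of its known facets, best first."""
    scores = facet_scores(asset_facets, ranks)
    estimates = {}
    for asset, facets in asset_facets.items():
        if asset in ranks:
            continue
        known = [scores[f] for f in facets if f in scores]
        if known:
            estimates[asset] = sum(known) / len(known)
    return sorted(estimates.items(), key=lambda item: item[1], reverse=True)


print(prioritize(ASSET_FACETS, USER_RANKS))
```

Here the unranked "jazz" asset floats to the top and the unranked "metal" asset sinks, which is the behaviour described above. Note that none of it works without the facet data.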

However, this approach has one huge problem (aside from the complexity): it requires metadata. That metadata has to be populated by someone, and it's a thankless job. I tried a few different ways to make it easier for users to contribute, but never really hit on anything that worked well, so I ended up spending a bit of time every once in a while tagging stuff. As the asset and user counts increase, the workload only goes up, so not a scalable solution.

Which brings me to the topic of the post: data mining with Weka.

Data mining is basically digging through a crapton of low-level data to find higher-level information. Weka is a piece of software, written in Java, that provides an array of machine learning tools, many of which can be used for data mining.

In my particular case, I wanted to remove the metadata dependency of the prioritization algorithm, and rely strictly on rank data. It took a while to really wrap my head around what I wanted to do and what the data path actually looked like, but once I figured it out it was incredibly simple to implement.

In a nutshell, I create a relation (i.e. table, spreadsheet, grid) with rows representing assets and columns representing users. The intersection of each row/column (i.e. a cell) is the rank from that user for that asset. Obviously not every user has ranked every asset, but Weka happily deals with missing data (expressed as a question mark). Here's a partial data set (12 each of assets and users), in Weka's ARFF format:

@relation 'asset-ranks'
@attribute assetId string
@attribute u2 numeric
@attribute u5 numeric
@attribute u6 numeric
@attribute u7 numeric
@attribute u8 numeric
@attribute u9 numeric
@attribute u10 numeric
@attribute u12 numeric
@attribute u13 numeric
@attribute u18 numeric
@attribute u20 numeric
@attribute u21 numeric
@data
48,1,?,?,2,?,?,?,?,?,?,?,?
50,?,?,?,3,?,?,?,2,?,?,?,?
52,1,3,4,2,?,?,4,?,?,?,?,?
70,4,3,5,5,?,2,3,3,1,4,?,?
73,2,3,1,5,?,2,?,5,1,5,?,?
91,3,?,5,2,?,?,?,?,?,?,?,?
165,1,2,4,5,1,?,?,3,1,4,1,?
196,4,2,4,3,5,3,?,?,?,?,?,?
234,3,5,4,2,4,4,4,4,3,5,?,5
235,?,5,5,1,?,2,?,?,?,?,?,?
259,?,?,5,4,?,?,?,?,1,?,?,?
261,3,4,5,4,5,4,?,3,?,?,?,?
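
Generating a file like the one above from raw rank data is mostly bookkeeping. Here's a minimal Python sketch, assuming the ranks are available as a mapping of (asset id, user id) pairs to ranks; the function name and the sample data are made up for illustration.

```python
def build_arff(ranks, relation="asset-ranks"):
    """Render {(asset_id, user_id): rank} as an ARFF relation, one row per asset."""
    assets = sorted({asset for asset, _ in ranks})
    users = sorted({user for _, user in ranks})
    lines = [f"@relation '{relation}'", "@attribute assetId string"]
    lines += [f"@attribute u{user} numeric" for user in users]
    lines.append("@data")
    for asset in assets:
        # '?' marks a missing value: this user has not ranked this asset.
        cells = [str(ranks.get((asset, user), "?")) for user in users]
        lines.append(",".join([str(asset)] + cells))
    return "\n".join(lines) + "\n"


# Hypothetical sample: three ranks from two users across two assets.
SAMPLE = {(48, 2): 1, (48, 7): 2, (50, 7): 3}
print(build_arff(SAMPLE), end="")
```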

Running that through Weka's clustering engine breaks all the assets into clusters averaging 50 assets (my choice) in size, and appends a cluster identifier to each row in the data file. Here's the command line I use:

# $srcFile is the data above; $clusterCount is ceil(rows / 50);
# $destFile receives the data above, with a 'cluster' attribute added
java -classpath weka.jar \
  weka.filters.unsupervised.attribute.AddCluster \
  -i "$srcFile" \
  -I 1 \
  -W "weka.clusterers.SimpleKMeans -N $clusterCount" \
  >& "$destFile"

The clusters represent groups of assets that the ranks indicate are related. The assumption is that for a given user, all assets in a given cluster will be ranked similarly, and the data bears that out. How exactly Weka is doing that, I'm not sure - voodoo may be at play.
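
For a peek behind the voodoo, here's textbook k-means in a few lines of Python. This is not Weka's implementation: seeding the centroids from the first k rows and filling missing ranks with the column mean are both simplifying assumptions of this sketch, and the sample rows are made up.

```python
import math


def column_means(rows):
    """Mean of each column, ignoring missing (None) cells; 0.0 if a column is empty."""
    means = []
    for col in zip(*rows):
        present = [v for v in col if v is not None]
        means.append(sum(present) / len(present) if present else 0.0)
    return means


def kmeans(rows, k, iterations=20):
    """Plain k-means: fill gaps with column means, then alternate assign/update."""
    means = column_means(rows)
    points = [[m if v is None else v for v, m in zip(row, means)] for row in rows]
    centroids = [list(p) for p in points[:k]]  # naive seeding: first k rows
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Each row is one asset; each column is one user's rank (None = not ranked).
ROWS = [
    [5, 4, None],
    [1, None, 2],
    [4, 5, 5],
    [2, 1, 1],
]
print(kmeans(ROWS, 2))
```

Assets that the same users rank highly end up near each other in "user space", so they land in the same cluster; that's really all there is to it.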

Anyway, I read the result into the database, setting up asset-cluster relationships, and then can prioritize the clusters based on their average rank by each user. Unranked assets from the highest-priority cluster should be the assets the user is most interested in.
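
In the app this is done in the database, but the logic can be sketched in a few lines of Python. The asset-to-cluster mapping and the user's ranks below are made-up sample data, and the function names are just for illustration.

```python
from collections import defaultdict

# Hypothetical sample data: asset -> cluster id, and one user's ranks.
ASSET_CLUSTER = {"a1": 0, "a2": 0, "a3": 0, "a4": 1, "a5": 1, "a6": 2}
USER_RANKS = {"a1": 5, "a2": 4, "a4": 2}


def cluster_priorities(asset_cluster, ranks):
    """Average the user's ranks within each cluster, highest average first."""
    by_cluster = defaultdict(list)
    for asset, rank in ranks.items():
        by_cluster[asset_cluster[asset]].append(rank)
    averages = {c: sum(rs) / len(rs) for c, rs in by_cluster.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)


def recommend(asset_cluster, ranks):
    """List unranked assets, walking clusters from best-ranked to worst."""
    picks = []
    for cluster, _ in cluster_priorities(asset_cluster, ranks):
        picks += [
            asset
            for asset, c in sorted(asset_cluster.items())
            if c == cluster and asset not in ranks
        ]
    return picks


print(cluster_priorities(ASSET_CLUSTER, USER_RANKS))
print(recommend(ASSET_CLUSTER, USER_RANKS))
```

Asset "a6" never shows up in the recommendations because the user hasn't ranked anything in its cluster, so that cluster has no average to prioritize by.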

This approach is not only much simpler, it's enormously faster, and it uses someone else's code (which is always a good thing). However, it's not without a significant problem of its own: it can only prioritize ranked assets. I've addressed this by randomly mixing in an occasional unranked asset to seed the pool. Time will tell if that approach works well or not; it's hard to estimate without any data.
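
The mixing step could be as simple as the Python sketch below. The "one unranked asset after every N recommendations" rule and the parameter names are assumptions for illustration, not necessarily what the app does.

```python
import random


def mix_in_unranked(recommended, unranked, every=5, rng=None):
    """After every `every` recommendations, slip in one randomly chosen unranked asset."""
    rng = rng or random.Random()
    pool = list(unranked)
    rng.shuffle(pool)
    mixed = []
    for count, asset in enumerate(recommended, start=1):
        mixed.append(asset)
        if count % every == 0 and pool:
            mixed.append(pool.pop())
    return mixed


# Seeded generator so repeated runs of the demo give the same mix.
result = mix_in_unranked(
    ["r1", "r2", "r3", "r4"], ["x1", "x2", "x3"], every=2, rng=random.Random(42)
)
print(result)
```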

With my trials, the two algorithms generally gave similar results. Not identical, of course, but similar. What's interesting is that the old algorithm computes an estimated rank for each unranked asset, while the new one just finds a collection of similar assets that the user indicated an interest in (via ranking some members of the collection). I'll probably look at some predictive stuff to add on top of the clustering to do actual per-asset rank predictions, but for now, it seems unneeded.

I'll be using Weka on some other projects, no question there. Like so much else, the hard part is figuring out how to express the question you want answered. Not technically so much as conceptually. Once you have that, implementation is straightforward.

I Sense AdSense

After the merciless hounding of Joshua, along with some of my own curiosity, I added AdSense ads to my blog this weekend.  My plan is to leave them there until the end of May, and then remove them.  I can't imagine there's even close to sufficient potential income to justify the ugly factor, but perhaps I'll be pleasantly surprised.  Not holding my breath.

The Best of the Best

Eight and a half years ago, I posted the 26th best mid-season time for the NCAA men's 200 Freestyle.  That was as close as I ever got to being the best of the best.  I figure that's pretty good; roughly the 99.999997th percentile of college-age men.  Three and a half weeks later, I quit competitive swimming for Heather and a life of code, never to return.

I've never come close to that level of proficiency since.  Being there again is something I long for perhaps more than anything else.  Utter competence - no question of success or failure, just how grand the success will be.  No worry about the task at hand, complete trust in yourself and the ability to enjoy every moment in all its glory.  "Poetry in motion" is cliche to the nines ; ), but it's exactly what it is.

It being Thursday night, Heather's off at choir, the kids are in bed, and I'm tired from beating my head against problem after problem (usually clients who can't make up their mind) at work.  So here I am dumping my mind into a blog post that few will read and fewer will really understand, listening to a song stream from YouTube.  Jesu Joy of Man's Desiring is a most beautiful song, and Celtic Woman's rendition from the Helix really brought back that feeling of perfection.  From 1:00 through 1:25 (particularly at 1:10 and 1:25), effortless perfection and enjoying every minute.  The rest of the song is beautiful, but pales by comparison.

Get Your ColdSpring (et al)!

For those of you not in the know, ColdSpring is a port of the Spring framework (for Java) to ColdFusion.  It provides an Inversion of Control (IoC) framework and an Aspect Oriented Programming (AOP) container for CFCs.  If you use stateful CFCs, you should be using ColdSpring.  Period.

But that's not why I'm writing.

I am absolutely flabbergasted by the growth surrounding ColdSpring (and really, CF in general).  Five years ago the thought of debating the intricacies of IoC and metadata introspection in the CF community would have been a joke.  And yet today that conversation happened on the ColdSpring mailing list.  Look back 10 years, and the closest thing to a framework was a pair of custom tags and some loose conventions for structuring apps (i.e. Fusebox 2).

Beyond ColdSpring, the "big 3" UI-layer frameworks (Fusebox, Mach-II, Model-Glue) are in wide use, undergoing active development, and growing strong and active user bases.  At least two major ORM solutions (Reactor and Transfer) are proliferating.  We've got a community-built IDE (CFEclipse) that blows the pants off any other available tool, commercial or otherwise.

So if you're a CF user, go buy yourself a beer.

Ajaxian on Prototype vs JQuery

Ajaxian posted a little blurb on benchmarking Prototype and jQuery today. I've been a Prototype guy for years, but at the office we've gone from all-Prototype to all-jQuery, and performance degradation was one of the things I noticed. I never did any actual benchmarking, just went by feel, but it's interesting to see that my perceptions were well founded.

Whether performance of JS libraries should be a huge determinant in picking one to use is up for grabs. Unless the client-side is doing a hell of a lot of work, these days' computers have plenty of CPU hanging about unused.  However, in the past couple months we've spent a lot of time working around JS performance issues at the office. I can't say that using Prototype instead of jQuery would have eliminated the bottlenecks, but clearly performance matters.

WordPress Upgrade

Just finished upgrading to WordPress 2.5, the latest K2 nightly, and a few other plugins. All went pretty smoothly, except my custom K2 style needed some tweaks to the CSS due to some selector changes in K2's markup. They've done some nice things with the admin UI, and the new Admin Drop Down Menu makes it way better.

One thing I noticed is that the post slug is committed on the first autosave, which I don't recall being the case before. You can edit it, of course, but if you enter your title and then go back to edit it later, your slug isn't automatically updated anymore. The category list is also in a far less handy position beneath the editor, rather than next to it.

Overall, however, looks like good stuff. Still waiting for official WordPress 2.5 support from K2, but certainly not holding anything up.

Build-Time Aggregation of JS/CSS Assets

Ben Nadel posted about compiling multiple linked files (JS/CSS) into a single file this morning, and he does it at runtime. I commented about doing it at build-time instead, and a couple people were wondering more, so here's a brief explanation.

The first part is a properties file (which can be read by both Ant and CF (or whatever)). Here's an example (named agg.js.properties):

# the type of file being aggregated (used to do minification)
type         = js
# the URL path the files are relative to.
urlBasePath  = /marketing/js/
# the list of filenames to aggregate.  The first line (with the equals
# sign) should be a filename and a backslash, all other lines should be a
# comma, a filename, and a backslash (no backslash on the last line).
# Indentation is irrelevant.
filenames    = date.js\
  ,jquery-latest.js\
  ,ui.datepicker.js\
  ,ui.mouse.js\
  ,ui.slider.js\
  ,ui.draggable.js\
  ,jquery.dimensions.js\
  ,jquery.easing.1.2.js\
  ,jquery-easing-compatibility.1.2.js\
  ,coda-slider.1.1.1.js\
  ,jquery.tooltip.min.js\
  ,jScrollPane.min.js\
  ,jquery.metadata.js\
  ,prototype.classes.js\
  ,reporting.js\
  ,jquery.ajaxQueue-min.js\
  ,script.js
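
The "or whatever" part is easy to believe: the format is simple enough that a few lines of Python can read it too. This sketch handles only the subset used above (comments, key = value, and backslash line continuations with leading whitespace dropped), not the full Java properties format, and the function name is made up.

```python
def parse_properties(text):
    """Parse a small subset of the Java .properties format into a dict."""
    props = {}
    logical = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not logical and (not line or line[0] in "#!"):
            continue  # blank line or comment
        if line.endswith("\\"):
            logical += line[:-1]  # trailing backslash: value continues on next line
            continue
        logical += line
        key, _, value = logical.partition("=")
        props[key.strip()] = value.strip()
        logical = ""
    return props


SAMPLE = """\
# the type of file being aggregated
type         = js
urlBasePath  = /marketing/js/
filenames    = date.js\\
  ,jquery-latest.js\\
  ,script.js
"""

props = parse_properties(SAMPLE)
print(props["filenames"].split(","))
```

Because the continuation lines are joined before splitting on commas, the multi-line list collapses into one ordinary comma-separated value.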

This sets up everything needed for the aggregation. Within our project, we have this file as a peer of the property file (named agg.js.cfm):

<cfscript>
// the properties file sits next to this template: agg.js.cfm -> agg.js.properties
filename = replace(getCurrentTemplatePath(), ".cfm", ".properties");
fis = createObject("java", "java.io.FileInputStream").init(filename);
bis = createObject("java", "java.io.BufferedInputStream").init(fis);
props = createObject("java", "java.util.Properties").init();
props.load(bis);
bis.close(); // also closes the underlying FileInputStream
urlBasePath = props.getProperty("urlBasePath");
type = props.getProperty("type");
filenames = listToArray(props.getProperty("filenames"));
// write one LINK or SCRIPT tag per listed asset
for (i = 1; i LTE arrayLen(filenames); i = i + 1) {
	if (type EQ "css") {
		writeOutput('<link rel="stylesheet" href="#urlBasePath##filenames[i]#" type="text/css" />');
	} else { // js
		writeOutput('<script src="#urlBasePath##filenames[i]#" type="text/javascript"></script>');
	}
	writeOutput(chr(10));
}
</cfscript>

It reads the properties file, and writes out either LINK or SCRIPT tags as appropriate to the individual assets. This facilitates easy debugging in development, because nothing is modified from its source. The file is included into the HEAD of our layout templates to get everything into the page.

The real magic happens with Ant, which we use for our deployments. Within the build file, we have a call to the aggregateAssets target for each properties file:

<antcall target="aggregateAssets">
  <param name="propfile" value="${output}/wwwroot/marketing/templates/agg.js.properties" />
  <param name="rootdir" value="${output}/wwwroot/marketing/js" />
</antcall>

The params specify the properties file and the root directory. Note that the rootdir param corresponds with the urlBasePath in the properties file. The target itself looks like this:

<target name="aggregateAssets">
  <!-- read the aggregation properties -->
  <property file="${propfile}" prefix="agg" />

  <!-- get the root -->
  <propertyregex property="agg.root"
    input="${propfile}"
    regexp="^(.*)\.properties$"
    select="\1" />

  <!-- split the root into file and path sections -->
  <propertyregex property="agg.fileroot"
    input="${agg.root}"
    regexp="^.*/([^/]+)$"
    select="\1" />
  <propertyregex property="agg.pathroot"
    input="${agg.root}"
    regexp="^(.*/)[^/]+$"
    select="\1" />

  <!-- set up the output file stuff -->
  <property name="agg.outfile" value="${rootdir}/${agg.fileroot}" />
  <property name="agg.cfmfile" value="${agg.root}.cfm" />
  <property name="minsuffix" value=".yuimin" />

  <!-- run everything through the YUI Compressor -->
  <for list="${agg.filenames}" param="filename">
    <sequential>
      <echo message="compressing @{filename} to @{filename}${minsuffix} (in ${rootdir})" />
      <java classname="com.yahoo.platform.yui.compressor.YUICompressor"
        failonerror="true"
        output="${rootdir}/@{filename}${minsuffix}"
        append="true"
        logError="true"
        fork="true">
        <arg value="--type"/>
        <arg value="${agg.type}"/>
        <arg value="--nomunge"/>
        <arg file="${rootdir}/@{filename}" />
        <classpath>
          <pathelement path="${java.class.path}"/>
        </classpath>
      </java>
    </sequential>
  </for>

  <!-- aggregate all the compressed files together -->
  <echo file="${agg.outfile}" message="// built by Ant using YUI Compressor" />
  <for list="${agg.filenames}" param="filename">
    <sequential>
      <concat destfile="${agg.outfile}" append="true">
        <header trimleading="true">
          // @{filename}
        </header>
        <filelist dir="${rootdir}" files="@{filename}${minsuffix}" />
      </concat>
    </sequential>
  </for>

  <!-- delete all the compressed files -->
  <delete>
    <fileset dir="${rootdir}" includes="*${minsuffix}" />
  </delete>

  <!-- write the CFM file to pull in the compressed and aggregated file -->
  <if>
    <equals arg1="${agg.type}" arg2="css" />
    <then>
      <echo file="${agg.cfmfile}"><![CDATA[<link rel="stylesheet" href="${agg.urlBasePath}${agg.fileroot}" type="text/css" />]]></echo>
    </then>
    <else>
      <echo file="${agg.cfmfile}"><![CDATA[<script src="${agg.urlBasePath}${agg.fileroot}" type="text/javascript"></script>]]></echo>
    </else>
  </if>
</target>

First, it reads the properties file, runs each listed asset through the YUI Compressor, and then aggregates the result. Finally, it overwrites agg.js.cfm (from above) with one that contains a single LINK/SCRIPT element to the aggregation result. End result is a single aggregated, compressed asset in production for speed, and separate uncompressed assets in development for easy debugging.
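
If you don't use Ant, the aggregation half of this is easy to reproduce in whatever your build is scripted in. Here's a minimal Python sketch of just the concatenation step. It skips minification entirely, and the function name, header text, and demo files are all made up for illustration.

```python
import tempfile
from pathlib import Path


def aggregate(root_dir, filenames, out_name):
    """Concatenate the listed files, in order, into one file with a header per source."""
    root = Path(root_dir)
    parts = ["// built by a sketch script, not minified\n"]
    for name in filenames:
        parts.append(f"// {name}\n")
        parts.append((root / name).read_text().rstrip("\n") + "\n")
    out_path = root / out_name
    out_path.write_text("".join(parts))
    return out_path


# Demo in a throwaway directory with two tiny hypothetical scripts.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.js").write_text("var a = 1;\n")
    Path(tmp, "b.js").write_text("var b = a + 1;")
    combined = aggregate(tmp, ["a.js", "b.js"], "agg.js").read_text()

print(combined, end="")
```

Order matters here just as it does in the properties file: the files are written out in exactly the order listed, so dependencies have to come first.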

Edit: Do note that you'll need both the ant-contrib package and the YUI Compressor JARs to be installed into Ant for this to work.

S3 is Sweet (One App Down)

This weekend I ported my big filesystem-based app to S3, and it went like a dream. It's an image-management application, with all the actual images stored on disk. In addition to the standard import/edit/delete, the app provides automatic on-the-fly thumbnail generation, along with primitive editing capabilities (crop, resize, rotate, etc.). With images on local disk, that's all really easy: read them in, do whatever, write them back out. I figured using S3 would make things both more cumbersome and less performant. Both suspicions turned out to be unwarranted.

Building on the 's3Url' UDF that I published last week, I whipped up a little CFC to manage file storage on S3 with a very simple API. It has s3Url, putFileOnS3, getFileFromS3, s3FileExists, and deleteS3File methods, which all do about what you'd expect. You can grab the code here: amazons3.cfc.txt (make sure you remove the ".txt" extension) or visit the project page. It uses the simple HTTP-based interface, so after the authentication is handled, it's all very simple and fast. I haven't looked at the SOAP interface - why bother complicating a simple task?
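
To give a feel for how little there is to the authentication, here's a Python sketch of building a signed, expiring GET URL using S3's original query-string scheme (what's now called signature version 2; AWS has since moved to version 4). This is not the CFC's code, which may differ in its details, and the bucket, key, and credentials are placeholders.

```python
import base64
import hashlib
import hmac
from urllib.parse import quote


def s3_url(bucket, key, access_key, secret_key, expires):
    """Build a query-string-authenticated GET URL (legacy S3 signature version 2)."""
    resource = f"/{bucket}/{quote(key)}"
    # Verb, Content-MD5, Content-Type, Expires, then the canonical resource.
    string_to_sign = f"GET\n\n\n{expires}\n{resource}"
    digest = hmac.new(
        secret_key.encode("utf-8"), string_to_sign.encode("utf-8"), hashlib.sha1
    ).digest()
    signature = quote(base64.b64encode(digest).decode("ascii"), safe="")
    return (
        f"https://s3.amazonaws.com{resource}"
        f"?AWSAccessKeyId={access_key}&Expires={expires}&Signature={signature}"
    )


# Placeholder credentials and a fixed expiry (Unix timestamp) for illustration only.
url = s3_url("my-bucket", "photos/cat.jpg", "AKIDEXAMPLE", "not-a-real-secret", 1209600000)
print(url)
```

The whole thing is one HMAC-SHA1 over a short string, which is why the plain HTTP interface is so pleasant to work with.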

With that CFC (and an application-specific wrapper to take care of some path-related transforms), porting the whole app took about two hours. I also realized after I was mostly done that the CF image tools accept URLs as well as files, so I switched my image reads to just use URLs instead of pulling the file local and reading it from disk.

As for moving all the actual content, S3Sync was a champ, moving about 4.5GB of data from my Cari server to S3 in a few hours, including gracefully handling a couple errors raised by S3 (which a retry - performed automatically - solved), and a stop/restart in the middle. Total cost: about 65 cents.

Next is porting the blogs, including all the Picasa-based galleries. Unfortunately, that means writing PHP, but with how easy the CF stuff was, I don't think it'll be too much effort.