Archive for the 'tools' Category

CF Groovy Redux

I just got through significantly revamping CF Groovy so that it's CFC based, instead of all in the custom tags.  The custom tags remain, but now you can create and manage a CF Groovy runtime instance manually, rather than letting the tags do it for you.  This will greatly assist in production performance, as the tag-based version recreated everything on each request.  It's also a prerequisite for Hibernate support, as Hibernate spinup is non-trivial and needs to be avoided unless needed, even in development.

These changes are still in the Hibernate branch in Subversion, not the main trunk.  I don't plan on moving the new code back to the trunk until the Hibernate integration is complete.

Query/Reporting DSL Bug Fix

Rob Pilic found a couple local variables that I (gasp!) forgot to var scope in the DSL implementations while he was troubleshooting some concurrency issues.  There was one in each file, though totally unrelated.  I've applied his patch to Subversion and updated the demo install (though it won't matter, because it doesn't share instances).

Scriptlets in CF Anyone?

My last post about Comparators via CF Groovy was simplistic in nature, but the underlying concept is incredibly powerful.  Here's a similar snippet (from the demo app), this time using a Comparator class:

<g:script>
Collections.sort(variables.a, new ReverseDateKeyComparator())
</g:script>

So what do we have here?  Why, it's a snippet of Java embedded directly in your CFML page, and it gets dynamically compiled and executed just like the CFML.  Let me say that again: it's a snippet of Java embedded directly in your CFML page.  It's even better than that, though, because it's Groovy, which is Java plus a whole lot more.

Beyond the scriptlet functionality, CF Groovy provides a classloading framework for Groovy scripts and classes that supplements the ones provided by the CFML runtime and the JVM.  You can define an extra Groovy classpath for your scriptlets like this:

<g:setPath path="#expandPath('/groovy')#,#expandPath('/packages')#" />

There's no way to tell how the ReverseDateKeyComparator class in the first example is implemented, but it happens to be a Groovy class that I've added to the classpath with the <g:setPath> tag.  That class, just like the scriptlet, is dynamically compiled and executed just like the CFML.  This is also very powerful, because it eliminates the compile/build/deploy cycle that is usually required for Java code in a JEE environment.

Combining these two, you can use <g:script> to execute arbitrary scripts from the Groovy classpath, instead of inlining them:

<g:script name="myScript.groovy" />

This lets you reuse scripts across CFML pages, as well as reusing classes between scripts.

I've added CF Groovy to my Projects page, so that's the place for the latest info, as well as some documentation and the all-important Subversion information.

FB3 Lite Updates

After 6 months of use in the wild, I decided to consolidate and republish a couple mods I've made to the framework. The core functionality is unchanged, so for background, check my original post from last year. The enhancements, in no particular order, are as follows:

  • Changed the config at the top of index.cfm to use CFPARAM instead, so the framework can be parameterized by calling code.
  • Most significant is the addition of an 'appSearchPath' control variable, which means index.cfm (the core file) and fbx_Switch.cfm no longer have to be in the same directory. This allows you to put index.cfm in some arbitrary directory (probably with svn:externals) and use it from there. It defaults to ".,.." (this directory and the parent directory), but can be set to whatever is needed.
  • Added the context path and script name to the front of the 'self' variable. I realize this breaks the core on non-JEE CFML implementations, but it fixes some weird redirect issues with IE, so I feel it's worth it. If you don't want it, just remove everything before the question mark.
  • Removed defaulting of the empty fuseaction. Now only a missing fuseaction will be set to the configured default.
  • Added CFABORT to the end of the location() UDF. I went back and forth on this one for quite a while (like measured in days) when I was originally writing the framework, and finally decided to leave it out. After using the framework for a while, I've decided that was the wrong decision, so I've put it in there in this version.
  • Added ability to use multiple circuits via the new 'allowMultipleCircuits' control variable. It defaults to true, which means any slashes in your fuseactions will be treated as path segments, potentially breaking existing apps. If you need to use slashes in your fuseactions, set it to false. When true, a fuseaction of "circuit/fuseaction" will be converted into an include of "circuit/fbx_Switch.cfm" with attributes.currentFuseaction set to "fuseaction". I've intentionally NOT used the dot separator because they are not circuit aliases as in real Fusebox, but are simply path segments. The include() UDF has always had this ability due to its strictly path-based nature, though it was undocumented.
  • Added some comments to index.cfm.
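
To illustrate the multiple-circuit rule from the list above, here's a rough JavaScript sketch of the path splitting. The framework itself is CFML, so this is not its actual code: resolveFuseaction and its return shape are made up for the illustration, and I'm assuming the split happens at the last slash and that a fuseaction without a circuit falls through to the top-level fbx_Switch.cfm.

```javascript
// Hypothetical helper, not part of FB3 Lite: shows how a fuseaction
// containing slashes could map to an include path plus a current fuseaction.
function resolveFuseaction(fuseaction, allowMultipleCircuits) {
  var lastSlash = fuseaction.lastIndexOf("/");
  if (!allowMultipleCircuits || lastSlash < 0) {
    // No circuit prefix (or the feature is off): use the top-level switch
    // file and pass the fuseaction through untouched (an assumption).
    return { include: "fbx_Switch.cfm", currentFuseaction: fuseaction };
  }
  return {
    // Everything before the last slash is treated as a plain path,
    // not as a circuit alias.
    include: fuseaction.substring(0, lastSlash) + "/fbx_Switch.cfm",
    currentFuseaction: fuseaction.substring(lastSlash + 1)
  };
}
```

With allowMultipleCircuits set to false, a fuseaction like "reports/2008" stays intact, which is the escape hatch for apps that already use slashes in their fuseactions.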

There is a project page for the framework, as well as Subversion access available. Current utilization (that I'm aware of) is about 15 distinct applications, most of which I either own or am a contributor to.

Checkbox Range Selection Update

Just a little update to my checkbox range selection jQuery plugin to allow chaining.  I'd forgotten to return 'this' at the end of the function.  Here's the full source, including the mod:

(function($) {
  $.fn.enableCheckboxRangeSelection = function() {
    var lastCheckbox = null;
    var $spec = this;
    $spec.bind("click", function(e) {
      if (lastCheckbox != null && e.shiftKey) {
        $spec.slice(
          Math.min($spec.index(lastCheckbox), $spec.index(e.target)),
          Math.max($spec.index(lastCheckbox), $spec.index(e.target)) + 1
        ).attr({checked: e.target.checked ? "checked" : ""});
      }
      lastCheckbox = e.target;
    });
    return $spec;
  };
})(jQuery);

You can check the project page as well, for full history and updates.

Prototype's Array.any/all with jQuery

I needed to convert a couple Array.any() and Array.all() calls (from Prototype) to jQuery syntax. Since jQuery doesn't extend the built-in objects with nice functionality like this, you have to fake it. Here's what I came up with. Old version ('images' is an array):

images.all(function(o){return o.status == "ready";})

and the new version:

jQuery.inArray(false,
  jQuery.map(images, function(o){return o.status == "ready";})
) < 0

Array.any()'s equivalent is the reverse:

jQuery.inArray(true,
  jQuery.map(images, function(o){return o.status == "ready";})
) >= 0

Might have to rip the Enumerable and Array extensions out of Prototype for standalone use as well.
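
Short of ripping Enumerable out of Prototype, a pair of plain functions would cover these two cases. This is only a sketch: the names all and any, and the free-function form, are my own choice here and not something either library ships in this shape.

```javascript
// Standalone stand-ins for Prototype's Array#all and Array#any.
// Both stop at the first element that decides the answer.
function all(items, predicate) {
  for (var i = 0; i < items.length; i++) {
    if (!predicate(items[i], i)) return false;
  }
  return true;
}

function any(items, predicate) {
  for (var i = 0; i < items.length; i++) {
    if (predicate(items[i], i)) return true;
  }
  return false;
}

// Sample data, invented for the example.
var sampleImages = [{status: "ready"}, {status: "loading"}];
var isReady = function(o) { return o.status == "ready"; };
var everyReady = all(sampleImages, isReady); // false
var someReady = any(sampleImages, isReady);  // true
```

Unlike the inArray/map combination, these bail out as soon as the result is known rather than running the callback over the whole array.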

Prototype and jQuery

Since I discovered it a few years ago, I've been a big Prototype fan.  It's simple, and gets the job done with a minimum of fuss.  It's not without warts, of course.  I still occasionally forget to put 'new' in front of Ajax.Request, and some of the Ruby-like methods share their lineage's arcane naming.  When it was new, it was the best thing around, and while it now has competitors, it's certainly not lagging behind.

At work, however, jQuery has been adopted as the standard (and I've no power to change it).  The lack of the $() function is annoying; several times I've debated adding this function (or one of various similar ones) to our library:

function $(id) {
  return jQuery("#" + id)[0];
}

I haven't, of course, as it's not the jQuery way.  jQuery also lacks any sort of class assistance, so we still use the Prototype class framework for our class-based JS.  That seems to work fairly well, except for the fact that we have to use two frameworks where one could suffice.

jQuery is not without its benefits, of course.  The plugin architecture is a nice aspect that Prototype didn't really offer an equivalent of.  It means the core stays lighter (good), but if you want additional functionality you're stuck managing files from a bunch of different projects (annoying).  Event handling is a bit more straightforward, in some ways.  "Magically" acting on collections of elements with a single call (i.e. no .each(function(o){…}) garbage) definitely makes for more readable code as well.

Because of this shift at work, I've been porting some of my personal apps over to jQuery as well.  I've actually been using a couple jQuery plugins (both self-written and external) for specific tasks for a while now, but not the core framework.  What I've found, however, is that jQuery can be prone to slow code.  To avoid a huge amount of extra work on the part of the JS interpreter, using temporary variables for jQuery objects is essential.  If you do strictly id-based queries, the degradation isn't huge, but if you do CSS-based queries, it can be significant.  With Prototype's focus on id-based queries (at least until $$() came about in 1.5), that was less of an issue.

This need to query a minimum number of times can provide a fair amount of complexity when you have more than a handful of closures hanging about and/or a dynamic DOM.  You end up doing a lot of state management work because you're, in effect, caching DOM lookups and have to ensure you never have stale cache.
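
To make the caching problem concrete, here's a minimal sketch of the kind of lookup cache I mean. makeLookupCache is a hypothetical helper, not part of jQuery; the query argument is whatever does the real lookup (in a page you'd pass jQuery itself).

```javascript
// Hypothetical helper: memoizes selector lookups and lets the caller
// drop entries when the DOM changes.
function makeLookupCache(query) {
  var cache = {};
  return {
    // Runs the real query only the first time a selector is requested.
    get: function(selector) {
      if (!Object.prototype.hasOwnProperty.call(cache, selector)) {
        cache[selector] = query(selector);
      }
      return cache[selector];
    },
    // Call after modifying the DOM; with no argument, clears everything.
    invalidate: function(selector) {
      if (selector === undefined) {
        cache = {};
      } else {
        delete cache[selector];
      }
    }
  };
}
```

The catch is exactly the state management described above: every piece of code that adds or removes elements has to remember to call invalidate(), or later callers get a stale result.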

Other than that issue and the lack of an equivalent to document.viewport, porting has been relatively painless.  Still very id heavy, so not leveraging jQuery as much as could be, but most of what I'm doing wouldn't benefit from other selectors.

Which one is better?  Hard to say.  jQuery seems to make you work harder to type less code, while Prototype seems to cost you a few more characters for a bit less density.  With the exception of Prototype's class support, their feature sets are fairly equivalent, especially with jQuery UI now available to "compete" with Scriptaculous.  For the moment, I'm choosing to use jQuery on new stuff, but wishing for Prototype every few minutes.  Until I come up against some sort of significant wall, it'll probably stay that way, just to stick with the same tooling professionally and personally.  And over time it'll probably get better as the Prototype-ness fades from apps.

IndentXml CF UDF

I had a need to fix indentation of some XML today, and a quick Googling didn't turn up much help. So I wrote a little UDF that will take an XML string and return it with all the tags nicely indented:

<cffunction name="indentXml" output="false" returntype="string">
  <cfargument name="xml" type="string" required="true" />
  <cfargument name="indent" type="string" default="  "
    hint="The string to use for indenting (default is two spaces)." />
  <cfset var lines = "" />
  <cfset var depth = "" />
  <cfset var line = "" />
  <cfset var isCDATAStart = "" />
  <cfset var isCDATAEnd = "" />
  <cfset var isEndTag = "" />
  <cfset var isSelfClose = "" />
  <cfset var i = 0 />
  <cfset xml = trim(REReplace(xml, "(^|>)\s*(<|$)", "\1#chr(10)#\2", "all")) />
  <cfset lines = listToArray(xml, chr(10)) />
  <cfset depth = 0 />
  <cfloop from="1" to="#arrayLen(lines)#" index="i">
    <cfset line = trim(lines[i]) />
    <cfset isCDATAStart = left(line, 9) EQ "<![CDATA[" />
    <cfset isCDATAEnd = right(line, 3) EQ "]]>" />
    <cfif NOT isCDATAStart AND NOT isCDATAEnd AND left(line, 1) EQ "<" AND right(line, 1) EQ ">">
      <cfset isEndTag = left(line, 2) EQ "</" />
      <cfset isSelfClose = right(line, 2) EQ "/>" />
      <cfif isEndTag>
        <!--- use max for safety against multi-line open tags --->
        <cfset depth = max(0, depth - 1) />
      </cfif>
      <cfset lines[i] = repeatString(indent, depth) & line />
      <cfif NOT isEndTag AND NOT isSelfClose>
        <cfset depth = depth + 1 />
      </cfif>
    <cfelseif isCDATAStart>
      <!---
      we don't indent CDATA ends, because that would change the
      content of the CDATA, which isn't desirable
      --->
      <cfset lines[i] = repeatString(indent, depth) & line />
    </cfif>
  </cfloop>
  <cfreturn arrayToList(lines, chr(10)) />
</cffunction>

There's nothing XML-ish about the implementation, as you can see, so you can happily feed non-XML tag based markup, as long as it uses '<' and '>' as tag delimiters in the XML fashion. Just don't expect to get good formatting if you don't have tags that follow the XML spec (e.g. CFELSE).  Also, it doesn't account for open tags (or closing tags) that are split across multiple lines. That wasn't a case I cared about, and I don't know that you can solve it correctly without actually parsing at least CDATA blocks out of the XML.

FlexChart Updates

The past month or so has seen quite a few improvements and bug fixes to FlexChart, though I haven't blogged about any of them.  Most notably, there was a weird NPE that manifested itself when loading a Pie chart via FlashVars.  For some unknown reason, Flex/Flash didn't give any indication the error was occurring; it just silently terminated the active call stack and continued on its merry way.  This left the app in a quasi-broken state that would prevent certain future calls from working, but allow others to execute without issue.  I still have no explanation as to why the error silently terminated, but I've since seen the same behaviour inside FDS, so it's not charting specific.

The ability to style charts has been extended a bit, though it's still not as highly polished as I could wish.  For example, supplying a stroke weight for a line series causes the stroke color to default to black, instead of the automatically assigned color (orange, green, blue, …).  In the reverse case, if you supply custom colors on a Pie chart, they render correctly, but the legend (if one is used) uses the default colors (orange, green, blue, …).  Gradient fills are now available as well.

I've also improved handling of empty charts.  The stock custom tag requires a descriptor, but if you don't have data at page render time, you usually end up providing "<chart />" as the descriptor.  The engine now detects this case (whether on the initial load or later passed in), and is a bit more intelligent about ensuring it clears its stage.  Previously you could end up with an empty CartesianChart in some cases.

Finally, I made a number of improvements to performance and handling of data values.  This was mostly accomplished by explicitly converting the XML nodes into real objects for the chart to render, rather than using the XML directly.  There were some implicit type conversions that didn't happen consistently out of XML nodes, but work fine out of generic objects.

Data Mining With Weka

I've a large application that has, as a major component, rank-based prioritization of assets. Users rank the assets on a one-to-five scale, and then that rank data is used to select other assets of interest for the user. If you've seen Amazon's "Recommended for you" or Netflix's recommended titles, you get the general idea.

The app was originally built back in 2004, and used a complex (and cumbersome, and slow) metadata-based algorithm. Each asset has a set of metadata facets specified. At prioritization time, an overall rank is computed for each facet for the given user, based on the rank of assets with the different facets. Unranked assets that have the high-ranked facets and not the low-ranked facets are given a high prioritization. If you've used Pandora, it's the same general idea, though I used far fewer facets. Overall, this algorithm worked quite well. I've tuned it over the years, but it's architecturally unchanged from the initial version.
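
For a feel of the facet idea, here's a much-simplified JavaScript sketch. It is not the real implementation (which was considerably more involved, and not in JavaScript). The function name, the input shapes, and the choice to score an asset by the plain mean of its facets' average ranks are all assumptions made for illustration.

```javascript
// Illustrative only: a simplified take on facet-based scoring.
// `assets` maps assetId -> array of facet names; `userRanks` maps
// assetId -> rank (1-5) for the assets this user has ranked.
function scoreUnrankedAssets(assets, userRanks) {
  var totals = Object.create(null);
  var counts = Object.create(null);
  // Pass 1: accumulate the user's ranks per facet.
  Object.keys(userRanks).forEach(function(assetId) {
    (assets[assetId] || []).forEach(function(facet) {
      totals[facet] = (totals[facet] || 0) + userRanks[assetId];
      counts[facet] = (counts[facet] || 0) + 1;
    });
  });
  // Pass 2: score each unranked asset by the mean of its facets' averages.
  var scores = {};
  Object.keys(assets).forEach(function(assetId) {
    if (Object.prototype.hasOwnProperty.call(userRanks, assetId)) return;
    var known = assets[assetId].filter(function(f) { return f in counts; });
    if (known.length === 0) return; // no ranked facet to go on
    var sum = 0;
    known.forEach(function(f) { sum += totals[f] / counts[f]; });
    scores[assetId] = sum / known.length;
  });
  return scores;
}
```

Assets that carry high-ranked facets float to the top, and a low-ranked facet drags the score down, which is the general behaviour described above.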

However, this approach has one huge problem (aside from the complexity): it requires metadata. That metadata has to be populated by someone, and it's a thankless job. I tried a few different ways to make it easier for users to contribute, but never really hit on anything that worked well, so I ended up spending a bit of time every once in a while tagging stuff. As the asset and user counts increase, the workload only goes up, so it's not a scalable solution.

Which brings me to the topic of the post: data mining with Weka.

Data mining is basically digging through a crapton of low-level data to find higher-level information. Weka is a piece of software, written in Java, that provides an array of machine learning tools, many of which can be used for data mining.

In my particular case, I wanted to remove the metadata dependency of the prioritization algorithm, and rely strictly on rank data. It took a while to really wrap my head around what I wanted to do and what the data path actually looked like, but once I figured it out it was incredibly simple to implement.

In a nutshell, I create a relation (i.e. table, spreadsheet, grid) with rows representing assets and columns representing users. The intersection of each row/column (i.e. a cell) is the rank from that user for that asset. Obviously not every user has ranked every asset, but Weka happily deals with missing data (expressed as a question mark). Here's a partial data set (12 each of assets and users), in Weka's ARFF format:

@relation 'asset-ranks'
@attribute assetId string
@attribute u2 numeric
@attribute u5 numeric
@attribute u6 numeric
@attribute u7 numeric
@attribute u8 numeric
@attribute u9 numeric
@attribute u10 numeric
@attribute u12 numeric
@attribute u13 numeric
@attribute u18 numeric
@attribute u20 numeric
@attribute u21 numeric
@data
48,1,?,?,2,?,?,?,?,?,?,?,?
50,?,?,?,3,?,?,?,2,?,?,?,?
52,1,3,4,2,?,?,4,?,?,?,?,?
70,4,3,5,5,?,2,3,3,1,4,?,?
73,2,3,1,5,?,2,?,5,1,5,?,?
91,3,?,5,2,?,?,?,?,?,?,?,?
165,1,2,4,5,1,?,?,3,1,4,1,?
196,4,2,4,3,5,3,?,?,?,?,?,?
234,3,5,4,2,4,4,4,4,3,5,?,5
235,?,5,5,1,?,2,?,?,?,?,?,?
259,?,?,5,4,?,?,?,?,1,?,?,?
261,3,4,5,4,5,4,?,3,?,?,?,?
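
Generating a file in that shape is mostly bookkeeping. Here's a sketch in JavaScript, not the code I actually run: toArff is a hypothetical name, and the input shape (an array of {assetId, userId, rank} records, e.g. rows from a database query) is an assumption.

```javascript
// Sketch: turn rank records into an ARFF relation with one row per
// asset and one numeric column per user.
function toArff(ranks) {
  var userIds = [], assetIds = [], byAsset = {};
  ranks.forEach(function(r) {
    if (userIds.indexOf(r.userId) < 0) userIds.push(r.userId);
    if (!byAsset[r.assetId]) {
      byAsset[r.assetId] = {};
      assetIds.push(r.assetId);
    }
    byAsset[r.assetId][r.userId] = r.rank;
  });
  var numeric = function(a, b) { return a - b; };
  userIds.sort(numeric);
  assetIds.sort(numeric);

  var lines = ["@relation 'asset-ranks'", "@attribute assetId string"];
  userIds.forEach(function(u) {
    lines.push("@attribute u" + u + " numeric");
  });
  lines.push("@data");
  assetIds.forEach(function(a) {
    var cells = userIds.map(function(u) {
      // A question mark is how ARFF expresses a missing value.
      return byAsset[a][u] === undefined ? "?" : byAsset[a][u];
    });
    lines.push([a].concat(cells).join(","));
  });
  return lines.join("\n");
}
```

The column order has to match the attribute order exactly, which is why the user ids are sorted once and reused for both the header and every data row.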

Running that through Weka's clustering engine breaks all the assets into clusters averaging 50 assets (my choice) in size, and appends a cluster identifier to each row in the data file. Here's the command line I use:

# $srcFile: the data above
# $clusterCount: ceil(rows / 50)
# $destFile: the data above, with a 'cluster' attribute added
# -I 1 tells the clusterer to ignore attribute 1 (assetId)
java -classpath weka.jar \
  weka.filters.unsupervised.attribute.AddCluster \
  -i "$srcFile" \
  -I 1 \
  -W "weka.clusterers.SimpleKMeans -N $clusterCount" \
  >& "$destFile"

The clusters represent groups of assets that the ranks indicate are related. The assumption is that for a given user, all assets in a given cluster will be ranked similarly, and the data bears that out. How exactly Weka is doing that, I'm not sure - voodoo may be at play.

Anyway, I read the result into the database, setting up asset-cluster relationships, and then can prioritize the clusters based on their average rank by each user. Unranked assets from the highest-priority cluster should be the assets the user is most interested in.
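
The prioritization step itself is just a grouped average. In my case it happens in the database, so take this JavaScript as a sketch under assumptions: the function names and the two input maps (assetId to cluster id, and assetId to the user's rank) are invented for the example.

```javascript
// Sketch: order clusters for one user by the average of that user's
// ranks within each cluster, highest average first.
function prioritizeClusters(assetClusters, userRanks) {
  var stats = {};
  Object.keys(userRanks).forEach(function(assetId) {
    var cluster = assetClusters[assetId];
    if (cluster === undefined) return; // asset wasn't clustered
    var s = stats[cluster] || (stats[cluster] = { sum: 0, count: 0 });
    s.sum += userRanks[assetId];
    s.count += 1;
  });
  return Object.keys(stats)
    .map(function(cluster) {
      return {
        cluster: cluster,
        avgRank: stats[cluster].sum / stats[cluster].count
      };
    })
    .sort(function(a, b) { return b.avgRank - a.avgRank; });
}

// Assets in the given cluster that this user hasn't ranked yet.
function unrankedInCluster(assetClusters, userRanks, cluster) {
  return Object.keys(assetClusters).filter(function(assetId) {
    return assetClusters[assetId] === cluster &&
      !Object.prototype.hasOwnProperty.call(userRanks, assetId);
  });
}
```

Taking the first entry from prioritizeClusters() and feeding its cluster id to unrankedInCluster() gives the candidate assets to show the user.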

This approach is not only much simpler, it's enormously faster, and it uses someone else's code (which is always a good thing). However, it's not without a significant problem of its own: it can only prioritize assets that someone has already ranked. I've addressed this by occasionally mixing in a random unranked asset to seed the pool. Time will tell if that approach works well or not; it's hard to estimate without any data.

With my trials, the two algorithms generally gave similar results. Not identical, of course, but similar. What's interesting is that the old algorithm computes an estimated rank for each unranked asset, while the latter just finds a collection of similar assets that the user indicated an interest in (via ranking some members of the collection). I'll probably look at some predictive stuff to add on top of the clustering to do actual per-asset rank predictions, but for now, it seems unneeded.

I'll be using Weka on some other projects, no question there. Like so much else, the hard part is figuring out how to express the question you want answered. Not technically so much as conceptually. Once you have that, implementation is straightforward.