Archive for the 'tools' Category

IndentXml CF UDF

I had a need to fix indentation of some XML today, and a quick Googling didn't turn up much help. So I wrote a little UDF that will take an XML string and return it with all the tags nicely indented:

<cffunction name="indentXml" output="false" returntype="string">
  <cfargument name="xml" type="string" required="true" />
  <cfargument name="indent" type="string" default="  "
    hint="The string to use for indenting (default is two spaces)." />
  <cfset var lines = "" />
  <cfset var depth = "" />
  <cfset var line = "" />
  <cfset var isCDATAStart = "" />
  <cfset var isCDATAEnd = "" />
  <cfset var isEndTag = "" />
  <cfset var isSelfClose = "" />
  <cfset xml = trim(REReplace(xml, "(^|>)\s*(<|$)", "\1#chr(10)#\2", "all")) />
  <cfset lines = listToArray(xml, chr(10)) />
  <cfset depth = 0 />
  <cfloop from="1" to="#arrayLen(lines)#" index="i">
    <cfset line = trim(lines[i]) />
    <cfset isCDATAStart = left(line, 9) EQ "<![CDATA[" />
    <cfset isCDATAEnd = right(line, 3) EQ "]]>" />
    <cfif NOT isCDATAStart AND NOT isCDATAEnd AND left(line, 1) EQ "<" AND right(line, 1) EQ ">">
      <cfset isEndTag = left(line, 2) EQ "</" />
      <cfset isSelfClose = right(line, 2) EQ "/>" />
      <cfif isEndTag>
        <!--- use max for safety against multi-line open tags --->
        <cfset depth = max(0, depth - 1) />
      </cfif>
      <cfset lines[i] = repeatString(indent, depth) & line />
      <cfif NOT isEndTag AND NOT isSelfClose>
        <cfset depth = depth + 1 />
      </cfif>
    <cfelseif isCDATAStart>
      <!---
      we don't indent CDATA ends, because that would change the
      content of the CDATA, which isn't desirable
      --->
      <cfset lines[i] = repeatString(indent, depth) & line />
    </cfif>
  </cfloop>
  <cfreturn arrayToList(lines, chr(10)) />
</cffunction>

There's nothing XML-ish about the implementation, as you can see, so you can happily feed non-XML tag based markup, as long as it uses '<' and '>' as tag delimiters in the XML fashion. Just don't expect to get good formatting if you don't have tags that follow the XML spec (e.g. CFELSE).  Also, it doesn't account for open tags (or closing tags) that are split across multiple lines. That wasn't a case I cared about, and I don't know that you can solve it correctly without actually parsing at least CDATA blocks out of the XML.

FlexChart Updates

The past month or so has seen quite a few improvements and bug fixes to FlexChart, though I haven't blogged about any of them.  Most notably, there was a weird NPE that manifested itself when loading a Pie chart via FlashVars.  For some unknown reason, Flex/Flash didn't give any indication the error was occurring, it just silently terminated the active call stack and continued on it's merry way.  This left the app in a quasi-broken state that would prevent certain future calls from working, but allowing others to execute without issue.  I still have no explanation as to why the error silently terminated, but I've since seen the same behaviour inside FDS, so it's not charting specific.

The ability to style charts has been extended a bit, though it's still not as highly polished as I could wish.  For example, supplying a stroke weight for a line series causes the stroke color to default to black, instead of the automatically assigned color (orange, green, blue, …).  In the reverse case, if you supply custom colors on a Pie chart, they render correctly, but the legend (if one is used) uses the default colors (orange, green, blue, …).  Gradient fills are now available as well.

I've also improved handling of empty charts.  The stock custom tag requires a descriptor, but if you don't have data at page render time, you usually end up providing "<chart />" as the descriptor.  The engine now detects this case (whether on the initial load or later passed in), and is a bit more intelligent about ensuring it clears it's stage.  Previously you could end up with an empty CartesianChart in some cases.

Finally, I made a number of improvements to performance and handling of data values.  This was mostly accomplished by explicitly converting the XML nodes into real objects for the chart to render, rather than using the XML directly.  There were some implicit type conversions that didn't happen consistently out of XML nodes, but work fine out of generic objects.

Data Mining With Weka

I've a large application that has a as a major component rank-based prioritization of assets. Users rank the assets on a one-to-five scale, and then that rank data is used to select other assets of interest for the user. If you've seen Amazon's "Recommended for you" or Netflix's recommended titles, you get the general idea.

The app was originally built back in 2004, and used a complex (and cumbersome, and slow) metadata-based algorithm. Each asset has a set of metadata facets specified. At prioritization time, an overall rank is computed for each facet for the given user, based on the rank of assets with the different facets. Unranked assets that have the high-ranked facets and not the low-ranked facets are given a high prioritization. If you've used Pandora, it's the same general idea, though I used far fewer facets. Overall, this algorithm worked quite well. I've tuned it over the years, but it's architecturally unchanged from the initial version.

However, this approach has one huge problem (aside from the complexity): it requires metadata. That metadata has to be populated by someone, and it's a thankless job. I tried a few different ways to make it easier for users to contribute, but never really hit on anything that worked well, so I ended up spending a bit of time every once and a while tagging stuff. As the asset and user counts increase, the workload only goes up, so not a scalable solution.

Which brings me to the topic of the post: data mining with Weka.

Data mining is basically digging through a crapton of low-level data to find higher-level information. Weka is a piece of software, written in Java, that provides an array of machine learning tools, many of which can be used for data mining.

In my particular case, I wanted to remove the metadata dependency of the prioritization algorithm, and rely strictly on rank data. It took a while to really wrap my head around what I wanted to do and what the data path actually looked like, but once I figured it out it was incredibly simple to implement.

In a nutshell, I create a relation (i.e. table, spreadsheet, grid) with rows representing assets and columns representing user. The intersection of each row/column (i.e. a cell) is the rank from that user for that asset. Obviously not every user has ranked every asset, but Weka happily deals with missing data (expressed as a question mark). Here's a partial data set (12 each of assets and users), in Weka's ARFF format:

@relation 'asset-ranks'
@attribute assetId string
@attribute u2 numeric
@attribute u5 numeric
@attribute u6 numeric
@attribute u7 numeric
@attribute u8 numeric
@attribute u9 numeric
@attribute u10 numeric
@attribute u12 numeric
@attribute u13 numeric
@attribute u18 numeric
@attribute u20 numeric
@attribute u21 numeric
@data
48,1,?,?,2,?,?,?,?,?,?,?,?
50,?,?,?,3,?,?,?,2,?,?,?,?
52,1,3,4,2,?,?,4,?,?,?,?,?
70,4,3,5,5,?,2,3,3,1,4,?,?
73,2,3,1,5,?,2,?,5,1,5,?,?
91,3,?,5,2,?,?,?,?,?,?,?,?
165,1,2,4,5,1,?,?,3,1,4,1,?
196,4,2,4,3,5,3,?,?,?,?,?,?
234,3,5,4,2,4,4,4,4,3,5,?,5
235,?,5,5,1,?,2,?,?,?,?,?,?
259,?,?,5,4,?,?,?,?,1,?,?,?
261,3,4,5,4,5,4,?,3,?,?,?,?

Running that through Weka's clustering engine breaks all the assets into clusters averaging 50 assets (my choice) in size, and appends a cluster identifier to each row in the data file. Here's command line I use:

java -classpath weka.jar \
  weka.filters.unsupervised.attribute.AddCluster \
  -i $srcFile \ # the data above
  -I 1 \
  -W "weka.clusterers.SimpleKMeans -N $clusterCount" \ # ceil(rows / 50)
  >& $destFile # the data above, with a 'cluster' attribute added

The clusters represent groups of assets that the ranks indicate are related. The assumption is that for a given users, all assets in a given cluster will be ranked similarly, and the data bears that out. How exactly Weka is doing that, I'm not sure - voodoo may be at play.

Anyway, I read the result into the database, setting up asset-cluster relationships, and then can prioritize the clusters based on their average rank by each user. Unranked assets from the highest-priority cluster should be the assets the user is most interested in.

This approach is not only much simpler, it's enormously faster, and it uses someone else's code (which is always a good thing). However, it's not without a significant problem of it's own: it can only prioritize ranked assets. I've addressed this by randomly mixing in an occasional random unranked asset to seed the pool. Time will tell if that approach works well or not; it's hard to estimate without any data.

With my trials, the two algorithms generally gave similar results. Not identical, of course, but similar. What's interesting is that the old algorithm computes an estimated rank for each unranked asset, while the latter just finds a collection of similar assets that the user indicated an interest in (via ranking some members of the collection). I'll probably look at some predictive stuff to add on top of the clustering to do actual per-asset rank predictions, but for now, it seems unneeded.

I'll be using Weka on some other projects, no question there. Like so much else, the hard part is figuring out how to express the question you want answered. Not technically so much as conceptually. Once you have that, implementation is straightforward.

Read-Only and Read-Write SVN Repositories

Just got a comment on one of my posts from a while back about public SVN access wondering how to get it configured.  The basic idea is to have a single repository with anonymous read-only access, and have the same repository allow read-write access to authenticated users.  Further, you want to configure that on a per-directory basis (with inheritance, of course), so you can have different areas require different principals, and allow some sections to require authentication even for read access.

So without further ado, here's the magic configuration bits.

<Location /svn/barneyb>
    DAV             svn
    SVNPath         /path/to/svnroot/barneyb

    AuthType        Basic
    AuthName        "Subversion/Trac"
    AuthUserFile    /path/to/apache/conf/htpasswd

    AuthzSVNAccessFile  /path/to/apache/conf/authz.conf

    Satisfy     any
    Require     valid-user
</Location>

In this case I'm just using Basic auth with an htpasswd file for authentication.  The magic line is the "AuthzSVNAccessFile" line, which defines the file to use for authorization.  Here's a snippet:

[/]
barneyb = rw

[/bicycle_dashboard]
* = r
barneyb = rw

The first section says that for the root of the repository (/), only barneyb (me) is allowed access, and I'm allowed to read and write.  The second section says that for the /bicycle_dashboard path, I'm still allowed to read and write, but anyone is allowed to read.

The gotcha is that explicitly specified directories do not inherit from their parents.  At each specified level, you must define the full auth spec.  Full details on the authorization file can be found in the Subversion Book.  That link is for the nightly, so if you've got an old version of Subversion, you might want to go grab and older version of the book as well.  The general Apache docs can be found here.

Barney and the Holy Grail

Yes, I know it's Lent, but I'm not talking about that Grail - I'm talking about Grails, the web framework for Groovy (the dynamic layer for Java).

I've been looking for a way to ditch CFML for a while now, but nothing has really hit the mark. I keep hoping that Adobe is going to release CF8.1 with Flex 3 and allow all your server-side scripting to be in ActionScript 3, but I'm not holding my breath. The CF platform is pretty nice (excepting the massive bloat), but the CFML language itself just blows ass. So I shop…

I liked SpringMVC/Spring/Hibernate when I was using it, but it's definitely enterprise-y. I've looked at Rails, but I really dislike Ruby's syntax, and ActiveRecord leaves something to be desired (like a query language). Django and Pylons (for Python) are slick seeming, but Python's whitespace-is-semantic paradigm is a big turnoff. My CFRhino "project" actually has some promise, I think, but not something I want to get into maintaining for real application development, and it's JavaScript, which means no real classes and classloading (among other things).

Enter Grails. Like it's counterparts in other languages, it's all about speed of development, but unlike the others, it's all Java. It's built on Hibernate, Spring, and SpringMVC. It compiles to bytecode. It seamlessly integrates with existing Java apps and tools (Quartz, Acegi, DWR, etc.). The paradigms it leverages were incredibly easy to jump into (granted, I've used it's backing tools before), and within a couple hours of seeing both Grails and Groovy for the first time, I had a quasi-functional blog platform.

The past week of developing a first "real" app has borne out my initial impressions to the nth degree. Groovy is a really slick layer that augments Java in just the right ways, without getting the way. Closure are a big deal, vastly superior to simple functions or Java's inner classes. You also get dynamic methods, iterators, native regular expressions, multi-line strings, ranges, a ternary operator (c'mon CF, it's not the 60's), the "elvis" operator which is the ternary operator with the middle clause omitted and the first clause used as it's value, and null-safe dot operator.

But Grails is where it's at.

class User {
  String username
  String password
  String email

  static constraints = {
    username(blank: false, unique: true, size: 4..16)
    password(blank: false, password: true, size: 4..32)
    email(email: true, blank: false)
  }
}

That whopping 11 lines of code is sufficient to create a domain entity with three fields, create validation for the fields, create a Hibernate mapping, and set up all the SpringMVC goodies for binding, form generation/population, error messages, etc.

Request processing, you ask? You have a collection of controllers (think circuits, if you know Fusebox), each of which is a collection of actions (i.e. named closures) corresponding to requests (think fuseactions). So the URL "/login/doLogout" would map to the "doLogout" closure of the "login" controller. Each action closer can redirect, render directly, or return an arbitrary data structre to be passed to a view. By default, the view is a file named to match the action, in a folder to match the controller (you can override, of course), which is written in Groovy Server Pages (GSP). Think JSP, except without the heavy dose of suck. For example, you can define taglibs using Groovy, and then use them in both your views and your controllers, as either tags or as functions.

Grails (thanks to Spring) also provides all-encompassing dependency injection for not only your services and controllers, but for your domain objects as well, which is fantastic. That was one of the larger problems I had with using Spring and Hibernate directly, and one that I never was able to solve in a very satisfactory way. It also provides controller filters which are exceptionally handy (enforcing logins, adding model data for your layouts, etc.).

And the last big piece is SiteMesh, which is the [optional] layout toolkit. You can certainly use GSP for everything; simply include your header/footer as needed and go. But SiteMesh provides basic page wrapping functionality to ease the task, along with a very powerful content dissection/assembly framework (think Fusebox's contentvariables, except managed by the view as they should be).

Finally, yes, there is pretty comprehensive scaffolding support. It doesn't blow my skirt up as much as the other stuff though, but that's probably more a reflection on the type of apps I'm usually building rather than the general utility of the feature.

In case you can't tell, I'm impressed. The Grails guys have done a fantastic job at taking some incredibly flexible and powerful tools and making them ridiculously easy to use.

And Again!

Another update to FlexChart this evening, providing set and series colors (both fills and strokes), an option for including/excluding the legend when exporting a chart to PNG, and a few new examples showcasing the features (including some really ugly developer art).

With the coloring support, the 'Grouped Series' example makes a lot more visual sense, so if you thought WTF yesterday, go look at it again.  ;)

More FlexCharts Goodness

Another batch of changes to FlexChart.

First is grouped series, which is hard to explain, but easy to understand.  Go hit the demo and select "Grouped Series" from the dropdown.  The chart layout has always been possible, but the legend just displayed a flat series list with no awareness of the groupings.  To see the old behaviour, change the 'legendStyle' attribute of the chart tag to "simple", and hit "Update Chart".

Second is direct support for PNG exports of the chart.  A Base64-encoded PNG can be requested via an ExternalInterface callback which you can then do whatever you want with.  The demo app just sends it server side (in a new window) for deserialization with CFIMAGE and immediately sends it back to the client.  Hardly interesting, but definitely illustrative.

Finally, I've abandoned the compile-in-page model, Flex SDK 2.0.1, and a single file architecture completely.  You can drop the binary anywhere you want without any dependency on FlexBuilder, a Flex SDK, or any server component, which is nice.  Of course, if you want to customize, it required FlexBuilder 3.  I think that's the right tradeoff.  I've also started splitting up the engine for easier maintenance, now that there's no need to keep it single file (which there was when it was server-compiled).  I haven't done much, but that'll come.  Definitely make it easier to work with.

Soon to come down the pipe are custom colors for series and sets, support for multiple y-axes, and a grid legend style which is basically a headerless datagrid under the chart with a row per series and a column per x-axis item, all lined up evenly.

FlexChart Update

Long time no blog…  I've updated my FlexChart component slightly, as well as repackaged it for easier consumption.  The new feature is the availability of a 'dataTipFunction' attribute on the root 'chart' element, which will be called to format data tips.  It gets passed an object with various keys about the backing chart item.  Since that's all it gets (rather than a HitData instance), it's not as powerful as the native Flex capabilities, but it gets the job done with pure JS.

The repackaging was to a binary (precompiled) format, rather than requiring the server compilation I'd originally built it with.  The old style is still available where it always has been, but the preferred usage is now either the distribution directory (svn) or the FlexBuilder project itself.  With the new model, you just drop the compiled binaries (yours or mine) into any web-accessible directory and use them, regardless of whether you have LCDS installed to do the compilation.

Checkbox Range Selection (a la GMail)

If any of you use GMail, you'll know that you can shift click the checkboxes on the conversation list to select a range of conversations (i.e. click the second conversation's checkbox and then shift-click the tenth conversation's checkbox). You can also deselect the same way (click the seventh, and then shift-click the fourth). Finally, you can shift-click a second time (third click total) to extend the range. I wanted that functionality in one of my apps, and here it is.

Update: I've repackaged the code as a jQuery plugin based on Dan Switzer's suggestion. I've left the original code in place, just struck it out.

Update: I've added namespacing of the event handler, and support for the meta key (Command on Mac) as well as shift based on Henrik's comments below.

function configureCheckboxesForRangeSelection(spec) {
  var lastCheckbox = null;
  jQuery(function($) { // for Prototype protection
    var $spec = $(spec);
    $spec.bind("click", function(e) {
      if (lastCheckbox != null && e.shiftKey) {
        $spec.slice(
          Math.min($spec.index(lastCheckbox), $spec.index(e.target)),
          Math.max($spec.index(lastCheckbox), $spec.index(e.target)) + 1
        ).attr({checked: e.target.checked ? "checked" : ""});
      }
      lastCheckbox = e.target;
    });
  }); // for Prototype protection
};

(function($) {
  $.fn.enableCheckboxRangeSelection = function() {
    var lastCheckbox = null;
    var $spec = this;
    $spec.unbind("click.checkboxrange");
    $spec.bind("click.checkboxrange", function(e) {
      if (lastCheckbox != null && (e.shiftKey || e.metaKey)) {
        $spec.slice(
          Math.min($spec.index(lastCheckbox), $spec.index(e.target)),
          Math.max($spec.index(lastCheckbox), $spec.index(e.target)) + 1
        ).attr({checked: e.target.checked ? "checked" : ""});
      }
      lastCheckbox = e.target;
    });
  };
})(jQuery);

It requires jQuery to be available (I used 1.2.1 and 1.2.3), and is safe to use with Prototype also in-scope (regardless of which owns the $ function). Call that function passing in a jQuery expression (NOT a jQuery object) that describes the checkboxes you want to be range-selectable:

configureCheckboxesForRangeSelection("input.image-checkbox");
$("input.image-checkbox").enableCheckboxRangeSelection();

Thanks to Matt Wood (a coworker) for the slice/index suggestion. My initial implementation had used an each with a conditional and a status variable - definitely less elegant.  Check the project page for any additional updates.

Report/Query DSLs Update

I've posted a simple demo app for both DSLs that you can play with. It's included in the distribution as index.cfm, so you'll get it if you pull down the source from SVN. I've also created a readme.txt file in the distribution with the text of my intro blog post.

I also made a very minor (though backwards incompatible) update to my Query DSL implementation. There is a convertToSimpleCriterion method, designed to be overridden by subclasses if necessary. The initial implementation contained the logic for processing negated terms (those prefixed with '-' or '!'), and for splitting terms into predicate/value pairs. Since that's part of the DSL implementation's job, that needed to b done by the internal methods, so the behaviour would be inherited by any subclasses. I've changed the convertToSimpleCriterion method to be passed all three parts individually, so the parsing can live inside the core of the implementation.