Groovy DataSources for Railo

If you've ever wanted to do raw JDBC stuff in your ColdFusion applications, you probably know that you can get a javax.sql.DataSource via this code:

createObject("java", "coldfusion.server.ServiceFactory")
  .dataSourceService.getDatasource("myDSN")

Unfortunately, this doesn't work on Railo, because the result isn't actually a javax.sql.DataSource, it just looks like one (see RAILO-43).  To put that another way, it's duck typed.

Fortunately, Groovy's "fake" interfaces make beating the typechecks fairly trivial (if kind of ugly):

<cfset variables.my.datasource = createObject("java", "coldfusion.server.ServiceFactory")
  .dataSourceService.getDatasource(my.dsn) />
<g:script>
  def rds = variables.my.datasource
  variables.my.datasource = {
    rds.getConnection()
  } as javax.sql.DataSource
</g:script>

The CFSET gets the "datasource" as normal, and then the little Groovy scriptlet simply wraps it with a closure and tells it to pretend to be a javax.sql.DataSource.  In this case we're assuming that the only method on DataSource that will be invoked is getConnection(), which is the usual case.  If you need to support other methods you'd need to go the Map-as-interface route.

Now we can use variables.my.datasource as a javax.sql.DataSource.  Yay!  Unfortunately, you have to use this sort of snippet everywhere you need a DataSource in your Railo apps, but hopefully that'll get fixed.

For more info on Groovy interfaces, the docs are at http://groovy.codehaus.org/Groovy+way+to+implement+interfaces.

CFGroovy at the Tulsa CFUG on May 25th

Next Tuesday (May 25th), I will be presenting CFGroovy to the Tulsa CFUG via Connect.  Details are at http://www.tulsacfug.org/meetings.cfm, but the important bit is that it's at 12:30 central time, not in the evening.  Steve Bryant – who manages the group – graciously extended a general invitation to anyone else who would like to join via Connect.  I'll post again with more details as we get closer to the event.

Do You Know These Shoes?

If you can identify this pair of shoes, I'll buy you a drink of your choice next time we meet.  If you're wearing a pair of them, I'll buy you two.

Red Shoes

GalConvict

This morning, in response to my Processing and Galcon post, I received this email:

Hey,

Stumbled across your article about using processing :)  Sounds neat!

Anyway, I don't like having to make these requests – but because of how
trademark stuff works I have to request that you change the name of your
project to something that doesn't contain the name Galcon.  Maybe
"PlanetConvict" or something like that.

Anyway, I'd really appreciate if you could do that :)

Thanks!
-Phil

As such, as soon as I can find a way to delete the GitHub repo – if you know how, please share – the code will vanish from the public view.  If you're interested in the bits, get 'em quick, or just email me.

Wheeee………….

Processing and Galcon

I did a little experiment last night using Processing, which is a Java-based visual programming environment that I've repeatedly run into in various different contexts, but had never really done anything with.  I've become completely addicted to Galcon Lite on my iPhone, and figured it was a "sample" to build with Processing.  Note that I was not attempting to replicate the look and feel of the game, just the mechanics.  If you don't know Galcon, here's a screenshot:

The basic idea is that you take over planets (which produce more ships), and try to exterminate your opponent (which I'm moments away from doing).  The green planets are mine, the grey planets are neutral, and the orange is my opponent.  Here's a similar just-about-to-win screencapture from my version of the game:

Rather than have masses of ship icons, I grouped them into fleets (the hollow circles) with the number of ships indicated.  Notice that I also don't make my fleets go around other planets, as evidenced by the "26″ fleet going through the "19″ planet right in the middle.  Gameplay, however, is basically identical to the original.

I made the code available on GitHub: http://github.com/barneyb/GalConvict. It's hardly the Mona Lisa.  I made exactly zero effort to architect, refactor, etc.  It was just a proof of concept for playing with Processing. [2010-05-17: This has been retracted - read why]

Embrace Your [HS]QL

I'd like to start with a confession.  I love SQL (and it's Hibernate counterpart HQL).  Its simplicity and elegance is remarkable.  Its expressive power is stunning.  It is almost universally disliked by developers.  Why is that?

The answer, at least to me, is pretty straightforward.  Developers spend a lot of their time programming, so that's where their strengths lie.  Most programming is done with general-purpose languages (GPLs – not to be confused with the Gnu Public License), which are generic, Turing-complete environments.  They a real-world-like typing system (numbers, objects, lists, etc.), and often mutable data structures.  Developers get very good at using such languages to get stuff done.  This isn't a revelation.

SQL, on the other hand, is a domain-specific language (a DSL).  It is not Turing-complete, uses set-based data structures (which are immutable – more on this later), and just generally does stuff in a totally different way than a GPL.  This can create friction for developers, because they have to change mindsets when shifting between their general programming tasks (with a GPL) and their querying tasks (with SQL).

This bit of friction, however, is a source of great power if you're willing to invest a little in the SQL way of doing things.

The Bad Stuff

I'm not going to deny that writing out big-ass INSERT and UPDATE statements is a bear.  CRUD in general – SQL or otherwise – is simply an unpleasant task that should be avoided at all costs.  Fortunately there exist myriad tools and frameworks to alleviate with this problem (Hibernate being a shining example).  For the purposes of this discussion I'm only going to be talking about data recall (i.e., SELECT statements).  The C_UD part of CRUD is a different discussion, and hopefully you're using some "magic" way of dealing with that stuff.

The thing that is interesting to note, however, is that all the toolkits that I've seen to isolate you from the C_UD part of CRUD still provide a query language for the _R__ part.  So clearly a query language (SQL or otherwise) is an important part of the way things work, even if the whole point of a toolkit is to isolate you from SQL itself.

The Good Stuff

First, the "S" in "SQL" stands for "Structured".  Queries using SQL should be structured into little atomic pieces (subqueries) that are easily understandable and then assembled in to larger structures (with no semantic change to the composed bits).  Think of SQL as a box of LEGOs where you build up larger and larger structure from incredibly simple base components.  Here's an example of what I mean, working through it first with HQL, and then with SQL.

HQL

Consider this HQL example (from the cf-orm-dev mailing list):

select distinct p
from Post p
  left join p.categories c
where c.name like 'top%'

This selects the distinct Posts that have a category which starts with "top".  Straightforward.  But not really a very good use of the structured nature of HQL, because we're really asking two nested questions.  The first is the obvious – which Posts – but the second is more subtle: which Categories start with "top".  Consider this version:

from Post p
where exists (
  from p.categories
  where name like 'top%'
)

This is better; now we have a representation of both questions we're asking, as well as the "nested" relationship between them.  But this is still a little wonky because we're doing a dependent subquery when the nested question (which Categories start with "top") is really completely isolated from any association with Posts.  So consider this one:

from Post p
where any elements(p.categories) in (
    from Category
    where name like 'top%'
  )

Now we have the inner question totally isolated from the outer question, and the association between the two questions is expressed via the "any elements in" clause.  Very expressive, very modular, very structured.  And unfortunately, very unsupported by RDBMSs.  This is a valid query construct, but I don't know of any databases that actually support this ANY..IN construct (I tried MS SQL, MySQL, and Postgres).  So you can't actually use it.  Of course, a better HQL and/or SQL preprocessor could transform it, but they don't do that either.  In my opinion this is really Hibernate's problem to fix; it should generate different SQL for this construct.

Unfortunately, there's an … uh, "un-silver" … lining.  In this particular case, most DBs will execute the JOIN more quickly than the subquery, because the subquery is correlated (i.e. must be executed once per row in the outer query).  Aside from the semantic separation, that's another reason that the third (unsupported) query is better.  Since the subquery in the third rev is not correlated it only needs to be run once and the same result can be used to filter all the Post rows.

SQL

Here's the SQL equivalent of the first HQL query:

select distinct p.*
from Post p
  left join link_post_category l on p.id = l.post_id
  left join Category c on l.category_id = c.id
where c.name like 'top%'

This illustrates an interesting fact that the HQL doesn't show as clearly: in order to do the DISTINCT, the SQL engine must do a comparason on every column of every row.  If you're pulling back a few BLOB or CLOB fields this can be prohibitively expensive.  But if your column width is small, this is still fast because it suits the optimizer.  Now here's the correlated subquery version:

select *
from Post p
where exists (
    select *
    from link_post_category l
      left join Category c on l.category_id = c.id
    where l.post_id = p.id
      and c.name like 'top%'
  )

Now we've got rid of that "every column of every row" problem, but replaced it with a correlated subquery.  So lets ditch that next:

select *
from Post p
where id in (
    select l.post_id
    from link_post_category l
      left join Category c on l.category_id = c.id
    where c.name like 'top%'
  )

Now the subquery is pulling back all the post_ids we want to retrieve (in one step), and the outer query is just doing a primary key lookup to grab those rows.  This is going to be crazy fast, but we still have that JOIN in there which makes the subquery kind of nasty.  One more revision:

select *
from Post p
where id in (
    select post_id
    from link_post_category
    where category_id in (
        select id
        from Category
        where name like 'top%'
      )
  )

With a pair of nested subqueries we're down to that LEGO-like ideal.  Starting at the innermost piece, we pull all the Categories that start with "top".  Then we find all the post_ids that are assigned those Categories.  Finally we pull all the Posts with those post_ids.  With MS SQL's query optimizer this will have the same performance characteristics as the prior version (i.e., crazy fast), but MySQL will unfortunately struggle a little bit.

Remember that final HQL query that won't actually execute on any DB I found?  Here's the SQL for it:

select *
from Post p
where any (
    select category_id
    from link_post_category
    where post_id = p.id
  ) in (
    select id
    from Category
    where name like 'top%'
  )

This is roughly equivalent to the second SQL version, but since it won't run, it's academic.

The SQL Way

If you're going to use SQL effectively, you have to think about SQL in it's terms, it's domain.  As a DSL, it's not designed to be a GPL.  It's designed to be used from a set theory perspective and expressed in a very structured fashion.  If you try to use it like a GPL you'll be able to get the job done, but you'll also get frustrated and build some really nasty queries that will be hard to maintain/reuse.  And then you'll hate SQL, and then you'll try to avoid it (and other query languages) at all cost, and really be missing out.

As an example, I challenge you to dig into the Hibernate Criteria API, which is a programmatic way of querying (i.e., an API for a GPL).  Try and create a Criteria query that expresses this question "what is the average age of males with either zero or two or more children" in a way that is even close to readable and concise.  Here is the HQL version (which is both readable and concise):

select avg(age)
from Person
where sex = 'male'
  and (children.size >= 2
    or children.size = 0
  )

As way of a hint, you'll need a few Restrictions instances along with a Projections instance, plus some glue bits.

Immutability

I mentioned the immutability of SQL data structures way back at the beginning, and now I'll follow through.  Obviously state can be changed via UPDATE, INSERT, and DELETE statements, but that's not what I mean.  When you're running a query you have the ability to create temporary data structures (typically subquery results).  They are immutable.  Even if you're using a "programming" SQL extension (e.g., T-SQL or PL/SQL), state is still typically immutable.  You can change variable references, but you can't go changing the objects they reference.

For example, if you have a rowset and want to add a new row, how do you do it?  You don't add a new row, you select the UNION of the current rowset and a new (very likely ad hoc) rowset containing the row to add.  This is a very different paradigm than most GPLs use, where state is typically mutable, at least for reference types.  Consider a Java List.  Calling .add(o) doesn't create a new List with the new item added to it, it just adds it to the existing list.

Functional languages are an exception, as many (most?) embrace the concept of immutable state (or zero side effect) which removes any need to consider concurrency, allows programs' behaviour to be "proven", etc.  In a zero side effect environment, calling .add(o) would create a new List with the new item added and return the new list (the original list would be discarded).  It's definitely a different approach to doing things from what most people are used to, but provides some incredibly powerful capabilities.

Stop Already

Ok, ok.  That ended up being about ten times longer than I intended.  I'm sorry.  But if you got this far you've hopefully grown a bit of an appreciation for SQL, or at least the knowledge that there might be something there to appreciate.  And do keep in mind that the examples I've given are all really simple.  The power of the "Structured" is sort of nice with queries just a few lines long, but it really shines when you're dealing with queries that stretch over hundreds of lines, especially when they're dynamically constructed.  That incredibly granular breakdown of functionality allows you to reuse bits and pieces very effectively, and the zero side effect nature means that you are guaranteed not to have some random side effect ruin your day.

Should It Just Work?

Last Wednesday evening Sean Corfield and I were chatting over beers and he caught me by surprise with a statement he made.  We were talking about the ORM functionality in CF9, and he liked how it just worked.  You invoke a setter, and it gets to the database with nothing else.  I agree in theory, but not in practice.

It had honestly never occurred to me that someone would actually want to have stuff get to the database without any control over it.  Transactionality is such an ingrained part of my being that the mere thought of writing to the database outside an explicit transaction never even entered my mind.

There have been a number of discussions about transactions on the cf-orm-dev mailing list.  Often prompted by someone mentioning some problem and me jumping down their throats about how they better use transactions.  But here's a very experienced developer who I know full well intimately understands transactionality actually saying with complete honesty that he likes to have it just work.

I'm personally against this approach: if something goes awry your database could well get jacked, and no one wants that.  Though I nearly always apply transactionals via AOP, and both Sean and I agreed that assuming AOP familiarity is unreasonable for people using CF9 ORM.

The conversation quickly moved on from there (I believe Scala was the next topic), but that brief moment of complete surprise stuck in my mind.  I certainly don't claim to know it all, but I know that I have opinions that are not shared by everyone.  However, the dismissal of ACID like that was a shock.  I respect Sean's opinion very much, and while I will continue to wrap all my saves in a transaction, things are clearly not as cut and dried as I think they are.

Polyglot Programming at cf.objective()

This afternoon I presented on Polyglot Programming at cf.objective() 2010.  Unlike most presentations I give, this one has almost no code, so the slidedeck (as a PDF) is the whole shebang.  The in-deck content is admittedly light; really just an outline to follow along as I talked.  The short version of the verbal part is:

Using multiple languages has a bit of a learning curve but it pays off, and more quickly than you think.  Language selection and design is a vital aspect to being a successful developer, both for individual projects and as part of your continuing career.

I'll probably give the presentation again on CFMeetup at some point this year, and maybe at a user group or two, so if you missed it all is not lost.  Unfortunately (or fortunately), cf.objective() is so content-rich that it's hard to get to every session you want, so if you missed it you can probably get a second chance.

Domain Model Integrity

Unlike my last several posts, this one isn't ORM related.  At least not directly.  If you're using ORM, you necessarily care about your domain model's integrity, as it's a prerequisite for ORM doing it's job, but it has nothing to do with ORM specifically.  The point of a domain model is to be a representation of your business rules and logic, and that means it needs to be internally consistent.

If you're building a SQL-heavy, procedural application, the database is probably the only place your domain model is represented.  But if you're building an object oriented application, your domain model will also be represented in memory as object graphs.  In almost all cases, a given object graph is only a small slice of your entire domain model, but it is a representation and must be kept consistent.

Here's an example of a very simple domain model consisting of Person and Pet classes, where a Pet has an owner (a Person), and a Person has a collection of Pets:

component Person {
  property name="name" type="string";
  property name="pets" type="array[Pet]";
}

component Pet {
  property name="species" type="string";
  property name="owner" type="Person";
}

Just to reiterate, these are NOT persistent types.  They're simple types for in-memory use only.

So what semantics does this model imply?  Or to rephrase, what invariants does this model carry?  The most important semantic is that the relationship between pets and their owners is expressed from both sides (both classes).  More explicitly stated, the domain model is structured such that if you have a Person you can get their Pets, and if you have a Pet, you can get their owner (a Person).  The implications of this is expressed in these two invariants:

assert pet.owner is null || pet.owner.pets contains pet
assert person.pets.every { pet.owner == person }

The first one states that if a Pet has an owner, that owner's "pets" collection must contain it.  The second one states that every Pet in a Person's "pets" must have that Person set as it's owner.  I'm trying to be really deliberate in spelling this out, because it's really important.

Just for a moment, let's take a detour to the relational database world.  If we were to express this domain model in the database we'd have a Person table and a Pet table, and the Pet table would have a foreign key (likely named 'owner_id') that references the Person table's primary key.  SQL allows us to traverse the relationship expressed by that foreign key in either direction, so both relationships (Person->Pet and Pet->Person) are expressed in a single place (the foreign key column).  Both directions are represented together.  A foreign key constraint (which all RDBMSes support), on the column is doing nothing more than instructing the database to enforce these invariants.  This is all second nature, and we don't even think about it when we use a database to represent our domain model.

Now back to the in-memory representation.  We still need to enforce these invariants, but in memory we have to deal with references (pointers), and references only point in one direction.  That's why we have to have both the 'pets' property (a Person's references to Pets) and the 'owner' property (a Pet's reference to a Person), but in the database we only need one foreign key column (Pet.owner_id).  The relationship between Person and Pet objects is actually expressed in a pair of references.

The problem with this arrangement is that you, in effect, double represent your relationships.  Both invariants must remain true, and since each invariant is represented by it's own reference in the model, you have to synchronize changes to those references.   When you set the owner of a Pet, you must also add that Pet to the owner's "pets" collection.  When you remove a Pet from a Person's "pets" collection, you must also remove the Pet's owner reference.  If you don't keep these in sync, one of your invariants will be false, and that means your domain model is in an invalid/inconsistent state.

When your domain model is in an invalid state, your application falls apart.  Every assumption you make in your application is suddenly unreliable, because they're predicated on your business rules, and your business rules are expressed through your domain model.  An invalid domain model means your business rules were violated, and anything you do from this point forward will be suspect.  I'm going to say it again: this is really important.  If your domain model is in an invalid state, your application has failed.  Period.  End of story.  Your only recourse is to revert it back to it's last known consistent state and throw away all pending operations.

What about relationships that are not bi-directional?  For example, perhaps your model has PrivateEye and Subject types.  Clearly the PrivateEye needs to know about his Subject, but it'd be kind of silly if the Subject knew about the PrivateEye.  In this case the relationship only moves one way, so there is only one reference (from PrivateEye->Subject), and there is no invariant.  When we put this in the database, however, we have exactly the same structure as the bi-directional Person<->Pet relationship: a foreign key that can be traversed in two directions.  With the database representation of the model you can't express the concept of a one-directional relationship.  This is a powerful differentiator for in-memory models, since it gives you much finer control over the semantics of your model than a database could ever provide.

So where does this relate to ORM?  Just like everything else in your application, Hibernate depends on your invariants being true in order to persist your model to the database.  If they're not true, Hibernate can't hope to do it's job correctly.  A huge number of "problems" that people starting out with Hibernate face have nothing to do with Hibernate itself, but rather are caused by an invalid in-memory domain model.  Coming from the world of procedural, SQL-based persistence, you don't necessarily have to worry about an in-memory domain model's integrity, which means that you can write what amount to buggy applications where the bugs never manifest themselves.

Bottom line is that if you're using an in-memory domain model, you simply must ensure the invariants of that model remain true.  More specifically, you must set both sides of every bi-directional relationship.  If you don't, you're just asking for punishment, both from your software tooling and from users/clients of your application.

Want Multiple Persistence Contexts in CF9 ORM?

I do.  Because god knows front controllers and persistence layers don't have a one-to-one correspondence.  Turns out that through the magic of application switching you can do it, as long as you're careful with your sessions.  Not going to go into details right now, but this code works (and does what you'd expect):

<cfapplication name="ormtest1" />
<cfset companies = ormExecuteQuery('from Company') />
<cfinclude template="../app1/dsp_home.cfm" />

<cfset ormCloseSession() />

<cfapplication name="ormtest2" />
<cfset people = ormExecuteQuery('from Person') />
<cfinclude template="../app2/dsp_home.cfm" />

Here's the Application.cfc that this snippet runs under:

component {
 this.name = "ormtest3";
}

No ORM awareness at all.

The primary gotcha is that the other applications (ormtest1 and ormtest2) have to have been previously initialized (which you can do with a simple HTTP GET), but this allows you to mix multiple persistence contexts together (much like you can with normal Hibernate), and beat the stupid Application.cfc binding to some degree.  You still can only exist in a single context at any given moment, but you can switch back and forth any place you can close/reopen a session.

And for those of you who are thinking ahead, no, you can't manually manage your session objects instead of letting CF.  You can get a cross-application reference to the SessionFactory without much issue, and you can create a session, and you can even query for entities, but when you get to traversing lazy-loaded relationships it bombs out for some reason.  CF needs to have control over the active session or it doesn't work.  Rather sad, but at least running multi-persistence-context apps is possible, even if not concurrently.