Embrace Your [HS]QL

I'd like to start with a confession.  I love SQL (and it's Hibernate counterpart HQL).  Its simplicity and elegance is remarkable.  Its expressive power is stunning.  It is almost universally disliked by developers.  Why is that?

The answer, at least to me, is pretty straightforward.  Developers spend a lot of their time programming, so that's where their strengths lie.  Most programming is done with general-purpose languages (GPLs – not to be confused with the Gnu Public License), which are generic, Turing-complete environments.  They a real-world-like typing system (numbers, objects, lists, etc.), and often mutable data structures.  Developers get very good at using such languages to get stuff done.  This isn't a revelation.

SQL, on the other hand, is a domain-specific language (a DSL).  It is not Turing-complete, uses set-based data structures (which are immutable – more on this later), and just generally does stuff in a totally different way than a GPL.  This can create friction for developers, because they have to change mindsets when shifting between their general programming tasks (with a GPL) and their querying tasks (with SQL).

This bit of friction, however, is a source of great power if you're willing to invest a little in the SQL way of doing things.

The Bad Stuff

I'm not going to deny that writing out big-ass INSERT and UPDATE statements is a bear.  CRUD in general – SQL or otherwise – is simply an unpleasant task that should be avoided at all costs.  Fortunately there exist myriad tools and frameworks to alleviate with this problem (Hibernate being a shining example).  For the purposes of this discussion I'm only going to be talking about data recall (i.e., SELECT statements).  The C_UD part of CRUD is a different discussion, and hopefully you're using some "magic" way of dealing with that stuff.

The thing that is interesting to note, however, is that all the toolkits that I've seen to isolate you from the C_UD part of CRUD still provide a query language for the _R__ part.  So clearly a query language (SQL or otherwise) is an important part of the way things work, even if the whole point of a toolkit is to isolate you from SQL itself.

The Good Stuff

First, the "S" in "SQL" stands for "Structured".  Queries using SQL should be structured into little atomic pieces (subqueries) that are easily understandable and then assembled in to larger structures (with no semantic change to the composed bits).  Think of SQL as a box of LEGOs where you build up larger and larger structure from incredibly simple base components.  Here's an example of what I mean, working through it first with HQL, and then with SQL.

HQL

Consider this HQL example (from the cf-orm-dev mailing list):

select distinct p
from Post p
  left join p.categories c
where c.name like 'top%'

This selects the distinct Posts that have a category which starts with "top".  Straightforward.  But not really a very good use of the structured nature of HQL, because we're really asking two nested questions.  The first is the obvious – which Posts – but the second is more subtle: which Categories start with "top".  Consider this version:

from Post p
where exists (
  from p.categories
  where name like 'top%'
)

This is better; now we have a representation of both questions we're asking, as well as the "nested" relationship between them.  But this is still a little wonky because we're doing a dependent subquery when the nested question (which Categories start with "top") is really completely isolated from any association with Posts.  So consider this one:

from Post p
where any elements(p.categories) in (
    from Category
    where name like 'top%'
  )

Now we have the inner question totally isolated from the outer question, and the association between the two questions is expressed via the "any elements in" clause.  Very expressive, very modular, very structured.  And unfortunately, very unsupported by RDBMSs.  This is a valid query construct, but I don't know of any databases that actually support this ANY..IN construct (I tried MS SQL, MySQL, and Postgres).  So you can't actually use it.  Of course, a better HQL and/or SQL preprocessor could transform it, but they don't do that either.  In my opinion this is really Hibernate's problem to fix; it should generate different SQL for this construct.

Unfortunately, there's an … uh, "un-silver" … lining.  In this particular case, most DBs will execute the JOIN more quickly than the subquery, because the subquery is correlated (i.e. must be executed once per row in the outer query).  Aside from the semantic separation, that's another reason that the third (unsupported) query is better.  Since the subquery in the third rev is not correlated it only needs to be run once and the same result can be used to filter all the Post rows.

SQL

Here's the SQL equivalent of the first HQL query:

select distinct p.*
from Post p
  left join link_post_category l on p.id = l.post_id
  left join Category c on l.category_id = c.id
where c.name like 'top%'

This illustrates an interesting fact that the HQL doesn't show as clearly: in order to do the DISTINCT, the SQL engine must do a comparason on every column of every row.  If you're pulling back a few BLOB or CLOB fields this can be prohibitively expensive.  But if your column width is small, this is still fast because it suits the optimizer.  Now here's the correlated subquery version:

select *
from Post p
where exists (
    select *
    from link_post_category l
      left join Category c on l.category_id = c.id
    where l.post_id = p.id
      and c.name like 'top%'
  )

Now we've got rid of that "every column of every row" problem, but replaced it with a correlated subquery.  So lets ditch that next:

select *
from Post p
where id in (
    select l.post_id
    from link_post_category l
      left join Category c on l.category_id = c.id
    where c.name like 'top%'
  )

Now the subquery is pulling back all the post_ids we want to retrieve (in one step), and the outer query is just doing a primary key lookup to grab those rows.  This is going to be crazy fast, but we still have that JOIN in there which makes the subquery kind of nasty.  One more revision:

select *
from Post p
where id in (
    select post_id
    from link_post_category
    where category_id in (
        select id
        from Category
        where name like 'top%'
      )
  )

With a pair of nested subqueries we're down to that LEGO-like ideal.  Starting at the innermost piece, we pull all the Categories that start with "top".  Then we find all the post_ids that are assigned those Categories.  Finally we pull all the Posts with those post_ids.  With MS SQL's query optimizer this will have the same performance characteristics as the prior version (i.e., crazy fast), but MySQL will unfortunately struggle a little bit.

Remember that final HQL query that won't actually execute on any DB I found?  Here's the SQL for it:

select *
from Post p
where any (
    select category_id
    from link_post_category
    where post_id = p.id
  ) in (
    select id
    from Category
    where name like 'top%'
  )

This is roughly equivalent to the second SQL version, but since it won't run, it's academic.

The SQL Way

If you're going to use SQL effectively, you have to think about SQL in it's terms, it's domain.  As a DSL, it's not designed to be a GPL.  It's designed to be used from a set theory perspective and expressed in a very structured fashion.  If you try to use it like a GPL you'll be able to get the job done, but you'll also get frustrated and build some really nasty queries that will be hard to maintain/reuse.  And then you'll hate SQL, and then you'll try to avoid it (and other query languages) at all cost, and really be missing out.

As an example, I challenge you to dig into the Hibernate Criteria API, which is a programmatic way of querying (i.e., an API for a GPL).  Try and create a Criteria query that expresses this question "what is the average age of males with either zero or two or more children" in a way that is even close to readable and concise.  Here is the HQL version (which is both readable and concise):

select avg(age)
from Person
where sex = 'male'
  and (children.size >= 2
    or children.size = 0
  )

As way of a hint, you'll need a few Restrictions instances along with a Projections instance, plus some glue bits.

Immutability

I mentioned the immutability of SQL data structures way back at the beginning, and now I'll follow through.  Obviously state can be changed via UPDATE, INSERT, and DELETE statements, but that's not what I mean.  When you're running a query you have the ability to create temporary data structures (typically subquery results).  They are immutable.  Even if you're using a "programming" SQL extension (e.g., T-SQL or PL/SQL), state is still typically immutable.  You can change variable references, but you can't go changing the objects they reference.

For example, if you have a rowset and want to add a new row, how do you do it?  You don't add a new row, you select the UNION of the current rowset and a new (very likely ad hoc) rowset containing the row to add.  This is a very different paradigm than most GPLs use, where state is typically mutable, at least for reference types.  Consider a Java List.  Calling .add(o) doesn't create a new List with the new item added to it, it just adds it to the existing list.

Functional languages are an exception, as many (most?) embrace the concept of immutable state (or zero side effect) which removes any need to consider concurrency, allows programs' behaviour to be "proven", etc.  In a zero side effect environment, calling .add(o) would create a new List with the new item added and return the new list (the original list would be discarded).  It's definitely a different approach to doing things from what most people are used to, but provides some incredibly powerful capabilities.

Stop Already

Ok, ok.  That ended up being about ten times longer than I intended.  I'm sorry.  But if you got this far you've hopefully grown a bit of an appreciation for SQL, or at least the knowledge that there might be something there to appreciate.  And do keep in mind that the examples I've given are all really simple.  The power of the "Structured" is sort of nice with queries just a few lines long, but it really shines when you're dealing with queries that stretch over hundreds of lines, especially when they're dynamically constructed.  That incredibly granular breakdown of functionality allows you to reuse bits and pieces very effectively, and the zero side effect nature means that you are guaranteed not to have some random side effect ruin your day.

4 responses to “Embrace Your [HS]QL”

  1. figital

    I love SQL. Especially the _R__ part. Pay attention to your indexing plans and you can do some amazing, speedy things. +1

  2. Jaime Metcher

    Another good one Barney – you're on fire these days! In fact I was moved to post a "me too" on my own blog.

  3. John Allen

    Sharp! I suck at SQL… I will get better as SQL, I will get better as SQL, I will get better as SQL.

  4. Jake Munson

    Awesome post. And somewhat similar to a post of mine from a year ago:
    http://techfeed.net/blog/index.cfm/2009/3/2/Embrace-JavaScript-and-SQL