Click here to Skip to main content
15,890,670 members
Articles / Database Development

SQL Joins and Subqueries

Rate me:
Please Sign up or sign in to vote.
2.00/5 (1 vote)
23 May 2011CPOL3 min read 38.1K   3   2
A discussion about SQL joins and subqueries with information about how to format join statements properly

This is a discussion about SQL joins and subqueries with information about how to format join statements properly.

In any non-trivial task, developers need to join tables together. Joins are important when we want to:

  1. Display/get data from more than 1 table in a result set.
  2. Filter out records based on a particular restriction which requires using a column from another table.
  3. Connect to an intermediary [relationship] table that helps connect two primary [entity] tables which have a many to many relationship. For example, if Student and School are two entity tables, then EnrollmentRecords might be the relationship table connecting the two.

There are certainly variations and extensions of why joins might be needed, but the above 3 cover most of the circumstances.

There is an alternative to using joins in some scenarios – subqueries. In instances where the goal is to filter on a restriction, a subquery can be used. Consider the following 2 SQL statements [using the Adventureworks database]:

SQL
-- Use a Subquery
SELECT * FROM AdventureWorks.Person.Address 
WHERE StateProvinceID IN
(
	SELECT StateProvinceID 
              FROM AdventureWorks.Person.StateProvince 
              WHERE StateProvinceCode = 'CA'
)
-- Use a Join
SELECT addr.* 
FROM AdventureWorks.Person.Address addr
INNER JOIN AdventureWorks.Person.StateProvince state 
ON addr.StateProvinceID = state.StateProvinceID
WHERE state.StateProvinceCode = 'CA'

These are both functionally equivalent. Which of these is better? In this trivial example, both are fine. However, when query speed becomes an important factor, joins are generally going to outperform subqueries. I'll explore this issue more in my next post.

Also, the queries, as-is, use select *. This is generally not a good practice – instead, a column list should be provided in the select statement. However, having the select * raises one important difference between the 2 queries above – the columns from the joined table will be available in the result set whereas the columns from the table in the subquery will not. This is why I specified the table alias in the 2nd query [for the select *] to make the results display equivalently.

With the join, notice there are two conditions – one that the StateProvinceID must be equal and one that the state code needs to be 'CA'. Why did I put the StateProvinceID check in the 'on' clause and the state code check in the 'where' clause? I think doing it this way is cleaner. When I join tables, I consider there to be three types of join conditions:

  1. Those that are fundamental to the tables [these would normally be identified by foreign key relationships or, lacking those, identically named columns. These should always be in the 'on' clause.
  2. Those that aren't fundamental to the tables but that are almost always used as filters when joining on a table. For example, on the student enrollment table at Boston Public Schools, we have a field for whether the record is the most recent one and whether the withdrawal code is set for that record. Almost all queries we use check to ensure the record is the most recent one and that the student isn't withdrawn. While these two filters are not fundamental join conditions when joining with enrollment table, they are so frequent that it simply makes sense and is cleaner to have these in the 'on' clause instead of in the 'where' clause.
  3. Those that serve as potentially modifiable filters. For example, a check that the state code is 'CA' or that some test score is > 30 should probably be in the where clause. Theoretically, if this logic is in a stored procedure, there should be an input parameter for the value to check/numerical threshold for filters in this category.

It is also possible to join the tables without any 'on' condition. For example, the following is functionally equivalent to the above two statements:

SQL
SELECT addr.* 
FROM AdventureWorks.Person.Address addr, AdventureWorks.Person.StateProvince state 
WHERE addr.StateProvinceID = state.StateProvinceID
AND state.StateProvinceCode = 'CA'

This is an implicit join with a comma separating the joined tables and all conditions in the where clause. The original join [with an 'on' condition] is an explicit join. Is the implicit join syntax recommended? Functionally and speed-wise, these two are identical. However, for queries with a number of tables being joined, the implicit syntax can become unmaintainable and difficult to debug/extend. I always prefer the explicit join syntax.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
United States United States
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
SuggestionHave you considered 'exists'? Pin
smcnulty200028-Jun-11 15:03
smcnulty200028-Jun-11 15:03 
The use of Exists is OFTEN faster for this kind of query. I say 'often' because all rules for sql tuning are subject to a lot of other factors.

SELECT * FROM AdventureWorks.Person.Address addr
WHERE exists
(
SELECT 'x'
FROM AdventureWorks.Person.StateProvince sp
WHERE StateProvinceCode = 'CA'
and addr.StateProvinceID = sp.StateProvinceID
)

Note that I have 'x' in the subquery because I want to carefully limit how much data has to be extracted from the given table. It could be StateProvinceCode or StateProvinceID without damaging performance since those are part of what the subquery must retrieve in any case. If I used an asterisk here it would work but would retrieve all fields- which is not needed and almost certainly won't improve speed.

In either case the purpose of these two joins is to filter the results down rather than for the purpose of pulling data together. If you need data from the multiple tables you must use a join (unless you want to do something more complicated than this).

In this particular case you could also probably get away with an equals. All of these examples implicitly assume there is only a single 'CA' entry in the StateProvince table so an equals sign could work.

SELECT * FROM AdventureWorks.Person.Address addr
WHERE addr.StateProvinceID =
(
SELECT max(sp.StateProvinceID)
FROM AdventureWorks.Person.StateProvince sp
WHERE StateProvinceCode = 'CA'

)

I use MAX here to return only a single value even if there are duplicates in the table. After all, we don't know for sure that StateProvinceId is a unique id. If you remove the MAX you can get situations like that to pop an error that you can catch- but the program will fail. It all depends on what you want it to do with that situation and what you believe about the tables in question.

You could also do this:

SELECT * FROM AdventureWorks.Person.Address addr
WHERE
(
SELECT max(sp.StateProvinceCode)
FROM AdventureWorks.Person.StateProvince sp
WHERE
addr.StateProvinceID = sp.StateProvinceID

)= 'CA'

This method might be fast if you are going to be retrieving a lot of the Address table and only trimming out a small portion. It also might depend on what is and isn't indexed in the table.

The versions that don't use a join have the benefit that none of them would return multiple rows for the same record in the Address table. With this set of tables that might not be a danger with the join but it doesn't take much in the way of join complication to change that.
_____________________________
Give a man a mug, he drinks for a day. Teach a man to mug...

GeneralMy vote of 2 Pin
shankaranarayana31-May-11 18:42
shankaranarayana31-May-11 18:42 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.