Introduction
There is nothing more difficult to master than repetition. If you do it badly, it's clumsy, stupid. When it's well done, it's like a little echo, like waves, poetry itself.
Laurence Cossé “A Novel Bookstore”
Contents
Why this article?
Anyone who spends any time participating in the Quick Answers forum here on Code Project (as well as a similar forum on “another” site), will see some common themes crop up in the questions time and time again.
Within the SQL topic, one subject is particularly common and it usually worded like “How can I write a loop in SQL to get …” followed by whatever results the poster is trying to achieve.
This article is an attempt to bring together several potential solutions to that type of question, based on the principle you don’t need a loop!
So I am aiming this primarily at SQL “beginners”, showing how the many features of SQL Server mean that loops are very rarely necessary. There are some tricks included, however, that Intermediate users may find useful. I'm using worked examples to explain each alternative.
What is a loop?
It may seem obvious but a formal definition never hurts:
In computer programming, a loop is a sequence of instructions that is continually repeated until a certain condition is reached. Typically, a certain process is done, such as getting an item of data and changing it, and then some condition is checked such as whether a counter has reached a prescribed number. If it hasn't, the next instruction in the sequence is an instruction to return to the first instruction in the sequence and repeat the sequence. If the condition has been reached, the next instruction “falls through” to the next sequential instruction or branches outside the loop. A loop is a fundamental programming idea that is commonly used in writing programs.[^]
Something else that I’ll mention occasionally in “Gotcha!” sections, is an “infinite loop”. Here is a definition:
An infinite loop is one that lacks a functioning exit routine . The result is that the loop repeats continually until the operating system senses it and terminates the program with an error or until some other event occurs (such as having the program automatically terminate after a certain duration of time).[^]
Why do we think we need a loop?
Many newcomers to the use of relational databases have come from a procedural programming background. Examples include the many students studying at “colleges” who simply can’t let VB6 die a natural death; the poor guys trying to maintain legacy systems that some corporations just won’t let go of; the Excel or Access users who are upgrading their skills.
With that background, loops often seem to be a natural answer to problems that can often be stated in terms like:
- “For each row in the table …”
- “For each date between two given date …”
- “For each month in the data …”
To be honest, VB6 and VBA almost forced the use of loops with the ADO Recordset object, because “at any time, the Recordset object refers to only a single record within the set”[^]
Typical VB6/VBA code might have looked like
Dim rs As ADODB.Recordset
‘. . . code that populates the record set
With rs
If Not .BOF And Not .EOF Then
.MoveLast
.MoveFirst
Do While Not (rs.EOF)
rs.MoveNext
Loop
End If
.Close
End With
Excel can promote similar thinking due to the way calculations automatically change per row simply by dragging the cell downward…
A scenario that can reinforce this thinking is where the project requirements set out multiple actions to be done on a single set of data - something like:
Present a report based on table myTable which has the following information
- Display each row with its rank based on value
- Count the number of rows
- Determine which row has the largest value
- Work out the total and the average of all of the values for each product
- Capture the dates of the first and last entries
- If a row has a certain attribute compared to another table update a status field
Some of those instructions have to be repeated for each row so I must need a loop, right?
While I’m looping through those rows I’ll have variables to capture the running total and a counter and the maximum date and the minimum date… that’s a good idea, right?
Then I need to loop through my table and update that other table as I go. Right?
Although this example is contrived, it’s easy to see how you could be drawn into the world of procedural loops. The answer to all of these comments is no, no, no!
Sample Data used in the Examples
All of the examples in this article use the NorthWind sample database, available free from Microsoft. I downloaded the creation scripts from the Microsoft Download Center[^] however entire databases are available from the same source if you prefer.
Beware! The scripts have not been updated for some time so bear in mind that any dates will be historic (or possibly prehistoric!)
So explain to us why we don’t need to use loops!
When you’re starting out with databases, the key concept to remember is that relational databases are set-based.[^]
They are designed to act on bunches of related things (data) all at once. The relational database will handle the fact that there are many rows on which to perform the action. Essentially, you write something that describes what you want to happen, and let the Relational Database Management System (RDBMS) take care of how it is going to make it happen.
For example: Imagine the requirements are “For any Product, if the Units In Stock are less than 20 units and there are no Units On Order, Discontinue the product by setting the Discontinued flag to 1”.
The pseudo-code to do this in a procedural way would be
Start at the beginning of the record set
While there are still records to process
If Units In Stock < 20 AND Units On Order = 0 Then
Set Discontinued = 1
End If
Move to the next record
End While
But with set based logic the instruction becomes very simply
UPDATE Products SET Discontinued = 1 WHERE UnitsInStock < 20 AND UnitsOnOrder = 0
Gotcha!
Remember the definition of an infinite loop from the introduction? One equivalent of an infinite loop in a SQL update statement would be to forget to include a WHERE clause.
Any data contained in a single table is (or should be) “related”[^] it is part of a “set” of information. So any instruction issued against the table is going to affect the entire table – unless you restrict the effects using a WHERE clause[^]
Gotcha again!
If you look at the original data in the Products table, some of the products have already been discontinued. With many RDBMS, our current instruction will set the Discontinued flag on those records again. We are needlessly updating records that we don’t need to. It doesn’t have much impact on our small database but if there were thousands of records that were already discontinued, then we could waste time and resources. So, although it is not explicitly mentioned in requirements we were given, we should really add another filter …
UPDATE Products SET Discontinued = 1 WHERE UnitsInStock < 20 AND UnitsOnOrder = 0 AND Discontinued = 0
The revised query will affect fewer rows – Our new set only contains those rows that actually require an update.
Alternatives to Loops - Joins and Sub-Queries
It’s easy enough for me to say “Don’t use loops in SQL” but it would be a little unfair not to offer some alternatives with a little more complexity than I’ve offered so far. The following section is going to give some worked examples to give you some ideas on other techniques that can be used instead. I’ll also refer to some other articles, mostly here on CodeProject, where the authors have fully explained a topic better than I have room for here.
Filtering data
Imagine the requirement “Get a list of all of the orders for customers based in London”.
When I look at the Orders table I can see a column [ShipCity] and when I query against it with
SELECT OrderID, CustomerID FROM orders WHERE ShipCity = 'London'
I get a list of 33 orders. Great, so I pass that information back to the User – who comes back with a complaint – “But what about the orders for my client ‘Around the Horn’?”
Going back to look at the data I realise that this client ‘Around the Horn’ is based in London, but has their orders shipped to Colchester. My first query doesn’t fit the requirements.
There may be a temptation to create a loop along the lines of:
Create a table to contain the results
Create a recordset containing only Customers where City=’London’
Start at the beginning of the recordset
While there are still records to process
Copy all the orders for this Customer to the output results
Move to the next Customer record
End While
But I know that SQL is set based ,so I do this instead - Get all of the orders where the Customer ID is in the set of Customers where City=‘London’ Or:
SELECT * FROM orders WHERE CustomerID IN (SELECT CustomerID FROM Customers WHERE City= 'London')
That’s better. Now I get 46 rows and Customer ‘Around the Horn’ is definitely included in the new results.
This is an example of using a sub-query (or inner query) to identify the set of data we want to use. You can read more about sub-queries on the technet article Subquery Fundamentals[^].
Sub-queries are a great way of limiting the items in a set when used with the WHERE clause, but they can also be used in the SELECT statement too. This is an easy way of including summary information.
Imagine the requirements actually said “Get a list of all of the orders for customers based in London and show a comparison of each row’s freight against the average freight for all orders”.
The procedural approach would probably look like this
DECLARE @OverallAvg NUMERIC(15,2) SET @OverallAvg = (SELECT AVG(Freight) FROM Orders)
SELECT *, Freight - @OverallAvg as AvgFreight FROM orders
WHERE CustomerID IN (SELECT CustomerID FROM Customers WHERE City= 'London')
But by using another sub-query we can do everything we need in a single statement
SELECT *, Freight - (SELECT AVG(Freight) FROM Orders) as AvgFreight
FROM orders
WHERE CustomerID IN (SELECT CustomerID FROM Customers WHERE City= 'London')
Now as my orders table grows and grows, my DBA has become concerned about performance and asks me to take another look at my query to see if I can improve it.
Let’s go back to the fact that SQL is set-based. I can define the set of data I want to work with by referring to more than one table by using Joins.
Specifically in this case, an INNER JOIN will only return data that is in both tables. In other words, you can use an Inner Join to help filter your results.
[Note: There are many articles available on how and when to use joins. A full discussion of that topic is out of scope for this article but this CodeProject article is worth a read - Visual Representation of SQL Joins[^]]
Here is the first example above, this time using an Inner Join instead of a sub-query
SELECT O.*, Freight - (SELECT AVG(Freight) FROM Orders) as AvgFreight
FROM orders O
INNER JOIN Customers C ON O.CustomerID = C.CustomerID
WHERE City = 'London'
Or even
SELECT O.*, Freight - (SELECT AVG(Freight) FROM Orders) as AvgFreight
FROM orders O
INNER JOIN Customers C ON O.CustomerID = C.CustomerID AND City = 'London'
Both of these queries return all of the data I was expecting, but much more efficiently (in this case) than using the IN clause. What about that sub-query in the SELECT though?
Well we can use sub-queries in the JOINs too, like this
SELECT O.*, Freight - SQ.OverallAvg as AvgFreight
FROM orders O
INNER JOIN Customers C ON O.CustomerID = C.CustomerID AND City = 'London'
INNER JOIN (SELECT AVG(Freight) AS OverallAvg FROM Orders) SQ ON 1=1
Because there is only one value being returned from our temporary table produced from SQ I’ve used the trick of using an ON clause that is always true... ON 1=1
.
Unfortunately this isn’t necessarily a better way in terms of performance, but I do think it keeps the SELECT clause a little tidier.
[Note. I’ve said that JOIN performs better than sub-queries for queries like this, however discussions about performance are beyond the scope of this article. I’ve included some suggested further reading material in the references at the end of the article]
Filtering data with NOT IN and NOT EXISTS
Now we know we can use Inner Joins as a way of including a filter on our data set, what happens if the requirements change?
Imagine our users now need me to keep a list of telephone numbers where the Customer has stated they do not want to be contacted by phone. I decide to keep a list of those Customers in a very simple new table …
CREATE TABLE DoNotTelephone
(
[CustomerID] [nchar](5) NOT NULL
) ON [PRIMARY]
And I populate this table with the Ids of Customers that should not be contacted by phone.
INSERT INTO DoNotTelephone VALUES
('LAZYK'),('LETSS'),('NORTS'), etc…
The users now ask me for a list of telephone numbers for a Sales campaign, they want a list of all USA Customers. They don’t mention it, but of course, I shouldn’t include any numbers in the DoNotTelephone list.
Ok, we know by now that we definitely don’t need to use a procedural loop. You might think that a join is probably a good option, but how do you join to something that isn’t there?
To do this, I can exploit a feature of LEFT OUTER
Joins. An INNER
join will only return records that are in both tables. A LEFT OUTER
join will return records for every row on the “left-hand” table and, if the row does not exist on the “right-hand” table SQL will provide NULL
values for the columns coming from that table. Otherwise SQL will provide the actual values from the right-hand table.
If that’s not clear then have another quick read of Visual Representation of SQL Joins[^]
So, if I filter the results to only get the records with NULL
values from the DoNotTelephone table, it is the equivalent of saying “not in the DoNotTelephone table”…
SELECT ContactName, ContactTitle, CompanyName, C.Phone
FROM Customers C
LEFT OUTER JOIN DoNotTelephone DNT ON C.CustomerID = DNT.CustomerID
WHERE Country = 'USA'
AND dnt.CustomerID IS NULL
But as more and more records are added to the DoNotTelephone list this query will take longer and longer to run as we are querying values we don’t actually need.
Fortunately, the NOT IN
clause seems to be exactly what I need:
SELECT * FROM Customers
WHERE CustomerID NOT IN
(SELECT CustomerID FROM DoNotTelephone)
AND Country = 'USA'
But I’m still querying data that I don’t need – any of the clients that are not in the USA but are on the DoNotTelephone list are still queried despite the fact we don’t need them.
The (probably) better solution is to use NOT EXISTS
. Why? Well the sub-query will not return any actual data – it only returns the equivalent of True
or False
. In the amended query below I’ve used SELECT 1
rather than SELECT Phone
just to emphasise that fact.
SELECT * FROM Customers C
WHERE NOT EXISTS (SELECT 1 FROM DoNotTelephone WHERE CustomerID = C.CustomerID) AND Country = 'USA'
I’ve touched briefly on the potential problems of querying for data that we don’t actually need. If you want to read more on how sub-queries can impact on performance then there is some further reading material suggestions at the bottom of this article.
Updating tables using Joins
A very common scenario involves updating one table based on the contents of another one.
Consider these requirements “For any Product in the ‘Seafood’ or ‘Meat/Poultry’ categories set the Discontinued flag to 1”.
Look at the Products table and you’ll notice that the Category for each product is represented by CategoryID
, which is a number instead of text. ‘Seafood’ is category 8 and ‘Meat/Poultry’ is category 6.
We need to define which records are in the set of data which needs to be updated. I could query the Categories table to find the row for ‘Seafood’, take note of the CategoryID
and then use that number to update the Products table … but you would have to repeat that process for the ‘Meat/Poultry’ category.
--Attempt #1
DECLARE @id INT = (SELECT CategoryID FROM Categories
WHERE CategoryName = 'Seafood')
UPDATE Products SET Discontinued = 1 WHERE CategoryID = @id AND Discontinued = 0
SET @id = (SELECT CategoryID FROM Categories
WHERE CategoryName = 'Meat/Poultry')
UPDATE Products SET Discontinued = 1 WHERE CategoryID = @id AND Discontinued = 0
That is just awful! I’m repeating all that code.
This is the danger point where our procedural brains decide “well a loop would work well there”. The temptation is to loop through the Categories – the pseudo–code looks like this
Create a recordset containing only Categories that match the requirements
Start at the beginning of that recordset
While there are still records to process
Update the Products table for this CategoryID
Move to the next Category
End While
But by now we’re starting to understand that we’re not going to do it that way!
I could use the IN
– clause with the hard-coded values 6 and 8 that I’ve looked up manually.
--Attempt #2
UPDATE Products set Discontinued = 1
WHERE CategoryID IN (6,8)
AND Discontinued = 0
But in a couple of months I might forget where those “magic numbers” came from or, worse still, a colleague has to pick up my work and just doesn’t know where those numbers came from in the first place!
The use of unnamed magic numbers in code obscures the developers’ intent in choosing that number, increases opportunities for subtle errors … and makes it more difficult for the program to be adapted and extended in the future. [^]
Fortunately, I can do the update all in one go and still make it obvious what I’m trying to achieve by utilising a join…
-- Attempt #3
UPDATE Products set Discontinued = 1
FROM Products P
INNER JOIN Categories C ON P.CategoryID = C.CategoryID
WHERE CategoryName IN ('Seafood', 'Meat/Poultry')
AND Discontinued = 0
It's worth taking a moment to look at the differences between those last two SQL statements. (“Attempt #2” and “Attempt #3”)
- The bit that is doing the UPDATE is the same in both – that makes sense, that is what we’re trying to achieve.
- The WHERE clause is very similar … the first version has “magic numbers” whereas the second version explicitly uses the words from the requirements, ‘Seafood’ and ‘Meat/Poultry’, but the clauses are still essentially the same.
- The INNER JOIN line in #3 looks okay – nothing special there although perhaps we weren’t really expecting it in an UPDATE statement.
- That extra FROM in Attempt #3 is weird though. It almost looks like we are defining a SELECT rather than updating a table. Which in a sense is exactly what we are doing …
We are defining the set of data on which we want to perform the update. Think of it as “update the rows that came FROM this query”. That Inner Join now makes sense as we have already seen how joins can help filter a set of data.
In the example above we were setting a column value to a single value for all rows in the set. What if we want to set values based on the actual values in another table?
Imagine our support department has been moving some servers around and have relocated all of the photographs of our Employees. They have graciously provided us with a list of Employee Ids and the new location of their photographs. The table is quite simple:
CREATE TABLE [dbo].[newPhotos](
[EmployeeID] [int] IDENTITY(1,1) NOT NULL,
[PhotoPath] [nvarchar](255) NULL
) ON [PRIMARY]
We use the same technique as we did for the Category i.e. a Join, but this time the Join doesn’t just provide a filter (the set of Employees whose photos have moved), it also provides the new value that we need to use for each Employee:
UPDATE Employees SET PhotoPath = New.PhotoPath
FROM Employees E
INNER JOIN newPhotos New ON E.EmployeeID=New.EmployeeID
The key feature here is that you need to mention the table you are updating twice... The UPDATE
needs to know which table to update, but we also need to use the table in the way we define the set of data to use in the update.
To save typing you can use the table alias instead of the full name
UPDATE E SET PhotoPath = New.PhotoPath
FROM Employees E
INNER JOIN newPhotos New ON E.EmployeeID=New.EmployeeID
Running Totals using Self Joins
To get “running” totals or counts from a set of data you can use a table with a “self-join” i.e. a join to itself.
A self-join is a query in which a table is joined (compared) to itself. Self-joins are used to compare values in a column with other values in the same column in the same table. One practical use for self-joins: obtaining running counts and running totals in an SQL query. [^]
Imagine we want to get a running total and a running count of all orders by product. We can use:
SELECT P.ProductName, t1.OrderID, t1.Quantity,
RunningTotal = SUM(t2.Quantity),
RunningCount = COUNT(t2.Quantity)
FROM [Order Details] AS t1
INNER JOIN [Order Details] AS t2 ON t1.ProductID = t2.ProductID
AND t1.OrderID >= t2.OrderID
INNER JOIN Products P ON t1.ProductID=P.ProductID
GROUP BY P.ProductName, t1.OrderID, t1.Quantity
ORDER BY P.ProductName,t1.OrderID
Notice the AND t.OrderID > t2.OrderID …
- it’s that part of the ON
clause that provides the mechanism for the “running” total when we do the SUM
and COUNT
.
Here are the partial results of that query - note that both the Running Total and Running Count restart when the ProductName
changes
There are other ways of getting Running Totals discussed on this CodeProject article: Calculating simple running totals in SQL Server[^]
Gotcha! – Join Hints
I’m going to mention Join hints here very briefly and only because there is such a thing as a LOOP JOIN hint. It sounds as if it should belong in an article like this, I tend to disagree.
Hints are options and strong suggestions specified for enforcement by the SQL Server query processor on DML statements. The hints override any execution plan the query optimizer might select for a query.[^]
The SQL Server Query Optimizer does quite a good job of deciding the best execution plan for a query, although “on occasion it can be improved”[^]. However, I’m extremely wary of presuming that I can determine what those improvements should be!
More importantly, there is the distinct danger that hints get left behind in production code and never revisited after system upgrades. This increases the risk that they become counter-productive over time.
For the purposes of this article, which is aimed largely at beginners, I’ve made you aware that they exist and have provided some further reading material in the link in the text if you decide to pursue the subject further.
Solve lots of problems with Common Table Expressions
A common table expression (CTE) can be thought of as a temporary result set that is defined within the execution scope of a single SELECT, INSERT, UPDATE, DELETE, or CREATE VIEW statement. A CTE is similar to a derived table in that it is not stored as an object and lasts only for the duration of the query. Unlike a derived table, a CTE can be self-referencing and can be referenced multiple times in the same query.[^]
The best bit about that statement is that phrase “a CTE can be self-referencing” which means you can have Recursive Common Table Expressions (rCTEs) [^] – which are a great way of avoiding using procedural loops. Here are a few examples…
CTE Example 1 – Traversing a hierarchy
The Northwind database contains an Employees table which lists all employees in the company and also indicates which employees report to whom (who their manager is).
If we query the table
SELECT EmployeeID, ReportsTo, LastName, FirstName, Title
FROM Employees
ORDER BY ReportsTo, EmployeeID
We get
Which is fine, but isn’t very helpful when we try to picture the employees in an organisation chart. Here is the pictorial representation of how we would like to see those results
With the diagram it’s easy to see that Anne Dodsworth, Robert King and Michael Suyama are at the “lowest” level, Andrew Fuller is “the boss” and everyone else is at the same reporting level in the middle (Level 2). It’s not so easy to see that structure in the query results as we have to keep going back to the results to see who reports to whom.
We can get around that problem by using a Recursive Common Table Expression (rCTE) to traverse that hierarchy, and to get more information on each row, i.e. the “Level” in the diagram for each employee and the reporting path. Here is the full query
;WITH Emp_CTE AS
(
SELECT EmployeeID, ReportsTo, LastName, FirstName, Title
, 1 as RLevel
, MtoE = CAST(isnull(ReportsTo,0) AS VARCHAR(MAX)) + '/' +
CAST(EmployeeID AS VARCHAR(MAX))
FROM Employees
WHERE ReportsTo IS NULL
UNION ALL
SELECT e.EmployeeID, e.ReportsTo, e.LastName, e.FirstName, e.Title
, RLevel + 1
, MtoE = MtoE + '/' + CAST(e.EmployeeID AS VARCHAR(MAX))
FROM Employees e
INNER JOIN Emp_CTE ecte ON ecte.EmployeeID = e.ReportsTo
)
SELECT EmployeeID, EC.ReportsTo, LastName, FirstName, Title, RLevel, MtoE
FROM Emp_CTE EC
What is actually happening in this query?
The first part of the query
SELECT EmployeeID, ReportsTo, LastName, FirstName, Title, 1 as RLevel
,MtoE = CAST(isnull(ReportsTo,0) AS VARCHAR(MAX)) + '/' + CAST(EmployeeID AS VARCHAR(MAX))
FROM Employees
WHERE ReportsTo IS NULL
Is the “Anchor Member” for the CTE, also referred to as the “Invocation” – i.e. our starting point. In this case I’m looking for any Employee who doesn’t report to anyone else – the “boss”.
The next part of the query is the “recursive member”, which is called repeatedly until all of the records have been processed.
SELECT e.EmployeeID, e.ReportsTo, e.LastName, e.FirstName,
e.Title, RLevel + 1
, MtoE = MtoE + '/' + CAST(e.EmployeeID AS VARCHAR(MAX))
FROM Employees e
INNER JOIN Emp_CTE ecte ON ecte.EmployeeID = e.ReportsTo
The UNION ALL
combines all of the result sets together into a temporary result set called Emp_CTE
. Finally I query that temporary result set as if it was just another table on the database.
So our Anchor member returns this set of data
2 NULL Fuller Andrew Vice President, Sales 1 0/2
The first iteration of the recursive member returns the list of employees that report to Andrew – i.e. the first iteration uses the result set from the Anchor member as its input.
1 2 Davolio Nancy Sales Representative 2 0/2/1
3 2 Leverling Janet Sales Representative 2 0/2/3
4 2 Peacock Margaret Sales Representative 2 0/2/4
5 2 Buchanan Steven Sales Manager 2 0/2/5
8 2 Callahan Laura Inside Sales Coordinator 2 0/2/8
The next iteration returns the following results – a list of all of the people who report to Nancy, Janet, Margaret, Steven or Laura. An iteration of the recursive member uses the output from the previous iteration (of the recursive member) as its input, and moves down the hierarchy one level.
6 5 Suyama Michael Sales Representative 3 0/2/5/6
7 5 King Robert Sales Representative 3 0/2/5/7
9 5 Dodsworth Anne Sales Representative 3 0/2/5/9
The next iteration returns an empty dataset as Michael, Robert and Anne do not have anyone reporting to them and so the recursion ends. All of the results are combined as the output from the rCTE as the output data set and can be queried as you would query against any other table.
The entire rCTE gives these results
Let’s look in more detail at those derived columns RLevel and MtoE
RLevel is easy – we set it to 1 in the Anchor member and then increment it by 1 each time the recursive member is called. So it is showing the “level” within the hierarchy for each employee.
MtoE (Manager to Employee) tracks the “path” down the hierarchy tree to each employee using the EmployeeID at each level of reporting - “Level 0” / “Level 1” / “Level 2” / “Level 3” / …etc.
E.g. For Employee Anne Dodsworth the column MtoE comes out as “0/2/5/9” …
- Starting at the right of that string find Employee Id 9 at the lowest level (i.e. Anne Dodsworth)
- On the next level up (Level 2) find Employee Id 5 (Steven Buchanan)
- On the next level up (Level 1) find Employee Id 2 (Andrew Fuller)
- On the next level up (Level 0) find Employee Id 0 (No-one)
So Anne Dodsworth reports to Steven Buchanan who reports to Andrew Fuller who reports to no-one.
Being able to track the records like this could be useful to your presentation layer – e.g. so that you can list each manager immediately followed by their direct reports.
Gotcha! – MAXRECURSION
When using recursion it’s quite easy to get into an infinite loop situation with a poorly formed CTE.
Let’s have a look at that CTE for the managers reporting structure with just a minor adjustment – I’ve used the alias for the CTE within the SELECT of the recursive member instead of the Employees table i.e. I’ve used alias ‘ecte’ instead of ‘e’. It’s in bold in the code below.
;WITH Emp_CTE AS
(
SELECT EmployeeID, ReportsTo, LastName, FirstName, Title
FROM Employees
WHERE ReportsTo IS NULL
UNION ALL
SELECT ecte.EmployeeID, ecte.ReportsTo,
ecte.LastName, ecte.FirstName, ecte.Title
FROM Employees e
INNER JOIN Emp_CTE ecte ON ecte.EmployeeID = e.ReportsTo
)
SELECT EC.EmployeeID AS ID, ISNULL(E.LastName + ',' + E.FirstName, 'No-one!') as Manager, EC.LastName, EC.FirstName, EC.Title
FROM Emp_CTE EC
LEFT JOIN Employees E ON EC.ReportsTo = E.EmployeeID
ORDER BY MtoE, EC.ReportsTo, EC.EmployeeID
When I run that query I get an error
Msg 530, Level 16, State 1, Line 1 The statement terminated. The maximum recursion 100 has been exhausted before statement completion.
Strange! There are only 9 entries on the Employees table so I’m only expecting 9 rows in the result – well, certainly not more than 100! Errors like this can be difficult to stop but in this case I’ve done it deliberately. What is happening with the use of alias ecte instead of e? I’m not trying to pass the output from the previous iteration into the next pass, I’m actually repeatedly referring back to the previous result set (ecte) itself, which will always contain data (in this example) – i.e. an infinite loop.
Fortunately SQL Server has a default on the number of recursions (it’s 100 funnily enough) so I’m not left sitting around waiting for that query to (never) finish. But this is a great way of confirming my code doesn’t cause an infinite loop and if there is one SQL Server will kill it for me!
So I got the error message, looked back at my code, spotted the error in the recursive member and fixed those aliases from ecte to e. Now when I re-run the query I get the expected results we discussed above. Job done? Not quite.
What if there were more than 100 employees on the table? Is the default number of recursions going to stop our query before we get to where we need to be? In short, yes it is, but we can do something about it. We can use the MAXRECURSION hint [^] on the query.
Say we were expecting around 150 rows to be returned, we could add
OPTION (MAXRECURSION 150)
to the end of the query (after the ORDER BY clause)
Fantastic! I test it. It works. I can sign it off to production and go off to celebrate a successful implementation.
Wind the clock forward until our company now has 151 employees. Bang! The query starts to fail again. I could change the hint to some arbitrary large number, but there is nothing to say that our company isn’t going to go global and break whatever number I choose. I would have to change the code yet again. Don’t forget what we said about magic numbers earlier as well.
I can avoid this situation by telling SQL Server that we don’t want any restriction on the number of recursions, and I do this with
OPTION (MAXRECURSION 0)
The parameter of 0 (zero) does not mean “no recursions”; it means “no limit on recursions”.
Gotcha Again!
Never use OPTION (MAXRECURSION 0)
in a query until the query has been thoroughly tested…or you could go back to an infinite loop scenario.
CTE Example 2 – Comparing Previous and Subsequent values
Because CTEs produce temporary data sets similar to temporary tables, you have a lot of flexibility on what you can do with them. For example, you can have multiple CTEs which you can join, or you can also do a “self-join”. Here’s an example to demonstrate how that could be useful…
Imagine I’ve been asked to find out information about the frequency of Orders placed by client code ERNSH. Our users want to know how many days there are between each order for that client.
It’s simple enough to get the OrderDates for that client
SELECT CustomerID, OrderDate
FROM Orders
WHERE CustomerID = 'ERNSH'
And we get 30 rows of data …
Now to get the difference between the dates, it looks like I need to loop through those results, capture the order date in a variable, compare the current value of the variable to the next one in the loop … Let’s stop there – You get the idea and you know I’m not going to use a procedural loop anyway.
Have a look at this CTE:
;WITH CTE AS
(
SELECT CustomerID, OrderDate
,ROW_NUMBER() OVER(ORDER BY OrderID) AS rn
FROM Orders
WHERE CustomerID = 'ERNSH'
)
SELECT CTE.CustomerID, CTE.OrderDate, ISNULL(DATEDIFF(dd, prev.OrderDate, CTE.OrderDate),0) AS DIFF
FROM CTE
LEFT OUTER JOIN CTE prev ON prev.rn = CTE.rn − 1
To our original query to get the order dates, we've added a call to the SQL function ROW_NUMBER() – which is going to give all of the orders a sequential row number between 1 and 30 (that 'sequential' is important…) based on the Order ID. Order ID has been defined as
[OrderID] [int] IDENTITY(1,1) NOT NULL
so ordering by that column means I'm going to get the Orders returned in the order they were entered.
I’ve placed that query into a simple Common Table Expression (“simple” meaning in this case, “non-recursive”) then I run a query against that CTE. That query includes a LEFT OUTER JOIN back to the same CTE – it has the ALIAS prev
above. That join gives us the record that has the previous row number to the row we are currently viewing.
LEFT OUTER JOIN CTE prev ON prev.rn = CTE.rn - 1
So we can now compare the OrderDate from CTE to the OrderDate on the join prev
to get the days difference between orders.
DATEDIFF(dd, prev.OrderDate, CTE.OrderDate) AS DIFF
Our new query gives the results:
Note we could also look ahead to the next record by using
LEFT OUTER JOIN CTE nxt ON nxt.rn = CTE.rn + 1
And you can do just about anything with the CTE that you would do to a table – e.g. Group By, Averages etc
;WITH CTE AS
(
SELECT CustomerID, OrderDate
,ROW_NUMBER() OVER(ORDER BY OrderID) AS rn
FROM Orders
WHERE CustomerID = 'ERNSH'
)
SELECT CTE.CustomerID, AVG(ISNULL(DATEDIFF(dd, prev.OrderDate, CTE.OrderDate))) AS DIFF
FROM CTE
LEFT OUTER JOIN CTE prev ON prev.rn = CTE.rn - 1
GROUP BY CTE.CustomerID
To get
ERNSH 21
CTE Example 3 – Using Multiple CTEs
I’m going to demonstrate another method of getting that average using a second CTE linked to the first. Note you wouldn’t really want to use this construct in this way with such a simple example. But because it is so simple it does demonstrate how easily multiple CTEs can be linked. There are more meaningful examples in this Code Project article CTE In SQL Server[^]
;WITH CTE AS
(
SELECT CustomerID, OrderDate
,ROW_NUMBER() OVER(ORDER BY OrderID) AS rn
FROM Orders
WHERE CustomerID = 'ERNSH'
), CTE2 AS
(
SELECT CTE.CustomerID,
ISNULL(DATEDIFF(dd, prev.OrderDate, CTE.OrderDate),0) AS DIFF
FROM CTE
LEFT OUTER JOIN CTE prev ON prev.rn = CTE.rn - 1
)
SELECT CustomerID, 'Average', AVG(DIFF)
FROM CTE2
GROUP BY CustomerID
Note the second CTE (ingeniously named CTE2
) uses the first CTE to derive some data and the final query only refers to CTE2.
This is all very well, but there are lots of clients on our database. We don’t really want to have to run this query one-by-one for each client. Fortunately there is a version of that ROW_NUMBER function that will PARTITION the data for us. We can instruct the query to restart the row numbering from 1 every time we come across a new CustomerId. Here’s the amended query to get the row numbers per client
SELECT CustomerID, OrderDate, OrderID
,ROW_NUMBER()
OVER(PARTITION BY CustomerID ORDER BY OrderId) AS rn
FROM Orders
We’ve changed the OVER clause to order the data by OrderId only and told it to PARTITION, or analyse the data, by each CustomerId. That query gives these results
Notice how the value of rn restarts from 1 for each new set of customer data.
Gotcha! Row Number is no longer unique
If I put that query into our CTE and run it I get … 45844 rows in our result set! Whoa! There are only 830 orders in total, what’s going on?
Remember my join to the “previous” row from the query…
LEFT OUTER JOIN CTE prev ON prev.rn = CTE.rn – 1
That was assuming that each row number was unique – because we were only querying for a single CusomerId. Now we have as many row number 1s as we have CustomerIds. We need to amend that ON clause…
;WITH CTE AS
(
SELECT CustomerID, OrderDate, OrderID
,ROW_NUMBER() OVER(PARTITION BY CustomerID ORDER BY OrderId) AS rn
FROM Orders
)
SELECT CTE.CustomerID, CTE.OrderDate, ISNULL(DATEDIFF(dd, prev.OrderDate, CTE.OrderDate),0) AS DIFF
FROM CTE
LEFT OUTER JOIN CTE prev ON prev.rn = CTE.rn - 1
AND CTE.CustomerID = prev.CustomerID
Now we get some sensible results – 830 rows that start with
You’ll need to something similar with our multiple-CTE example if you want to get the average days between orders for all the clients too (remembering that this was a contrived example to demonstrate multiple CTEs and not a good way to get this data).
;WITH CTE AS
(
SELECT CustomerID, OrderDate, OrderID
,ROW_NUMBER()
OVER(PARTITION BY CustomerID ORDER BY OrderId) AS rn
FROM Orders
), CTE2 AS
(
SELECT CTE.CustomerID,
ISNULL(DATEDIFF(dd, prev.OrderDate, CTE.OrderDate),0) AS DIFF
FROM CTE
LEFT OUTER JOIN CTE prev ON prev.rn = CTE.rn - 1
AND CTE.CustomerID = prev.CustomerID
)
SELECT CustomerID, 'Average', AVG(DIFF)
FROM CTE2
GROUP BY CustomerID
CTE Example 4 – Sequences – Generate lists
CTE queries can be used to generate sequences quite easily. There are several examples on this article http://www.codeproject.com/Tips/770879/Generating-a-Sequence-in-SQL.
If you combine a generated sequence with a CROSS JOIN you can generate things like roster for employees [^]
Imagine we want to generate a staff roster for the next two weeks.
We can use an rCTE to generate sequence of dates, starting from today for 14 days. If we then CROSS JOIN that small sequence with the Employees table we will get each row from the rCTE for each Employee. Like this:
;WITH CTE AS
(
SELECT GETDATE() AS datum
UNION ALL
SELECT DATEADD(DD, 1, datum)
FROM CTE
WHERE DATEADD(DD, 1, datum) < DATEADD(DD, 14, GETDATE())
)
SELECT CTE.datum, DATENAME(DW, datum), E.LastName
FROM CTE
CROSS JOIN Employees E
WHERE DATEPART(DW, datum) NOT IN (1,7)
ORDER BY E.LastName
Which yields the following results:
CTE Example 5 – Sequences – Find missing items
One neat trick with sequences of dates is to determine which dates are missing from a set of data. Similar to the example above we will use an rCTE to generate a sequence of all dates between a start date and an end date. The second query against it will LEFT OUTER JOIN to the Orders table. The results will be all dates where no orders were placed.
DECLARE @minDate DATE
DECLARE @maxDate DATE
SELECT @minDate = MIN(OrderDate), @maxDate = MAX(OrderDate) FROM Orders
;WITH CTE AS
(
SELECT @minDate AS datum
UNION ALL
SELECT DATEADD(DD, 1, datum)
FROM CTE
WHERE DATEADD(DD, 1, datum) <= @maxDate
), DateSeq AS
(
SELECT datum
FROM CTE
)
SELECT D.datum, DATENAME(DW, datum)
FROM DateSeq D
LEFT OUTER JOIN Orders O ON O.OrderDate = D.datum
WHERE O.OrderID IS NULL
OPTION (MAXRECURSION 0)
Tip – That semi-colon
When creating a CTE you may have other SQL statements before it, or after testing it you may want to insert the code for the CTE into a much larger procedure. You may then come across this error message:
Incorrect syntax near the keyword ‘with’. If this statement is a common table expression, an xmlnamespaces clause or a change tracking context clause, the previous statement must be terminated with a semicolon.
It’s quite an obvious error message and the solution is quite simple… just put a semicolon on the end of the line above the CTE declaration. But if you insert more code before the CTE you’ll have to remember to do that again. A common habit amongst MSSQL developers is to put the semicolon in front of the WITH as in the examples above.
Solving problems with SQL Window Functions
The techniques we used in the previous section will work with most versions of SQL Server, but one of the areas that have been continuously improved over releases is “Window Functions”.
These are often referred to as “OVER Functions” because of the keyword OVER that is used…
A window function is a function that’s applied to a set of rows defined by a window descriptor and returns a single value for each row from the underlying query. The purpose of the window descriptor is to define the set of rows that the function should apply to.[^]
These are a great way of still being able to get at the details of individual rows whilst also getting aggregate information.
I used the ROW_NUMBER() window function in the example above. By including the PARTITION clause I was able to apply the row numbering to each set of data for each Customer (without using a procedural loop). The “window descriptor” was CustomerID. We noted that if we omit the PARTITION BY then all of the rows in the table received a row number – i.e. without the window descriptor the set of rows affected was the entire table.
PARTITION BY divides the query result set into partitions. The window function is applied to each partition separately and computation restarts for each partition.[^]
That query worked well, and got the data we wanted but it would be nice if there was a more concise way of doing it – rather than having to generate extra data (ROW_NUMBER()), create a CTE and then having to do a self-join. Well for SQL Server 2012 (not the Express version) and later versions (including 2014 Express), additional SQL Window Functions are available to do just that.
This query gets exactly the same results as before
SELECT CustomerID, OrderDate,
ISNULL(DATEDIFF(dd, LAG(OrderDate, 1)OVER (PARTITION BY CustomerID
ORDER BY CustomerID, OrderId), OrderDate),0) AS DIFF
FROM Orders
Whereas before we used prev.OrderDate in the DATEDIFF function, now we’re using
LAG(OrderDate, 1) OVER (PARTITION BY CustomerID ORDER BY OrderId)
Notice how the OVER clause is exactly the same one I used for ROW_NUMBER() previously? That makes sense because we’re still interested in the data for each CustomerID, and we still want the Orders in OrderDate order. The row number has gone though and we’re not using any self-joins now so table alias prev
has also disappeared.
The LAG function is doing it all for me. That part of the query now says
Get the value of OrderDate from a preceding row | LAG(OrderDate, |
Going back one row from the current row in the collection | , 1) |
The collection should be partitioned on CustomerID I .e. don’t pick up the previous OrderDate if it wasn’t for the same CustomerID as the current row. | OVER (PARTITION BY CustomerID |
The order of the collection should be by CustomerID then OrderId | ORDER BY CustomerID, OrderId) |
The new query is much neater, and it’s actually easier to see what is going on. LAG (the forward looking equivalent is LEAD) is one of the Analytic Functions [^] provided in later versions of SQL Server.
Another example – remember one of the “requirements” from right at the start that can prompt you to think of using a loop?
“Work out the total and the average of all of the values for each product”
Let’s try to get details of all Products, with the total quantity sold per product, the average quantity per order per product and the number of orders of each product.
That looks quite difficult but it’s actually quite easy; I can use some of the SQL Aggregate functions and GROUP BY …
SELECT DISTINCT P.ProductID, p.ProductName,
COUNT(OD.OrderID) AS CountOfOrders,
AVG(Quantity) AS AvgPerOrder,
SUM(Quantity) AS TotalSold
FROM [Products] P
LEFT JOIN [Order Details] OD on OD.ProductID = P.ProductID
GROUP BY P.ProductID, p.ProductName
ORDER BY P.ProductName
But say I also have to show the Quantity Per Unit, the Units In Stock, the Units On Order and the Reorder Level in our results.
When we add those fields to our SELECT list we get errors
--Msg 8120, Level 16, State 1, Line 1 --Column ‘Products.QuantityPerUnit’ is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause. --Msg 145, Level 15, State 1, Line 1 --ORDER BY items must appear in the select list if SELECT DISTINCT is specified.
One solution is extend our GROUP BY clause to include ALL of the other fields we want to show…
SELECT DISTINCT P.ProductID, p.ProductName, QuantityPerUnit, UnitsInStock, UnitsOnOrder, ReorderLevel,
COUNT(OD.OrderID) AS CountOfOrders,
AVG(Quantity) AS AvgPerOrder,
SUM(Quantity) AS TotalSold
FROM [Products] P
INNER JOIN [Order Details] OD on OD.ProductID = P.ProductID
GROUP BY P.ProductID, p.ProductName, QuantityPerUnit, UnitsInStock, UnitsOnOrder, ReorderLevel
ORDER BY P.ProductName
OR we have to do some sort of sub-query and then a join…
SELECT DISTINCT SummaryData.ProductID, SummaryData.ProductName,
Prod.QuantityPerUnit, Prod.UnitsInStock, Prod.UnitsOnOrder, Prod.ReorderLevel,
SummaryData.CountOfOrders,SummaryData.AvgPerOrder,SummaryData.TotalSold
FROM (SELECT DISTINCT P.ProductID, p.ProductName,
COUNT(OD.OrderID) AS CountOfOrders,
AVG(Quantity) AS AvgPerOrder,
SUM(Quantity) AS TotalSold
FROM [Products] P
LEFT JOIN [Order Details] OD on OD.ProductID = P.ProductID
GROUP BY P.ProductID, p.ProductName) SummaryData
INNER JOIN [Products] Prod ON SummaryData.ProductID=Prod.ProductID
ORDER BY SummaryData.ProductName
It’s starting to look horribly messy!
However we can use OVER clauses with those aggregate functions instead (since SQL2008) and PARTITION the data into the windows we need getting rid of the GROUP BY clause altogether …
SELECT DISTINCT P.ProductID, p.ProductName, QuantityPerUnit, UnitsInStock, UnitsOnOrder, ReorderLevel,
COUNT(OD.OrderID) OVER(PARTITION BY P.ProductID) AS CountOfOrders,
AVG(Quantity) OVER(PARTITION BY P.ProductID) AS AvgPerOrder
,SUM(Quantity) OVER(PARTITION BY P.ProductID) AS TotalSold
FROM [Products] P
LEFT JOIN [Order Details] OD on OD.ProductID = P.ProductID
ORDER BY P.ProductName
Personally, I think that looks a lot neater – and it’s easier to work out what’s going on. Don’t forget – I still didn’t have to resort to any procedural loops to get the data I needed. I get all the information I need from a single query.
You can read more about SQL Window Functions here – [^].
Tabular Results (Rows to Columns) - Playing with Pivot
Another concept that beginners struggle with is when there is a requirement to tabulate a set of results.
Imagine we have been asked to get the total number of orders placed by country by year. It’s easy enough to get hold of the information
SELECT DATEPART(Year,OrderDate) AS OrderYear,
ShipCountry, COUNT(OrderID)
FROM Orders
GROUP BY DATEPART(Year,OrderDate), ShipCountry
But the results come out as one row per year, per country …,/p>
It’s not quite what we were after, something more like this would be appropriate
I could loop through our results building up a new table in the right format… but you know I don’t have to do that. I can do a simple PIVOT query [^] instead…
SELECT *
FROM
(
SELECT DATEPART(Year,OrderDate) AS OrderYear, ShipCountry, OrderID
FROM Orders
) AS source
PIVOT
(
COUNT(OrderID)
FOR OrderYear IN ([1996],[1997],[1998])
)AS pvt
ORDER BY ShipCountry
Notice that query that I’m calling source
– it’s very similar to our original query to extract the data, but we’ve lost the GROUP BY.
The PIVOT section is going to count the orders for me based on each year
FOR OrderYear IN ([1996],[1997],[1998])
i.e. the list of “columns” in the IN clause. Note those years are not strings, they represent the columns of the final result set). The PIVOT rotates the data based on that list.
In other words, SQL looks for values in OrderYear that match the names in the list of columns, and then counts the number of OrderIDs that match up (You can use other aggregate functions instead of COUNT). The results are partitioned based on the other items in the SELECT statement of source
. To demonstrate this, if we take ShipCountry out of the SELECT it will do the COUNT across the entire result set from source
giving the following results
Great – We can now easily produce tabulated from our data.
To summarise, to pivot data you need a source query … this should contain the field that you want to rotate the data on, and any fields that you want to partition the data by. Then you need the key word PIVOT followed by instructions on how to rotate the data. That section must use an aggregate function such as COUNT, MAX, MIN etc. The FOR clause will define the columns in the result table … only those columns will show, you don’t need to include every possible value if you are not interested in that information but, importantly if you don’t include all possible values you can inadvertently exclude information you really wanted.
Gotcha! – Column Names can’t be numbers
Because a column name can’t be all numeric we had to surround the “column” names with square brackets - [] to make it a valid column name. Hence
FOR OrderYear IN ([1996],[1997],[1998])
And not
FOR OrderYear IN (1996,1997,1998)
The same is true if values that you are trying to pivot on are reserved words in SQL, for example
FOR SomeData IN ([Name],[Date],[Sum],aValue)
And it is also true if values you are trying to pivot on contain spaces, for example
FOR Title IN ([Sales Representative],[Sales Manager])
This becomes quite important when you are generating the SQL from a dynamic query …
Dynamic SQL Queries and Comma Separated Lists
The problem with Pivot is that as time passes there will be Order Dates in subsequent years. That data will not be included in our query because we’ve hard-coded the columns we want to see… 1996, 1997 and 1998.
We really don’t want to have to update the query every year. It would be nice if we could generate that list from the data itself.
Getting the information is simple enough
SELECT DISTINCT DATEPART(Year, OrderDate) AS OrderYear from Orders
I need to get that list of years into a comma-separated list, complete with the square brackets around the numbers. What I don’t want to do though, is use a procedural loop to build up that list! Fortunately there are a couple of ways of getting around that.
We can use COALESCE [^] in a fairly simple query as below
DECLARE @listStr VARCHAR(MAX) = null
;WITH years AS
(
SELECT DISTINCT DATEPART(Year, OrderDate) AS OrderYear from Orders
)
SELECT @listStr = COALESCE(@listStr+'],[' ,'') + CAST(OrderYear AS Varchar)
FROM years
PRINT '[' + @listStr + ']'
Which give us our list of columns like this.
[1998],[1996],[1997]
Note I had to manually add in the first and last square brackets, and you don't initialise @listStr with the starting bracket (otherwise you end up with an empty column name [ ] – try it and see).
You can also use STUFF and XML FOR (since SQL 2005) to generate comma-separated lists but it's a little fiddly when trying to include the square brackets so we’ll keep it simple and use the technique above. You can see an example at Get a comma-separated list of values in SQL with FOR XML[^].
Now my problem is how to get that list of columns into my query. For this I’m going to use some Dynamic SQL.
Dynamic SQL is a term used to mean SQL code that is generated programatically (in part or fully) by your program before it is executed. As a result it is a very flexable [sic] and powerful tool. You can use dynamic sql to accomplish tasks such as adding where clauses to a search based on what fields are filled out on a form or to create tables with varying names[^]
I can place my SQL query into a string variable, using the @listStr variable that I derived above to list the columns I want. I’ve broken it down a bit here to make it a bit clearer.
DECLARE @sql NVARCHAR(MAX) =
N'SELECT * FROM (select DATEPART(Year,OrderDate) AS OrderYear'
SET @sql = @sql + N', ShipCountry, OrderID from Orders) as s '
SET @sql = @sql +
N'PIVOT (COUNT(OrderID) FOR OrderYear IN ([' + @listStr + N']) )'
SET @sql = N' AS pvt order by ShipCountry'
If you print out the contents of @sql you can see that we get the same query we had earlier (’ve added some spaces to make it more readable)
SELECT *
FROM
(
SELECT DATEPART(Year,OrderDate) AS OrderYear, ShipCountry,
OrderID from Orders) AS s
PIVOT (COUNT(OrderID) FOR OrderYear IN ([1998],[1996],[1997])
) AS pvt
ORDER BY ShipCountry
We can then run the query with this statement
EXEC sp_executesql @sql
That query will still work as more data is added, if data is added for 1999 @listStr will automatically pick up the requirement for the new column and [1998],[1996],[1997], [1999] will be inserted into the @sql variable.
Hint – notice I said “If you print out the contents of @sql you can see…”. It’s always a good idea to PRINT and examine the SQL you are generating before including the EXEC. You can copy the SQL from the Message Pane and run it directly to see if it works. It’s much easier to debug it that way than just running it straight away!
Gotcha! – how to restrict the number of columns shown
After some time, the results of the query above are going to get a little unwieldy. It’s more likely that the users are going to want to see only the last 5 years say of data (for example).
There is still no need to resort to a loop to do this, I can just restrict the data that goes into deriving the column list for the pivot.
It might look as if this query would be enough to limit the columns to the last 5 years
;WITH years AS
(
SELECT DISTINCT TOP 5 DATEPART(Year, OrderDate) AS OrderYear from Orders
)
SELECT @listStr = COALESCE(@listStr+'],[' ,'') + CAST(OrderYear AS Varchar)
FROM years
But if you look carefully at the query that was generated for the dynamic SQL above, the list of columns looks like this:
OrderYear IN ([1998],[1996],[1997])
They didn't come out in order at all. In fact
When TOP is used in conjunction with the ORDER BY clause, the result set is limited to the first N number of ordered rows; otherwise, it returns the first N number of rows in an undefined order[^]
So you must remember to add an ORDER BY clause.
Furthermore, the TOP 5
and the ORDER BY
must appear in the same part of the query. If I try to put the ORDER BY within the body of the CTE but the TOP 5 in the SELECT from the CTE like this:
;WITH years AS
(
SELECT DISTINCT DATEPART(Year, OrderDate) AS OrderYear from Orders ORDER BY OrderYear
)
SELECT TOP 5 @listStr = COALESCE(@listStr+'],[' ,'') + CAST(OrderYear AS Varchar)
FROM years
I’ll get an error:
The ORDER BY clause is invalid in views, inline functions, derived tables, subqueries, and common table expressions, unless TOP, OFFSET or FOR XML is also specified.
Either of these methods will work however:
;WITH years AS
(
SELECT DISTINCT TOP 5 DATEPART(Year, OrderDate) AS OrderYear from Orders ORDER BY OrderYear
)
SELECT @listStr = COALESCE(@listStr+'],[' ,'') + CAST(OrderYear AS Varchar)
FROM years
Or
;WITH years AS
(
SELECT DISTINCT DATEPART(Year, OrderDate) AS OrderYear from Orders
)
SELECT TOP 5 @listStr = COALESCE(@listStr+'],[' ,'') + CAST(OrderYear AS Varchar)
FROM years ORDER BY OrderYear
There is additional reading material on dynamic sql in this CodeProject article Two Handy Techniques for Creating Dynamic SQL in Stored Procedures[^].
Comma-Separated Lists in Groups
Imagine a scenario where you want comma-separated lists of values for a set of data - in other words you want to GROUP by a column
As an example, let’s get a list of the Products for each Order. The Product Ids are stored on the [Order Details] table and we can use FOR XML
and STUFF
to generate a list of those Ids by order ID.
If we put that information into a CTE we can get the rest of the details directly off the [Orders] table:
;WITH CTE AS
(
SELECT OrderID, products =
STUFF((SELECT ', ' + CAST(ProductID as nvarchar)
FROM [Order Details] p1
WHERE p1.OrderID = p2.OrderID
FOR XML PATH('')), 1, 2, '')
FROM [Order Details] p2
GROUP BY OrderID
)
SELECT CustomerID, OrderDate, CTE.products, CTE.OrderID
FROM [Orders] O
INNER JOIN CTE ON O.OrderID=CTE.OrderID
Which gives us the following results
Temporary Tables and Table Functions
You might end up with a situation where you have some really complex SQL that produces some data that you then want to use in a CTE or in some dynamic SQL or a really big set of joins. In these cases it is often worth considering using a Temporary table, a table variable or writing a function that will return a table to simplify the final query.
I’m not going to discuss these methods further here as they have been very well covered elsewhere. I’m only including this section to remind you that you can use tables to avoid loops. Have a read of the following articles for more information:
I definitely need a Loop!
There may come a time when you just have to use a loop. There are very few occasions that I can think of: Manipulating table schemas perhaps or splitting strings.
If you find yourself in this situation don’t just leap straight in to using SQL Cursors. There are alternatives…
Loop outside of SQL
Bear in mind that
T-SQL was not done with loop performance in mind[^]
Why would it be – it is set-based technology, so loops are probably more of an after-thought than an efficient way of processing data.
Many programming tools such as the .NET family of languages, Report generators etc., can do this quite efficiently. So do consider passing a dataset back to your application and allow that application to do the loop processing. This is especially true of any visual enhancements (e.g. add preceding zeros to a value) which should be left to the Presentation layer of your application.
While Loop within SQL
The WHILE loop [^] in SQL is usually used to control access to a cursor, but they can be used in their own right as well. This approach is useful if you want to avoid locking down a table while you traverse a set of records. Here is a contrived example
DECLARE @min int
DECLARE @max int
DECLARE @curr int
SELECT @min = MIN(ProductID), @max = MAX(ProductID)
FROM Products WHERE CategoryID = 2
SET @curr = @min
WHILE @curr <= @max
BEGIN
DECLARE @prodname NVARCHAR(125) = NULL
SELECT @prodname = ProductName
FROM Products WHERE ProductID = @curr AND CategoryID = 2
IF NOT @prodname IS NULL
BEGIN
PRINT 'Working with ' + CAST(@curr as varchar)
END
SET @curr = @curr + 1
END
For Loop within SQL
Actually I’m cheating here. There is no FOR loop in T-SQL currently, but you can simulate a FOR-loop using WHILE [^]
Here is an example of using a “For loop” to split a string based on a separator. Note - This is not my work - the example is taken directly from SQLServerCentral.com[^]
CREATE FUNCTION [dbo].[fnSplitString]
(
@string NVARCHAR(MAX),
@delimiter CHAR(1)
)
RETURNS @output TABLE(splitdata NVARCHAR(MAX) )
BEGIN
DECLARE @start INT, @end INT
SELECT @start = 1, @end = CHARINDEX(@delimiter, @string)
WHILE @start < LEN(@string) + 1 BEGIN
IF @end = 0
SET @end = LEN(@string) + 1
INSERT INTO @output (splitdata)
VALUES(SUBSTRING(@string, @start, @end - @start))
SET @start = @end + 1
SET @end = CHARINDEX(@delimiter, @string, @start)
END
RETURN
END
You can even use things like BREAK and CONTINUE within a loop to add functionality – but a caveat needs to be mentioned here – are you sure you should be doing this processing in your database or does it really belong in your business layer?
Cursors in SQL
Finally, of course, there are CURSOR loops in SQL. The disadvantages of Cursors have been discussed hundreds of times and it would be pointless for me to just repeat those arguments here. Instead, here are some articles for further reading.
You will hopefully have guessed by now that I don’t like Cursors!
Gotcha! Be especially wary of Infinite Loops
Anytime you use a WHILE loop, be it specifically for a WHILE construct, whether you are simulating a FOR loop or whether you are controlling the use of a CURSOR, be very careful to include the loop variable or condition on all paths through the code. To turn the example above into an infinite loop all I have to do is move one line to the wrong position and we have a problem:
DECLARE @min int
DECLARE @max int
DECLARE @curr int
SELECT @min = MIN(ProductID), @max = MAX(ProductID)
FROM Products WHERE CategoryID = 2
SET @curr = @min
WHILE @curr <= @max
BEGIN
DECLARE @prodname NVARCHAR(125) = NULL
SELECT @prodname = ProductName
FROM Products WHERE ProductID = @curr AND CategoryID = 2
IF NOT @prodname IS NULL
BEGIN
PRINT 'Working with ' + CAST(@curr as varchar)
SET @curr = @curr + 1
END
END
Once this loop hits a row it needs to skip over, the variable @curr is no longer incremented because that line of code is within the inner IF statement.
Summary and Further Reading
I hope I’ve demonstrated that 99 times out of 100 you don't really need to use an actual loop when dealing with SQL queries.
I’ve restricted all of my examples to using SQL Queries in Management Studio. Specifically I haven’t even touched upon the use of Data Warehousing, Business Intelligence or Cubes. If you are interested in reading further then you might want to start with these introductions:
Here are some suggestions for further reading on Performance considerations as promised.
With an article of this type there is not a lot of original thinking, just some experience and lots of research. I’ve attempted to attribute all my sources in the body of the article, but if you think I’ve missed one, or have a better example, please let me know in the comments section at the end and I will remedy the omission.
History
Initial version - 23 March 2017
Added an example of CSV lists - 31 March 2017
Slowing down. Now working as a "Business Automation Consultant".
"Abandon all hope of ever being understood, all ye who have a good idea but don't say it perfectly." Doug Bernard