Database Normalization

Gaston Verelst

5.00/5 (2 votes)

May 23, 2016

CPOL

8 min read

10237

Introduction After I published last week’s article somebody asked me why I stated that (in this case) the problem didn’t originate in having a function with logic, but merely having to work with a database that is not normalized. So let’s talk … Continue reading →

Introduction

After I published last week’s article somebody asked me why I stated that (in this case) the problem didn’t originate in having a function with logic, but merely having to work with a database that is not normalized. So let’s talk about database normalization.

Smells

We wrote a T-SQL user function as a C# function. Let’s show this user function here again:

ALTER FUNCTION [dbo].[GetNextSalesOrderCode]

(

@Season nvarchar(3),

@Prefix nvarchar(3),

@RepresentativePrefix nvarchar(2)

)

RETURNS nvarchar(50)

BEGIN

DECLARE @Code as nvarchar(50)

DECLARE @PrevCode as nvarchar(50)

declare @MinSoCode as int

SELECT top 1 @MinSoCode = C.MinSoCode

FROM dbo.RepresentativeComputers C

INNER JOIN dbo.Representatives R ON C.RepresentativeComputer_Representative = R.Id

WHERE (R.Prefix = @RepresentativePrefix) AND (C.ComputerName = N’ERP’)

SELECT top 1 @PrevCode = Right(SO.Code,5)

FROM dbo.SalesOrders SO

INNER JOIN dbo.Representatives R ON SO.SalesOrder_Representative = R.Id

where SUBSTRING(SO.Code,4,3)= @Season

and R.Prefix=@RepresentativePrefix

and cast(Right(SO.Code,5) as int)>=@MinSoCode

order by Right(SO.Code,5) DESC

if @PrevCode is null

BEGIN

set @Code= @Prefix+‘.’+ @Season + ‘-‘ + @RepresentativePrefix + FORMAT(@MinSoCode,‘00000’)

END

ELSE

BEGIN

set @Code= @Prefix+‘.’+ @Season + ‘-‘ + @RepresentativePrefix + FORMAT(CONVERT(int, @PrevCode)+1,‘00000’)

END

RETURN @Code

END

Looking at this function we see a couple of things:

The function does more than 1 thing. It will first find a record in the table RepresentativeComputers, hereby relying on a magic string ‘ERP’. This is for a small company, with a limited number of sales representatives so they enter some data in the tables manually. This is not necessarily bad but it has to be documented somewhere. Also, having a T-SQL function that needs multiple queries to do its work is not always bad, but it is a (light) red flag already. Just to nitpick a bit: using “top 1” here is a bit dangerous without an “order by” clause. You’ll never know which “first” row it will take as this may change when the query is adapted, or indexed are modifed, or other changes happen in the database. In this case it seems that there will only be 1 row returned anyway, so the “top 1” can be safely removed.
The select statement over the SalesOrders table is worse. We see that the where clause is pretty complex, using SUBSTRING, RIGHT and CAST functions. We’ll see in a moment why this is a big red flag.
In the last part a new code is composed coming from the result of the query over the sales orders.
No error checking. This is not the topic of this post, so I won’t elaborate on this. If you want to know more check out try / catch in T-SQL.

More in detail

SELECT top 1 @PrevCode = Right(SO.Code,5)

FROM dbo.SalesOrders SO

INNER JOIN dbo.Representatives R ON SO.SalesOrder_Representative = R.Id

where SUBSTRING(SO.Code,4,3)= @Season

and R.Prefix=@RepresentativePrefix

and cast(Right(SO.Code,5) as int)>=@MinSoCode

order by Right(SO.Code,5) DESC

Using SQL Server Management Studio we create a new query and display the actual execution plan:

Filling in some actual values for the @variables in the query we obtain this plan:

The main important thing to notice here is that on the right there is a Clustered Index Scan over [SalesOrders]. This means that we are actually doing a full table scan using the clustered index. In this case there are only about 45K records in the table, but this will grow in the future and start to create problems.

Let’s review and apply Codd’s database normalization rules.

Who is this Codd?

Edgar F Codd.jpg Wikipedia has a nice page about Edgar F.Codd, explaining a bit more about his achievements. I think he is mostly known for his database normalization rules.

From Wikipedia:

Database normalization, or simply normalisation, is the process of organizing the columns (attributes) and tables (relations) of a relational database to minimize data redundancy.

Normalization involves decomposing a table into less redundant (and smaller) tables without losing information, and then linking the data back together by defining foreign keys in the old table referencing the primary keys of the new ones. The objective is to isolate data so that additions, deletions, and modifications of an attribute can be made in just one table and then propagated through the rest of the database using the defined foreign keys.

There are (initially) 3 normal forms that can be applied over a first model of the database. They are called conveniently 1NF, 2NF and 3NF. After 3NF more normal forms can be applied but this usually only has an academic use. When the database is in 3NF it will be optimized for OLTP use.

First Normal Form

From Wikipedia:

First normal form (1NF) is a property of a relation in a relational database. A relation is in first normal form if and only if the domain of each attribute contains only atomic (indivisible) values, and the value of each attribute contains only a single value from that domain. The first definition of the term, in a 1971 conference paper by Edgar Codd, defined a relation to be in first normal form when none of its domains have any sets as elements.

Looking back at our example we see that the Code field is actually composed of 2 fields and a prefix:

The prefix is always ‘SO’. So there is no need to store this in the database (unless it would change later, and then it will become a separate field anyway).
SUBSTRING(SO.Code,4,3) contains the code for the season, always 3 characters.
Right(SO.Code,5) contains the actual code, and this is the field that needs to be calculated in our T-SQL function.

Splitting the Code field into season and Code would make the query simpler already, it would become something like:

SELECT top 1 Code

FROM dbo.SalesOrders SO

INNER JOIN dbo.Representatives R ON SO.SalesOrder_Representative = R.Id

where Season = ‘151’

and R.Prefix=9

and Code>=2001

order by Code DESC

Having an index on the Code field and the Season field would improve the performance of this query and make it scalable.

This doesn’t mean that everything must be split in separate fields! For example, there is not much use of splitting a date in Year, Month, Day fields (unless you have some very specific needs for that).

Second normal form

Second Normal Form (2NF): No field values can be derived from another field.

We would fall in this trap if we created a SOCode field, that would contain the full code (‘SO.@Season.@Code’), so doing the reverse of what happened in our test database. In this case when one of the 2 fields is modified the SOCode field must be modified as well. Of course this can be done in a couple of ways in SQL Server, but it is usually better to calculate this in the client. Here are some ways that we can indeed solve this, but I will not explain them further in this post:

Create insert / update triggers that will automatically update the SOCode field.
Revoke insert / update permissions on the table and only allow insert / update operations via stored procedures, in which you calculate the SOCode field.
Create a view / table function over the table with the SOCode field as an extra (calculated) field.
Create a Computed Column.

As you can see there are some ways to help you out, but they all require processing at the database level. And often the database is already the bottleneck for performance. So it is better to perform these (easy) calculations on the client side, if possible. Another side effect is that there is redundant data in the table, which can be good for reporting, but not for an OLTP database.

Another example of this would be calculated fields like Total = price * quantity that are stored in the database.

Third normal form

Third Normal Form (3FN): No duplicate information is permitted.

In the modified table the Season field is stored directly in the SalesOrders table. When more information about seasons would be required (maybe a time period); an additional table needs to be created to accommodate for this. This will enforce referential integrity in the database.

When should we NOT normalize?

When the database is in 3NF it will be optimized for OLTP use. OLTP means that a lot of updates are done in the database. So the full range of CRUD operations is executed. If you want to run reports over your database then having many separate tables can be a problem. You’ll typically want to denormalize your database to simplify queries for reporting. And often a separate database is created for this, so you have a solution with an OLTP database for entering / updating data, and a denormalized database for querying. If you want to perform more advanced queries you’ll end up at the other end, being OLAP databases.

Conclusion

Most experienced developers will probably perform the normalization steps automatically. And I suppose that most of you (or I) don’t know by heart which is 1NF, 2NF or 3NF, but we are capable of building good databases.

Usually a database starts easy and normalized, and when it grows; a moment will come that shortcuts are taken. Eventually you’ll need to normalize the database anyway, be it for space requirements, or (probably) for performance requirements. So it is best to do this right away and think about your database changes beforehand. Don’t forget that in line of business applications the database plays the most important role.

References

Edgar F.Codd

Database Normalization

Normalization of Database