Click here to Skip to main content
15,867,568 members
Articles / Programming Languages / C#
Tip/Trick

SSIS SCD2 Script Lookup

Rate me:
Please Sign up or sign in to vote.
5.00/5 (4 votes)
25 Feb 2016CPOL4 min read 18.1K   300   7   1
C# Script in SSIS which can be used to replace standard (slow) SSIS SCD2 Lookup

Introduction

If you ever tried to create SCD2 lookup using standard Microsoft SSIS components (Lookup and Merge), you probably noticed that it quickly becomes the biggest bottleneck in your Data Flow. Although Merge seems reasonably quick, sometimes it can't be used because of the volumes of data in two tables and data itself (a lot of SCD2 changes per record).

The following picture from this blog shows just how slow Lookup and Merge components can be comparing to Script Lookup which I will present in this trick:

Image 1

There are different approaches you could use to solve this performance issue:

  • Convince your client that they actually don't need SCD2 type dimension
  • Multicast your data flow onto several Lookup components (parallel execution)
  • Buy some fancy SSIS component pack which provides multi-thread SCD2 lookup component
  • Etc.

It this tip, I will describe an approach which uses standard Microsoft SSIS components and one very clever piece of code I found in the above mentioned blog which I modified to use any type of column.

Background

Because this is not an article, I won't bother you too much with explaining the code. If you want to completely understand the power behind the code, please familiarize yourself with SCD2, C# Dictionaries and B-tree structures.

Using the Code

Script lookup is used in a similar way as standard Lookup component. After script component is placed in Data Flow and connected to data source above it, copy the attached C# code and paste it into Script component. Note that code is written in C#.

Note that you will need to manually create output columns on this script component.

The only thing left is to modify few lines of code at the beginning of the script. Let's see how to do this and what each configurable line means.

Let's take an example from the attached code:

C#
// START CHANGE
public Dictionary<String, String> SurrogateKeys = new Dictionary<String, String>{
            { "AccountNumber", "AccountNumber"} // Column as ColumnName - 
						// you can define different output name
};
public readonly Dictionary<string, string> Mappings = new Dictionary<string, string> 
		{ { "CardId", "CardNumber" } }; // FactColumn then DimColumn
public readonly String[] DimDateRangeKeys = { "ValidFrom", "ValidTo" }; // From - To
public const String FactDateKey = "ViewStart";
public const String TableName = "[dbo].[DimCard]";
public readonly bool IsDateRangeOverlapped = false;
public readonly bool IsDateTimeMode = false;
// END CHANGE

The best way to describe each variable usage is through the example. Let's say our Data Flow uses one source of data (OLE DB Source for example) and it reads all columns from fact table named Fact1. T-SQL equivalent would be:

SQL
SELECT F.*
FROM Fact1 F

So, let's start with the variables.

TableName

This variable represents the name of the (dimension) table which we want to use in lookup and which contains historic (SCD type 2) data. In our case, it is table DimCard.

Mappings

Similar to Lookup component, we need to map input columns with columns from lookup table in script component. In our case, we are joining input data flow (Fact1) with DimCard table using CardId column from fact table and CardNumber column from DimCard table. T-SQL equivalent would be:

SQL
SELECT F.*
FROM Fact1 F
JOIN DimCard C on C.CardNumber = F.CardId

DimDateRangeKeys & FactDateKey

As SCD2 is time dependent, we need to define date or datetime columns we want to use to join fact and dim (lookup) tables.

SCD2 dimension table must have from and to date columns which represent record validity period. These two columns are then stated in DimDateRangeKeys.

FactDateKey represent column in fact table used to "probe" dimension table.

T-SQL equivalent would be:

SQL
SELECT F.*
FROM Fact1 F
JOIN DimCard C on C.CardNumber = F.CardId and F.ViewStart between C.ValidFrom and C.ValidTo

SurrogateKeys

These are output (lookup) columns from the script component. The first element is column name from lookup table and the second is output alias.

For example:

SQL
SELECT F.*, C.AccountNumber as AccountNumber
FROM Fact1 F
JOIN DimCard C on C.CardNumber = F.CardId and F.ViewStart between C.ValidFrom and C.ValidTo

IsDateRangeOverlapped

Sometimes, SCD2 type dimension ValidTo value from previous and ValidFrom value from current record can overlap. In that case, you need to set this parameter to true to avoid duplication of data.

IsDateTimeMode

ValidFrom and ValidTo in dimension tables are usually of date type, but some systems use datetime column type. In that case, you need to set this parameter to true to make script work properly.

Points of Interest

I have tested this code several times and it is significantly faster than standard Lookup component.

A small downside of the attached code is that it works only in Full Cache mode as I didn't have time to implement Partial Cache. Thus, you need to be careful when using it - if your lookup (dimension) table has many rows, component might take a lot of memory during package execution and performance might degrade. If you are good in C# and you managed to understand the code, you could fix this by implementing partial cache.

A big downside in using this approach is maintenance. What if code has a bug and you already used it in 37 places in your project? You could search and replace piece(s) of code, but script components are validated only when it's script editor is 'closed'. At the time of writing this tip, I haven't found a good way of updating (refreshing) script components. This can be solved easily if you are able to find "Script Component Refresher" or if you create a Custom SSIS Component.

History

  • December 2015 - Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Australia Australia
Business Intelligence Professional passionate about turning data into meaningful information and helping business grow and make better "data-driven" decisions.

Comments and Discussions

 
-- No messages could be retrieved (timeout) --