Click here to Skip to main content
15,868,016 members
Articles / Database Development / SQL Server
Article

SubStringInstance SQL CLR - Find the nth Instance of a String

Rate me:
Please Sign up or sign in to vote.
4.41/5 (5 votes)
14 Dec 2013CPOL7 min read 21.7K   112   5   2
SQL CLR function to find the nth instance of a string within a string

Introduction

This article shows how to build a SQL Server CLR function that returns the nth instance of a substring within a string. This is useful for extracting sub-keys from compound surrogate keys in data warehouses and OLAP cube data staging situations. This example can also be easily applied to other string and custom CLR function implements that can make SQL much easier to use and make your SQL scripts much more readable and understandable.

Background

While working on a data warehouse project for a large telecommunications firm, I encountered database records that used a compound primary key for uniquely identifying individual records that were organized in a hierarchal form as they were generated from a hierarchy of telecommunications network equipment.

The network hierarchy could be simply represented as something like this:

  • Master Network Server (MNS)
    • Regional Network Servers (RNS)
      • City Network Servers (CNS)
        • Specific Network Server (SNS)

Thus a network performance record from the lowest level (SNS) servers (the ones that do all the work) would have a record somewhat like this:

TimeStampMNS_IDRNS_IDCNS_IDSNS_IDMetric_TypeValue (%)
2013-12-10 15:46:56.000MNS01RNS07CNS-ATLSNS12CPU Load52.6

While a simple large integer sequence (such as with an identity column) or a GUID could have been used for a unique record key in the data warehouse, a compound key using the network server hierarchy was used. This has some advantages such as being human readable and understandable as well as being able to be used more easily downstream for ETL and OLAP cubes. We won’t discuss or debate the pros and the cons of the various unique key approaches here as they are beyond the scope of this article and in this situation I had no control over the compound key form used.

Thus, the compound key used was a concatenation of the server network hierarchy:

Compound Server Key = MNS_ID + RNS_ID + CNS_ID + SNS_ID

A character, sometimes a dash and sometimes an underscore, were used to delimit the server IDs for different tables. (I know, not good, consistent practice but that’s what you find when multiple people in multiple locations over time do their own implementations.)

And when used with the record Time_Stamp and the Metric_Type became the unique primary key of the database records table:

Unique Record Key = Time_Stamp + Compound Server Key + Metric_Type

This was very easy to load in SQL of course by simply using string concatenation to assemble the server IDs from their separate categories and insert the value into a new column:

TimeStampServer_KeyMetric_TypeValue (%)
2013-12-10 15:46:56.000MNS01_RNS07_CNS-ATL_SNS12CPU Load52.6

So here, we have a four level/item compound server key that represents the concatenation of four individual columns. In practice, in my situation, the server key was typically comprised of six levels, often of variable length, and occasionally had other strings concatenated as well. The variable length of the server key sub-fields makes this a difficult exercise otherwise we could just use SUBSTRING with fixed positions and lengths to easily retrieve all of the server names in the hierarchy.variety of manners including:

  1. Calculating delimiter indexes using CHARINDEX then using SUBSTRING to extract the server strings
  2. Use a “Numbers” or “Tally” table technique
  3. Use an ordered split function in SQL
  4. Using a SQL CLR Regex function
  5. Using some other awkward and complex SQL technique

Here, for this article, I will show the CHARINDEX and SUBSTRING approach as it is straightforward and easily understandable.

SQL
DECLARE @Server_Key varchar(50); 
DECLARE @Server_Key_Length int; 
DECLARE @First_Index int; 
DECLARE @Second_Index int; 
DECLARE @Third_Index int; 
    
SET @Server_Key = 'MNS01_RNS07_CNS-ATL_SNS12'; 
SET @Server_Key_Length = LEN(@Server_Key); 
SET @First_Index = CHARINDEX('_', @Server_Key); 
SET @Second_Index = @First_Index + CHARINDEX('_', _
SUBSTRING(@Server_Key, (@First_Index + 1), LEN(@Server_Key))); 
SET @Third_Index = @Second_Index + CHARINDEX('_', _
SUBSTRING(@Server_Key, (@Second_Index + 1), LEN(@Server_Key))); 
    
-- Debug Scaffolding: 
PRINT 'Server_Key: ' + @Server_Key; _
PRINT 'Server_Key_Length: ' + CONVERT(VARCHAR(5), @Server_Key_Length); 
PRINT 'First_Index: ' + CONVERT(VARCHAR(5), @First_Index); 
PRINT 'Second_Index: ' + CONVERT(VARCHAR(5), @Second_Index); 
PRINT 'Third_Index: ' + CONVERT(VARCHAR(5), @Third_Index); 
    
SELECT @Server_Key; SELECT SUBSTRING(@Server_Key, 0, @First_Index - 1) as MNS; 
SELECT SUBSTRING(@Server_Key, @First_Index + 1, @Second_Index - @First_Index - 1) as RNS; 
SELECT SUBSTRING(@Server_Key, @Second_Index + 1, @Third_Index - @Second_Index - 1) as CNS; 
SELECT SUBSTRING(@Server_Key, @Third_Index + 1, LEN(@Server_Key)) as SNS; 

In this T-SQL code, we first calculate the three index values of the underscore delimiter in the Server_Key column using the CHARINDEX function and hold those values in declared variables. We then use these calculated indexes with the SUBSTRING function to extract the individual server names. Debug scaffolding PRINT statements are included so you can see the value of the calculated indexes.

Note that in Oracle PL/SQL, instead of the CHARINDEX function, the INSTR (Instring) function would be used. The INSTR function has an advantage over the T-SQL CHARINDEX function in that it has an extra parameter to retrieve the nth instance instead of just the first instance of the search string. This would simplify this code a bit for an Oracle implementation but the SQL would still be very similar to this.

Other approaches, as mentioned above, would be more complex and would often consist of significantly more SQL code. The above approach has the advantages of straightforwardness, simple to read and to understand, and should be maintainable by novice database developers.

However, I find this SQL approach to be sub-optimal. In writing C# or other code, we’d just make a new code function to do this for us. This new code function would encapsulate the desired functionality in hopefully a generic manner so we could readily reuse it elsewhere. While SQL Server has user defined functions, and this SQL could be encapsulated into one, we can use the very useful SQL Server CLR functionality to do exactly what we need. This results in reduced SQL code (and possible errors), having to tailor the SQL to exactly the number of parameters that we want to extract, make the simplified SQL easily readable and understandable, and also gives us a readily reusable new SQL function.

The Solution

SQL Server has a very useful feature called SQL CLR (Common Language Runtime) functions which allow a developer to write C# code to create a custom function. There are over a dozen CLR articles here on CodeProject so I won’t belabor or repeat their step-by-step content. I will just show you the simple code and scripts that can be used to implement your own SQL CLR function. And from this simple model, you can build other SQL CLR functions to solve problems and to optimize code (and performance) in the data transformation challenges that you encounter.

First, I decided on how I wanted the SQL CLR function to operate. Using the Oracle INSTR function as a model, I decided upon a generically useful function using three parameters.

Syntax is:

SubStringInstance ( Input_String, Delimiter_Char, Instance_Number_To_Return )

C# Code

Here is the C# code:

C#
/////////////////////////////////////////////////////////////////////////////////
// SubStringInstance - Returns the Nth instance of a substring in a delimited
//                     string.  Similar to the Oracle INSTR function.
//
// Grant Anderson
// 12/12/2013
//
// Note:  To install this you have to add the DLL as an assembly to the database
//        and run the following SQL so that you create a SQL function wrapper for
//        it with the proper parameter and return types (Note the special para-
//        meter and return type designations here that need to match the code!).
//
//	CREATE Function SubStringInstance(@InputString NVARCHAR(4000), @Delimiter NCHAR(1), 
//		@InstanceNumber INTEGER )
// 	RETURNS NVARCHAR(4000) 
// 	EXTERNAL NAME SubStringInstance.
//  [SubStringInstance_CLR_Functions.SubStringFunctions].SubStringInstance
//
/////////////////////////////////////////////////////////////////////////////////
using System;
using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
namespace SubStringInstance_CLR_Functions
{
	public class SubStringFunctions
	{
		[Microsoft.SqlServer.Server.SqlFunction]
		public static SqlString SubStringInstance
			(string InputString, char Delimiter, int InstanceNumber)
		{
			string returnString = string.Empty;
			string[] words = InputString.Split(Delimiter);
			int count = 0;
			foreach (string word in words)
			{
				count++;
				if (count == InstanceNumber)
				{
					returnString = word;
					break;
				}
			}
			return new SqlString(returnString);
		}
	}
}

You will need to compile this into a DLL (and not an executable).

This is very simple code. There are only a few differences from a regular C# program:

  1. The two special Usings System.Data.SqlTypes and Microsoft.SqlServer.Server
  2. The [Microsoft.SqlServer.Server.SqlFunction] declaration
  3. The SqlString data type which is required to interface with SQL Server

The code simply uses the Split() function to chop up the input string into its component sub-strings using the delimiter character and puts them into a string array. It then iterates through that string array to find and return the nth instance. Simple and to the point. And it automatically adapts to differences and variations in the length of the sub-string elements.

Note that the project solution here is set for a 64-bit for a 64-bit SQL Server.

SQL Scripts

You will need a SQL script to create a User Defined Function (UDF) wrapper in SQL Server in order to use SubStringInstance in a database:

SQL
CREATE Function SubStringInstance(@InputString NVARCHAR(4000), _
@Delimiter NCHAR(1), @InstanceNumber INTEGER )
RETURNS NVARCHAR(4000) 
EXTERNAL NAME SubStringInstance._
[SubStringInstance_CLR_Functions.SubStringFunctions].SubStringInstance

Note the special parameter and return type designations here that need to match the code!

Deployment

Here are the steps to deploy and install the SubStringInstance DLL to a SQL Server:

Step 1 - Check that CLR is enabled on the server by running this script in SQL Management Studio:

SQL
SELECT *
FROM sys.configurations
WHERE name = 'clr enabled';

A value of zero means it is not enabled. A value of one means that it is enabled.

If it is not enabled, then enable CLR on the SQL Server by using these commands:

SQL
SP_CONFIGURE 'clr enabled', 1
GO
RECONFIGURE
GO

Step 2 - Copy the SubStringInstance.dll file to the server.

Step 3 - Add the assembly to the SQL Server database via the CREATE ASSEMBLY command.

SQL
CREATE ASSEMBLY SubStringInstance 
FROM 'C:\SubStringInstance.dll'
WITH PERMISSION_SET = SAFE;

Note: The assembly must be on the server file system for this command to work.

Step 4 - Create the assembly User Defined Function (UDF).

SQL
CREATE Function SubStringInstance(@InputString NVARCHAR(4000), _
@Delimiter NCHAR(1), @InstanceNumber INTEGER )
RETURNS NVARCHAR(4000) 
EXTERNAL NAME SubStringInstance.[SubStringInstance_CLR_Functions.SubStringFunctions].SubStringInstance

Step 5 - The UDF CLR assembly is now ready to use.

Test it with something like this:

SQL
select dbo.SubStringInstance('MNS01_RNS07_CNS-ATL_SNS12', '_', 2)

Returns:

CNS-ATL

Note: You must run the above SQL from the same database in which the UDF function resides or use a fully qualify naming for the location of the SubStringInstance UDF function.

Improvement

Our previous complex SQL is now transformed into a compact and easy to use form by using the SubStringInstance CLR DLL and UDF.

SQL
DECLARE @Server_Key varchar(50);
SET @Server_Key = 'MNS01_RNS07_CNS-ATL_SNS12';
SELECT @Server_Key;
SELECT dbo.SubStringInstance(@Server_Key, '_', 1) as MNS;
SELECT dbo.SubStringInstance(@Server_Key, '_', 2) as RNS;
SELECT dbo.SubStringInstance(@Server_Key, '_', 3) as CNS;
SELECT dbo.SubStringInstance(@Server_Key, '_', 4) as SNS;

Other Uses

SQL CLR functions can be used for a wide variety of uses which augments and extends regular SQL functionality. You can easily use the techniques explained here and in other SQL CLR articles here on CodeProject to find other uses for SQL CLR and make your SQL more functional, more robust, and more maintainable.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Architect AT&T
United States United States
Grant is a multi-dimensional individual that counts amongst his talents architecture, design, development, and training for software, databases, data warehouses, and OLAP cubes using Microsoft .NET and C# and the Microsoft BI stack. Grant has a Master's degree in Transpersonal Psychology and is currently writing a book on Software Psychology which explains that how we think determines software success (or failure) more than tools, technology, or developer skills.

Comments and Discussions

 
QuestionArray indexer Pin
Richard Deeming16-Dec-13 2:52
mveRichard Deeming16-Dec-13 2:52 
AnswerRe: Array indexer Pin
C Grant Anderson17-Dec-13 5:49
professionalC Grant Anderson17-Dec-13 5:49 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.