SSE accelerated case-insensitive substring search

anlarke

4.89/5 (14 votes)

May 13, 2012

CPOL

7 min read

29977

285

Another take on the classic, now using SSE instructions

Download source - 2.6 KB

Introduction

I have always wondered why there is a strstr function available in C/C++ (provided by the CRT library), but not the expected case-insensitive stristr function (or in wide-character set fashion: wcsistr).

Some searching on the web yields enough results though, but most of them poorly written and usually very slow. The most basic implementations, which give correct results, first convert both strings to lower-case (or upper-case) and then perform a regular substring search. Unfortunately the memory allocation and the case conversion routine make them painfully slow. Some of the faster alternatives use assembly code, like this one by Ralph Walden. Unfortunately it cannot be used in 64-bit builds because of the inline assembly code and cannot account for a changing locale. Other approaches follow an entirely different path and try to accelerate searches by using an advanced string search algorithm, like Knuth–Morris–Pratt (used in the GNU C library nowadays) or Boyer–Moore. However, such an algorithm is error prone and may cause a significant overhead when searching small strings.

Background

A basic implementation

Let's first write a function which performs reasonably well without optimization and gives correct results with regard to unicode characters and the selected locale.

As a starting point we'll look up the regular substring search function in the Microsoft CRT library. The file we're looking for is probably located on your computer at C:\Program Files\Microsoft Visual Studio XX\VC\crt\src\intel\wcsstr.c. The function is split into three major parts: the first branch executes when SSE4.2 is detected, the second for SSE2 and the last part when none of these is present.

Take a look at the last part of the function, which is a very basic implementation of the regular substring search in plain C. Remember: wcs1 is the string to be searched, wcs2 is the substring that is being searched for. With some slight modifications to the code we can obtain a case-insensitive version.

The first character of the substring is converted to both lower and upper case and its values are stored in l and u. The outer for-loop traverses through the string and compares the characters to l and u. If the character matches, the rest of the substring is compared by converting the characters of both string and substring to lower case on the fly in the inner while-loop.

const wchar_t* wcsistr(const wchar_t *wcs1, const wchar_t *wcs2)
{
    const wchar_t *s1, *s2;
    const wchar_t l = towlower(*wcs2);
    const wchar_t u = towupper(*wcs2);
    
    if (!*wcs2)
        return wcs1; // an empty substring matches everything
    
    for (; *wcs1; ++wcs1)
    {
        if (*wcs1 == l || *wcs1 == u)
        {
            s1 = wcs1 + 1;
            s2 = wcs2 + 1;
            
            while (*s1 && *s2 && towlower(*s1) == towlower(*s2))
                ++s1, ++s2;
            
            if (!*s2)
                return wcs1;
        }
    }
 
    return NULL;
}

So there we have it, an already quite usable case-insensitive substring search function. Although the function performs quite well, it is still about 8~10 times slower than it's regular counterpart wcsstr. It becomes clear strong optimization is required.

Important note: The case conversion of the characters is done using the functions towlower and towupper. For these functions (and other CRT functions involving case conversions, like _wcsicmp) to work properly with unicode characters, you should set the locale using _wsetlocale. You can set the default system locale using:

_wsetlocale(LC_ALL, L"");

Cached case conversions

Closer inspection of the function makes it clear the functions towlower and towupper are critical for performance, unfortunately they are rather slow. Luckily the issue can be resolved by using a caching mechanism. A fair assumption is that most text processed by the search function will likely be of a western language, which basically means only character codes in the range 0-255 are used extensively.

Elaborating this idea further leads to the introduction of three new map-functions, which should replace their non-map counterparts. You are now required to call the _wsetlocale_map function at the start of your program, because in addition to setting the locale it will initialize the cache.

wchar_t g_mapToLower[256];
wchar_t g_mapToUpper[256];
 
wchar_t* _wsetlocale_map(int category, const wchar_t *locale)
{
	wchar_t *ret = _wsetlocale(category, locale);
	wchar_t c;
	
	for (c = 0; c < 256; ++c)
	{
		g_mapToLower[c] = towlower(c);
		g_mapToUpper[c] = towupper(c);
	}
 
	return ret;
}
 
wchar_t towlower_map(wchar_t c)
	{return (c < 256 ? g_mapToLower[c] : towlower(c));}
 
wchar_t towupper_map(wchar_t c)
	{return (c < 256 ? g_mapToUpper[c] : towupper(c));}

By using this caching mechanism the speed improved drastically, the case-insensitive search is now about 3.1~3.4 times slower than wcsstr (in case of a western language). Further optimization should be sought outside the realm of the C/C++ language.

Detecting SSE availability

Before we can use the SSE2 instructions we must make sure it is available on the system, this can be done easily using the __cpuid intrinsic. This should only be done once and since the locale will generally not change that frequently we modify the _wsetlocale_map function to avoid having an additional initialization function.

#define __ISA_AVAILABLE_X86     0
#define __ISA_AVAILABLE_SSE2    1
#define __ISA_AVAILABLE_SSE42   2

int __isa_available = __ISA_AVAILABLE_X86; 
 
wchar_t* _wsetlocale_map(int category, const wchar_t *locale)
{
    wchar_t *ret = _wsetlocale(category, locale);
    wchar_t c;
    int CPUInfo[4];
    
    for (c = 0; c < 256; ++c)
    {
        g_mapToLower[c] = towlower(c);
        g_mapToUpper[c] = towupper(c);
    }
 
    __cpuid(CPUInfo, 1);
    if (CPUInfo[2] & (1 << 20))
        __isa_available = __ISA_AVAILABLE_SSE42;
    else if (CPUInfo[3] & (1 << 26))
        __isa_available = __ISA_AVAILABLE_SSE2;
    
    return ret;
}

Optimizing the outer loop using SSE2 instructions

To optimize the testing of characters in the outer for-loop we first load three XMM registers. Each register has 8 available places for a character (128 bit register / 16 bit character = 8 places). The first register is loaded with l, the second with u and the third with zeros (the string terminating null character).

The string is now traversed in chunks of 8 characters, in each of these chunks we can quickly detect whether an interesting character is present by generating a mask. The provided comments in the following function should outline the procedure.

//
// These macros come from the Microsoft CRT library (look up wcsstr.c)
//

#define XMM_SIZE sizeof(__m128i)
#define XMM_ALIGNED(p) (0 == ((XMM_SIZE-1) & (intptr_t)(p)))
#define XMM_CHARS (XMM_SIZE / sizeof(wchar_t))
 
#define PAGE_SIZE ((intptr_t)0x1000)
#define PAGE_OFFSET(p) ((PAGE_SIZE-1) & (intptr_t)(p))
#define XMM_PAGE_SAFE(p) (PAGE_OFFSET(p) <= (PAGE_SIZE - XMM_SIZE))
 
//
// The case-insensitive substring search function
//

const wchar_t* wcsistr(const wchar_t *wcs1, const wchar_t *wcs2)
{
    const wchar_t *s1, *s2;
    const wchar_t l = towlower_map(*wcs2);
    const wchar_t u = towupper_map(*wcs2);

    if (!*wcs2)
        return wcs1; // an empty substring matches everything

    if (__isa_available >= __ISA_AVAILABLE_SSE2)
    {
        __m128i patl, patu, zero, tmp1, tmp2, tmp3;
        int mask;
        unsigned long offset;

        // fill one XMM register completely with l, one with u and another with zeros

        patl = _mm_cvtsi32_si128(l);
        patl = _mm_shufflelo_epi16(patl, 0);
        patl = _mm_shuffle_epi32(patl, 0);
        patu = _mm_cvtsi32_si128(u);
        patu = _mm_shufflelo_epi16(patu, 0);
        patu = _mm_shuffle_epi32(patu, 0);
        zero = _mm_setzero_si128();

        // loop through string and try to match the first character

        for (;;)
        {
            // if page safe compare using the XMM registers

            if (XMM_PAGE_SAFE(wcs1))
            {
                // perform the comparison (also checks for end of string)

                tmp1 = _mm_loadu_si128((__m128i*)wcs1);  // load chunk of 8 characters at this position
                tmp2 = _mm_cmpeq_epi16(tmp1, patl);      // compare against lower case pattern, store mask
                tmp3 = _mm_cmpeq_epi16(tmp1, patu);      // compare against upper case pattern, store mask
                tmp2 = _mm_or_si128(tmp2, tmp3);         // combine both masks into tmp2
                tmp3 = _mm_cmpeq_epi16(tmp1, zero);      // compare against null character, store mask
                tmp2 = _mm_or_si128(tmp2, tmp3);         // combine both masks into tmp2
                mask = _mm_movemask_epi8(tmp2);          // convert to 32-bit mask

                // if no match found, continue with next chunk

                if (mask == 0)
                {
                    wcs1 += XMM_CHARS;
                    continue;
                }

                // advance string pointer to position of l, u or null

                _BitScanForward(&offset, mask);
                wcs1 += offset / sizeof(wchar_t);
            }

            // if at the end of string, no match found

            if (!*wcs1)
                return NULL;

            // if first character matches, compare against the full substring
            
            if (*wcs1 == l || *wcs1 == u)
            {
                s1 = wcs1 + 1;
                s2 = wcs2 + 1;

                while (*s1 && *s2 && towlower_map(*s1) == towlower_map(*s2))
                    ++s1, ++s2;

                if (!*s2)
                    return wcs1;
            }

            ++wcs1;
        }
    }
    else
    {
        for (; *wcs1; ++wcs1)
        {
            if (*wcs1 == l || *wcs1 == u)
            {
                s1 = wcs1 + 1;
                s2 = wcs2 + 1;
                
                while (*s1 && *s2 && towlower_map(*s1) == towlower_map(*s2))
                    ++s1, ++s2;
            
                if (!*s2)
                    return wcs1;
            }
        }
    }

    return NULL;
}

We have again improved speed, this time performing about 2.7~2.9 times slower than a regular substring search.

Optimizing the inner loop using SSE2

The last improvement involves optimization of the inner loop, the matching of the remainder of the substring. Although comparison of the entire substring to the current string position can be done using XMM registers by creating masks it involves continuous loading of characters. This will not result in a performance gain, for that reason we keep the original while-loop.

However most of the time is wasted in this expensive while-loop: comparing the second character, the third character maybe even the fourth character but then the match might fail at the fifth. In a large text there are many occurrences of the first character of the substring, still a lot occurrences of the first two characters in sequence, but less of the first three and even a lot less of the first four. I think you can see where this is going, after matching the first character in the outer loop we're going to test directly if the first 8 characters match the substring.

To do so we will cache the first 8 characters of the substring, one XMM register will hold the lower case version, the second register will hold the upper case version. When comparing the against the target string the combined mask should have a set bit (1) at every position (except in the case the substring is smaller than 8 characters, that is what the cache_mask variable is for). Putting this idea into practice delivers the final version of the function.

const wchar_t* wcsistr(const wchar_t *wcs1, const wchar_t *wcs2)
{
    const wchar_t *s1, *s2;
    const wchar_t l = towlower_map(*wcs2);
    const wchar_t u = towupper_map(*wcs2);

    if (!*wcs2)
        return wcs1; // an empty substring matches everything

    if (__isa_available >= __ISA_AVAILABLE_SSE2)
    {
        __m128i patl, patu, zero, tmp1, tmp2, tmp3, cachel, cacheu;
        int mask, i, cache_mask;
        unsigned long offset;

        // fill one XMM register completely with l, one with u and another with zeros

        patl = _mm_cvtsi32_si128(l);
        patl = _mm_shufflelo_epi16(patl, 0);
        patl = _mm_shuffle_epi32(patl, 0);
        patu = _mm_cvtsi32_si128(u);
        patu = _mm_shufflelo_epi16(patu, 0);
        patu = _mm_shuffle_epi32(patu, 0);
        zero = _mm_setzero_si128();

        // convert first 8 characters of substring to lower and upper case while putting
        // them in XMM registers

        cachel = _mm_setzero_si128();
        cacheu = _mm_setzero_si128();
        cache_mask = 0;
        s2 = wcs2;

        for (i = 0; i < XMM_CHARS; ++i)
        {
            cachel = _mm_srli_si128(cachel, sizeof(wchar_t)); // shift by one character
            cacheu = _mm_srli_si128(cacheu, sizeof(wchar_t));
            
            cachel = _mm_insert_epi16(cachel, towlower_map(*s2), XMM_CHARS-1); // insert character
            cacheu = _mm_insert_epi16(cacheu, towupper_map(*s2), XMM_CHARS-1);

            if (*s2)
            {
                cache_mask |= 3 << (i << 1); // decimal 3 is binary 11, set two bits for each character
                ++s2;
            }
        }

        // loop through string and try to match the first character

        for (;;)
        {
            // if page safe compare using the XMM registers

            if (XMM_PAGE_SAFE(wcs1))
            {
                // perform the comparison (also checks for end of string)

                tmp1 = _mm_loadu_si128((__m128i*)wcs1);  // load chunk of 8 characters at this position
                tmp2 = _mm_cmpeq_epi16(tmp1, patl);      // compare against lower case pattern, store mask
                tmp3 = _mm_cmpeq_epi16(tmp1, patu);      // compare against upper case pattern, store mask
                tmp2 = _mm_or_si128(tmp2, tmp3);         // combine both masks into tmp2
                tmp3 = _mm_cmpeq_epi16(tmp1, zero);      // compare against null character, store mask
                tmp2 = _mm_or_si128(tmp2, tmp3);         // combine both masks into tmp2
                mask = _mm_movemask_epi8(tmp2);          // convert to 32-bit mask

                // if no match found, continue with next chunk

                if (mask == 0)
                {
                    wcs1 += XMM_CHARS;
                    continue;
                }

                // advance string pointer to position of l, u or null

                _BitScanForward(&offset, mask);
                wcs1 += offset / sizeof(wchar_t);

                // if not at end of string and page safe, quickly check whether the first chunk
                // matches the substring

                if (*wcs1 && XMM_PAGE_SAFE(wcs1))
                {
                    tmp1 = _mm_loadu_si128((__m128i*)wcs1);
                    tmp2 = _mm_cmpeq_epi16(tmp1, cachel);
                    tmp3 = _mm_cmpeq_epi16(tmp1, cacheu);
                    tmp2 = _mm_or_si128(tmp2, tmp3);
                    mask = _mm_movemask_epi8(tmp2);

                    if (cache_mask == 0xFFFF) // only first part of substring in cache
                    {
                        if (mask == cache_mask)
                        {
                            s1 = wcs1 + XMM_CHARS;
                            s2 = wcs2 + XMM_CHARS;

                            while (*s1 && *s2 && towlower_map(*s1) == towlower_map(*s2))
                                ++s1, ++s2;

                            if (!*s2)
                                return wcs1;
                        }
                    }
                    else // full substring is in cache
                    {
                        if ((mask & cache_mask) == cache_mask)
                            return wcs1;
                    }

                    // no full match found, try next character in string

                    ++wcs1;
                    continue; 
                }
            }

            // if at the end of string, no match found

            if (!*wcs1)
                return NULL;

            // if first character matches, compare against the full substring
            
            if (*wcs1 == l || *wcs1 == u)
            {
                s1 = wcs1 + 1;
                s2 = wcs2 + 1;

                while (*s1 && *s2 && towlower_map(*s1) == towlower_map(*s2))
                    ++s1, ++s2;

                if (!*s2)
                    return wcs1;
            }

            ++wcs1;
        }
    }
    else
    {
        for (; *wcs1; ++wcs1)
        {
            if (*wcs1 == l || *wcs1 == u)
            {
                s1 = wcs1 + 1;
                s2 = wcs2 + 1;
                
                while (*s1 && *s2 && towlower_map(*s1) == towlower_map(*s2))
                    ++s1, ++s2;
            
                if (!*s2)
                    return wcs1;
            }
        }
    }

    return NULL;
}

Final result: a case-insensitive substring search is now about 2.2~2.4 times slower than a regular substring search.

Using the code

With this article I have attached a zip file which contains the C source file and its accompanying header file. Using this code in your project is very straightforward: just copy both files to your project directory and add them to your project in Visual Studio. To make use of the function you should notify the compiler of its existence and include the header file by:

#include "wcsistr.h"

Do not forget to set the current locale and initialize the case conversion cache by calling the newly introduced _wsetlocale_map function. The function should at least be called once and this is done preferably at program start.

_wsetlocale_map(LC_ALL, L"");

A typical case-insensitive substring search is now performed like:

const wchar_t *p = wcsistr(L"HeLLo, wOrLD!", L"world");

Points of Interest

Comparison of unicode strings involving complex characters require extra attention. In unicode characters are usually presented either as a single precomposed character or as a decomposed sequence of a base letter plus one or more non-spacing marks. If the representation of the string and substring is different the wcsistr function obviously fails, it is just comparing strings using on a 16-bit basis. Which means you need to normalize both strings with NormalizeString before using the wcsistr function. More information can be found here. But even then I highly doubt this will work in 100% of the cases, unicode is just very hard to get right.
Although SSE4.2 has specific instructions intended for string comparisons, the results I obtained using these instructions have been disappointing. At best the function still performed about 10~20% slower than the SSE2 accelerated version presented in this article.

I hope you have enjoyed this article. Please feel free to leave some feedback concerning errors, performance or other matters.

History

May 14, 2012 - Article published.