Click here to Skip to main content
15,119,514 members
Articles / General Programming / String
Tip/Trick
Posted 30 Jan 2015

Stats

8.9K views
85 downloads
5 bookmarked

Searching for shared motif between strings in C

Rate me:
Please Sign up or sign in to vote.
4.50/5 (2 votes)
30 Jan 2015CPOL3 min read
In this article, i consider the problem of identifying motifs in the biological sequence data sets. To solve this task, i present a new algorithm for finding patterns that just use <String.h> library in C compiler and a new way to avoid using arrays search.

Introduction

In this article, i consider the problem of identifying motifs in the biological sequence data sets. To solve this task, i present a new algorithm for finding patterns that just use <String.h> library in C compiler and a new way to avoid using arrays search.

Background

Finding motifs or repeated patterns in data is of wide scientific interest with many applications in genomic and proteomic analysis. The motif search problem abstracts many important problems in analysis of sequence data, where motifs are, for instance, biologically important patterns. For example, elucidating motifs in DNA sequences is a critical first step in understanding biological processes as basic as the RNA transcription. There, the motifs can be used to identify promoters, the regions in DNA that facilitate the transcription. Finding motifs can be equally crucial for analyzing interactions between viruses and cells or identification of disease-linked patterns. Discovery of motifs in music sequences, text, or time series data is a fundamental, general means of summarizing, mining and understanding large volumes of data. (Read More)

There is many ways and new algorithms for this topic but purposely in this article i just use <String.h> library in C language and i avoid using array search to find shared motid sub-string of two strings. this can help to code alghoritm in 50 lines or even less and also can reduce the percentage of error (especially in C compiler).

Also it can be a charming challenge!

Using the code

As we want to avoid using arrays so we need to use <String.h> library so we can include it to our project.

#include <string.h>

now we must search an string in another string. let char "s1" to be our first string and "s2" our second string and now searching for longest shared motif of s2 in s1. for example if s1 = "ACGTACGT" and s2 = "AACCGGTATA" then the longest shared motif can be "CG".

To do such a thing, our first string (s1) must lose its first char in each round of loop and our second string (s2) (also it lose its first char in every rounds of another loop) must be searched in our first string (s1) by strcmp() function which available in <string.h>.

also notice we must use a secondary loop for each string to counting and going forward. (because we want search each char of s2 in each char of s1).

Image 1

i use this peace of code for removing first char of each string to compare remaning string.

char *str = (char *)malloc(10);
strcpy(str, s1);
str = str + i;

and for better understanding i passed remaning strings to ss1 and sss1 which i define them at first (also for s2), and char shared_morif for final result.

//Defenition Area
char s1[100], ss1[100], sss1[100];
char s2[100], ss2[100], sss2[100];
char shared_morif[100] = "";

now complete code:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* run this program using the console pauser or add your own getch, system("pause") or input loop */
int main(int argc, char *argv[]) {
    //Defenition Area
    char s1[100], ss1[100], sss1[100];
    char s2[100], ss2[100], sss2[100];
    char shared_morif[100] = "";
    int i,j,k,l,result, ls1, ls2;
    scanf("%s", s1); scanf("%s", s2);
    ls1 = strlen(s1); ls2 = strlen(s2);
    //Main Algohritm
    for (i=0; i<ls1; i++) 
    {
        char *str = (char *)malloc(10);
        strcpy(str, s1);
        str = str + i;
        strcpy(ss1, str);
        for (j=0; j<ls2; j++)
        {
            char *str2 = (char *)malloc(10);
            strcpy(str2, s2);
            str2 = str2 + j;
            strcpy(ss2, str2);
            for (k=1; k<ls2 - j + 1; k++)
            {
                memset(sss2,0,strlen(sss2));
                strncpy(sss2, ss2, k);
                for (l=0; l<=strlen(ss1); l++)
                {
                    memset(sss1,0,strlen(sss1));
                    strncpy(sss1,ss1,l);
                    result = strcmp(sss1,sss2);
                    //printf("%s - %s | %s - %s\n",ss1,ss2,sss1,sss2); //FOR TRACE & TEST
                    if (result == 0)
                    {
                        if (strlen(shared_morif) < strlen(sss2))
                        {
                            memset(shared_morif,0,strlen(shared_morif));
                            strcpy(shared_morif, sss2);
                        }
                    }
                }
            }     
        }
    }
    printf("shared motif result is:\n"); puts(shared_morif);
    return 0;
}

note that in order to making strings null and clear i prefer to use memset() function.

memset(shared_morif,0,strlen(shared_morif));

at end i want mentioned to some notes when use string.h in c language:

  1. strcmp() returns 0 if the two strings are exactly the same.
  2. You cannot simply assign strings like (s1=("null") or sth like this) in C but you can use strcpy() for make string null.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Amir Mohammad Nasrollahi
Iran (Islamic Republic of) Iran (Islamic Republic of)
No Biography provided

Comments and Discussions

 
SuggestionGood but consider speed improvements Pin
Sanmayce23-Feb-15 8:34
MemberSanmayce23-Feb-15 8:34 
Hi Amir, it is good to see fellow member wrestling with texts, I myself still struggle with text searches.

As you know every approach/etude has its merits, currently you use malloc() and strcpy(), consider to get rid of them - they are speed breaks.

You may try your current code versus the plain strstr() one, compilers use SSE2, so searches for short needles (2,3 in particular) are superfast.
I expect the naive variant using memmem(), in fact strstr() is not useful, to be much faster than yours.

Have you tried simplified (only deletion no replace and no insertion) Levenshtein Distance approach?

Below you can use my old code to write such searcher:

C++
// Galadriel revision 1+++, copyleft Sanmayce 2013-Jan-21.
// An x-gram suggester using Levenshtein Distance.
// Fastest structureless fuzzy string searching for obtaining Levenshtein Distance?!
// Please give me a buzz (sanmayce@sanmayce.com) if you find faster implementation.
// Free download at www.sanmayce.com/Downloads/_Kaze_Levenshtein_Galadriel.zip
// Note: Still not tested under Linux, should work.
// Enfun!

// TESTspeed.bat:
// @echo Results (2 hits) on my 'Bonboniera' laptop Core 2 T7500 2200MHz:
// @echo Revision 1 works at 251,922 xgrams/s, 'SUCH A SHAME', a lovely old song!
// @echo Revision 1+ works at 1,007,691 xgrams/s, still shamy.
// @echo Revision 1++ works at 2,519,228 xgrams/s, a very poor result.
// @echo Revision 1+++ works at 5,038,456 xgrams/s, in a heavy test, 444% faster than r.1++, damage undone.
// Galadriel.exe 52 psychad----__________qwertyuiop---------------------------licize googlebooks-eng-all-1gram-20120701_5038456_words.wrd
// type Galadriel.txt

#define _WIN32_ENVIRONMENT_
//#define _POSIX_ENVIRONMENT_
#define MaxLineLength 128
#define KAZE_tolower(c) ( (((c) >= 'A') && ((c) <= 'Z')) ? ((c) - 'A' + 'a') : (c) )
#define KAZE_toupper(c) ( (((c) >= 'a') && ((c) <= 'z')) ? ((c) - 'a' + 'A') : (c) )
//#define _WIN32ASM_

#if defined(_WIN32ASM_)
// Optimizing subroutines in assembly language
// An optimization guide for x86 platforms
// By Agner Fog. Copenhagen University College of Engineering.
/*
It is possible to calculate the absolute value of a signed integer without branching:
; Example 9.15, Calculate absolute value of eax
	cdq ; Copy sign bit of eax to all bits of edx
	xor eax, edx ; Invert all bits if negative
	sub eax, edx ; Add 1 if negative
The following example finds the minimum of two unsigned numbers: if (b > a) b = a;
; Example 9.16a, Find minimum of eax and ebx (unsigned):
	sub eax, ebx ; = a-b
	sbb edx, edx ; = (b > a) ? 0xFFFFFFFF : 0
	and edx, eax ; = (b > a) ? a-b : 0
	add ebx, edx ; Result is in ebx
*/
// Sadly MS 64bit compiler accepts not the inline ASM: error C4235: nonstandard extension used : '__asm' keyword not supported on this architecture
// MASM style inline assembly, 32 bit mode
unsigned int abs_AF (int n) {
__asm {
	mov eax, n 	// Move n to eax
			// abs(n) is calculated by inverting all bits and adding 1 if n < 0:
	cdq 		// Get sign bit into all bits of edx
	xor eax, edx 	// Invert bits if negative
	sub eax, edx 	// Add 1 if negative. Now eax = abs(n)
} 			// Return value is in eax
}
unsigned int min_AF (int a, int b, int c) {
__asm {
	mov eax, a 	// Move a to eax
	mov ebx, b 	// Move b to ebx

	sub eax, ebx ; = a-b
	sbb edx, edx ; = (b > a) ? 0xFFFFFFFF : 0
	and edx, eax ; = (b > a) ? a-b : 0
	add ebx, edx ; Result is in ebx

	mov eax, c 	// Move c to eax

	sub eax, ebx ; = a-b
	sbb edx, edx ; = (b > a) ? 0xFFFFFFFF : 0
	and edx, eax ; = (b > a) ? a-b : 0
	add ebx, edx ; Result is in ebx

	mov eax, ebx ; Return value is in eax
}
}
#endif

long KAZE_strlenLF (const char * str)
{
        const char *eos = str;
        char LFa[1];
        LFa[0] = 10; //BUG UNcrushed yet: for Windows 13 for POSIX 10
        while( *eos++ != LFa[0] ) ;

        return( (int)(eos - str - 1) );
}

void x64toaKAZE (      /* stdcall is faster and smaller... Might as well use it for the helper. */
        unsigned long long val,
        char *buf,
        unsigned radix,
        int is_neg
        )
{
        char *p;                /* pointer to traverse string */
        char *firstdig;         /* pointer to first digit */
        char temp;              /* temp char */
        unsigned digval;        /* value of digit */

        p = buf;

        if ( is_neg )
        {
            *p++ = '-';         /* negative, so output '-' and negate */
            val = (unsigned long long)(-(long long)val);
        }

        firstdig = p;           /* save pointer to first digit */

        do {
            digval = (unsigned) (val % radix);
            val /= radix;       /* get next digit */

            /* convert to ascii and store */
            if (digval > 9)
                *p++ = (char) (digval - 10 + 'a');  /* a letter */
            else
                *p++ = (char) (digval + '0');       /* a digit */
        } while (val > 0);

        /* We now have the digit of the number in the buffer, but in reverse
           order.  Thus we reverse them now. */

        *p-- = '\0';            /* terminate string; p points to last digit */

        do {
            temp = *p;
            *p = *firstdig;
            *firstdig = temp;   /* swap *p and *firstdig */
            --p;
            ++firstdig;         /* advance to next two digits */
        } while (firstdig < p); /* repeat until halfway */
}

/* Actual functions just call conversion helper with neg flag set correctly,
   and return pointer to buffer. */

char * _i64toaKAZE (
        long long val,
        char *buf,
        int radix
        )
{
        x64toaKAZE((unsigned long long)val, buf, radix, (radix == 10 && val < 0));
        return buf;
}

char * _ui64toaKAZE (
        unsigned long long val,
        char *buf,
        int radix
        )
{
        x64toaKAZE(val, buf, radix, 0);
        return buf;
}

char * _ui64toaKAZEzerocomma (
        unsigned long long val,
        char *buf,
        int radix
        )
{
                        char *p;
                        char temp;
                        int txpman;
                        int pxnman;
        x64toaKAZE(val, buf, radix, 0);
                        p = buf;
                        do {
                        } while (*++p != '\0');
                        p--; // p points to last digit
                             // buf points to first digit
                        buf[26] = 0;
                        txpman = 1;
                        pxnman = 0;
                        do
                        { if (buf <= p)
                          { temp = *p;
                            buf[26-txpman] = temp; pxnman++;
                            p--;
                            if (pxnman % 3 == 0)
                            { txpman++;
                              buf[26-txpman] = (char) (',');
                            }
                          }
                          else
                          { buf[26-txpman] = (char) ('0'); pxnman++;
                            if (pxnman % 3 == 0)
                            { txpman++;
                              buf[26-txpman] = (char) (',');
                            }
                          }
                          txpman++;
                        } while (txpman <= 26);
        return buf;
}

char * _ui64toaKAZEcomma (
        unsigned long long val,
        char *buf,
        int radix
        )
{
                        char *p;
                        char temp;
                        int txpman;
                        int pxnman;
        x64toaKAZE(val, buf, radix, 0);
                        p = buf;
                        do {
                        } while (*++p != '\0');
                        p--; // p points to last digit
                             // buf points to first digit
                        buf[26] = 0;
                        txpman = 1;
                        pxnman = 0;
                        while (buf <= p)
                        { temp = *p;
                          buf[26-txpman] = temp; pxnman++;
                          p--;
                          if (pxnman % 3 == 0 && buf <= p)
                          { txpman++;
                            buf[26-txpman] = (char) (',');
                          }
                          txpman++;
                        } 
        return buf+26-(txpman-1);
}

// Wagner–Fischer algorithm
// From Wikipedia, the free encyclopedia
/*
 int LevenshteinDistance(char s[1..m], char t[1..n])
 {
   // for all i and j, d[i,j] will hold the Levenshtein distance between
   // the first i characters of s and the first j characters of t;
   // note that d has (m+1)x(n+1) values
   declare int d[0..m, 0..n]
  
   for i from 0 to m
     d[i, 0] := i // the distance of any first string to an empty second string
   for j from 0 to n
     d[0, j] := j // the distance of any second string to an empty first string
  
   for j from 1 to n
   {
     for i from 1 to m
     {
       if s[i] = t[j] then  
         d[i, j] := d[i-1, j-1]       // no operation required
       else
         d[i, j] := minimum
                    (
                      d[i-1, j] + 1,  // a deletion
                      d[i, j-1] + 1,  // an insertion
                      d[i-1, j-1] + 1 // a substitution
                    )
     }
   }
  
   return d[m,n]
 }
*/

#if defined(_WIN32_ENVIRONMENT_)
#include <io.h> // needed for Windows' 'lseeki64' and 'telli64'
//Above line must be commented in order to compile with Intel C compiler: an error "can't find io.h" occurs.
#else
#endif /* defined(_WIN32_ENVIRONMENT_)  */

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

int Levenshtein[MaxLineLength+1][MaxLineLength+1]; // declare int d[0..m, 0..n]
#define MIN(x,y) ((x) < (y) ? (x) : (y))
#define MAX(x,y) ((x) > (y) ? (x) : (y))
int main(int argc, char **argv) {

FILE *fp_inLINE;
FILE *fp_outLINE;

unsigned char workK[1024*32];
signed int workKoffset = -1;

unsigned long long FilesLEN;
unsigned long long k, k2, k3;
unsigned int LINE10len, wrdlen;

#if defined(_WIN32_ENVIRONMENT_)
      unsigned long long size_inLINESIXFOUR;
#else
      size_t size_inLINESIXFOUR;
#endif /* defined(_WIN32_ENVIRONMENT_)  */

unsigned char wrd[MaxLineLength+1+1]; // crlf
unsigned char wrdCACHED[MaxLineLength+1+1]; // crlf
unsigned char workbyte;
time_t t1, t2, t3;
unsigned long long TotalLines=0;
unsigned long long WordsChecked=0;
unsigned long long DumpedLines=0;
unsigned int AtMostLevenshteinDistance;
unsigned int SkipHeuristic;
unsigned int StartingPosition;
char llTOaDigits[27]; // 9,223,372,036,854,775,807: 1(sign or carry)+19(digits)+1('\0')+6(,)
char llTOaDigits2[27]; // 9,223,372,036,854,775,807: 1(sign or carry)+19(digits)+1('\0')+6(,)
char llTOaDigits3[27]; // 9,223,372,036,854,775,807: 1(sign or carry)+19(digits)+1('\0')+6(,)

/*
Upper and lower bounds:
The Levenshtein distance has several simple upper and lower bounds that are useful in applications which compute many of them and compare them. These include:
- It is always at least the difference of the sizes of the two strings.
- It is at most the length of the longer string.
- It is zero if and only if the strings are identical.
- If the strings are the same size, the Hamming distance is an upper bound on the Levenshtein distance.
Possible improvements to this algorithm include:
- We can adapt the algorithm to use less space, O(m) instead of O(mn), since it only requires that the previous row and current row be stored at any one time.
- If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k+1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.[1]
*/

// A very good resource:
// http://shaunwagner.com/writings_computer_levenshtein.html

	int i,j,m,n;
	char s[] = "sitting";
	char t[] = "kitten";
	m = strlen(s);
	n = strlen(t);
// 		k 	i 	t 	t 	e 	n
// 	0 	1 	2 	3 	4 	5 	6
// s 	1 	1 	2 	3 	4 	5 	6
// i 	2 	2 	1 	2 	3 	4 	5
// t 	3 	3 	2 	1 	2 	3 	4
// t 	4 	4 	3 	2 	1 	2 	3
// i 	5 	5 	4 	3 	2 	2 	3
// n 	6 	6 	5 	4 	3 	3 	2
// g 	7 	7 	6 	5 	4 	4 	3

/*
	for(i=0;i<=m;i++)
		Levenshtein[i][0] = i;
	for(j=0;j<=n;j++)
		Levenshtein[0][j] = j;
	for (j=1;j<=n;j++) {
		for(i=1;i<=m;i++) {
			if(s[i-1] == t[j-1])
				Levenshtein[i][j] = Levenshtein[i-1][j-1];
			else
				Levenshtein[i][j] = MIN(MIN((Levenshtein[i-1][j]+1),(Levenshtein[i][j-1]+1)),(Levenshtein[i-1][j-1]+1));
		}
	}
	printf("Levenshtein Distance: %d\n", Levenshtein[m][n]);
	exit (0);
*/

	printf("Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 1+++, copyleft Sanmayce 2013-Jan-21.\n");	
	if (argc != 4) {
	printf("Usage: Galadriel AtMostLevenshteinDistance xgram xgramfile\n");	
	printf("Note: Longest xgram could be %d chars.\n", MaxLineLength);	
	printf("Example1: E:\\>Galadriel 0 ramjet MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd\n");	
	printf("Example2: E:\\>Galadriel 3 psychedlicize MASAKARI_General-Purpose_Grade_English_Wordlist_r3_316423_words.wrd\n");	
	} else {
	(void) time(&t1);
	AtMostLevenshteinDistance = atoi(argv[1]);
	n = strlen(argv[2]);
	if (n>MaxLineLength)
	{ printf( "Galadriel: Incoming xgram exceeding the limit.\n" ); return( 1 ); }

	memset (wrdCACHED, 0, MaxLineLength+1+1);
	for(i=0;i<=MaxLineLength;i++)
		Levenshtein[i][0] = i;
	for(j=0;j<=MaxLineLength;j++)
		Levenshtein[0][j] = j;

if( ( fp_inLINE = fopen( argv[3], "rb" ) ) == NULL )
{ printf( "Galadriel: Can't open %s file.\n", argv[3] ); return( 1 ); }

if( ( fp_outLINE = fopen( "Galadriel.txt", "wb" ) ) == NULL )
{ printf( "Galadriel: Can't open Galadriel.txt file.\n" ); return( 1 ); }

#if defined(_WIN32_ENVIRONMENT_)
   // 64bit:
_lseeki64( fileno(fp_inLINE), 0L, SEEK_END );
size_inLINESIXFOUR = _telli64( fileno(fp_inLINE) );
_lseeki64( fileno(fp_inLINE), 0L, SEEK_SET );
#else
   // 64bit:
fseeko( fp_inLINE, 0L, SEEK_END );
size_inLINESIXFOUR = ftello( fp_inLINE );
fseeko( fp_inLINE, 0L, SEEK_SET );
#endif /* defined(_WIN32_ENVIRONMENT_)  */

        //~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
	FilesLEN = size_inLINESIXFOUR;
	wrdlen = 0;
        for( k = 0; k < size_inLINESIXFOUR; k++ )
	{
                // ~~~~~~~~~~~~ Buffering fread, 10x faster [
                if (workKoffset == -1) {
                        if (k + 1024*32 < size_inLINESIXFOUR) {
                                fread( &workK[0], 1, 1024*32, fp_inLINE );
                                workKoffset = 0;
                                workbyte = workK[workKoffset];
                        } else 
                        fread( &workbyte, 1, 1, fp_inLINE );
                } else {
                        workKoffset++;
                        workbyte = workK[workKoffset];
                        if (workKoffset == 1024*32 - 1) workKoffset = -1;
                }
                // ~~~~~~~~~~~~ Buffering fread, 10x faster ]
                // ~~~~~~~~~~~~ UnBuffered fread, 10x slower [
                        //fread( &workbyte, 1, 1, fp_inLINE );
                // ~~~~~~~~~~~~ UnBuffered fread, 10x slower ]
                        if( wrdlen < MaxLineLength +1+1)
	                        { wrd[ wrdlen ] = workbyte; } // KAZE_tolower( workbyte ); 
                        if (workbyte == 10) {
				TotalLines++;
// Wildcard search [
                        if ( 0 < wrdlen && wrdlen < MaxLineLength +1+1)
			{
				wrd[ wrdlen ] = 0;
				if ( wrd[ wrdlen-1 ] == 13 ) //CR
					wrd[ --wrdlen ] = 0;

// A simple heuristic #1: Don't enter the nasty loops unless MaximumLevenshteinDistance >= ABS(m-n).
m = wrdlen; // strlen(wrd);
if (m>MaxLineLength)
{ printf( "Galadriel: Incoming xgram exceeding the limit.\n" ); return( 1 ); }

// 400: if (AtMostLevenshteinDistance >= abs_AF(m-n))
// ; Line 400
// 	mov	eax, ebx
// 	sub	eax, edi
// 	mov	DWORD PTR $T5736[esp+33112], eax
// 	mov	eax, DWORD PTR $T5736[esp+33112]
// 	cdq
// 	xor	eax, edx
// 	sub	eax, edx
// 	cmp	DWORD PTR _AtMostLevenshteinDistance$[esp+33112], eax

// 372: if (AtMostLevenshteinDistance >= MAX(m,n)-MIN(m,n))
// ; Line 372
// 	mov	eax, ebx
// 	cmp	ebx, edi
// 	jg	SHORT $LN43@main
// 	mov	eax, edi
// 	mov	ecx, ebx
// 	jl	SHORT $LN44@main
// $LN43@main:
// 	mov	ecx, edi
// $LN44@main:
// 	sub	eax, ecx
// 	cmp	DWORD PTR _AtMostLevenshteinDistance$[esp+33112], eax

#if defined(_WIN32ASM_)
if (AtMostLevenshteinDistance >= abs_AF(m-n))
#else
if (AtMostLevenshteinDistance >= MAX(m,n)-MIN(m,n)) // This is the only add-on for r.1+
#endif
{
// LD [

// A simple heuristic #3: Don't recalculate rows for identical leftmost chars we already have.
// Caution: heuristic #3 works in pair with heuristic #2 (the bail out position is less or equal to cached chars), whereas heuristic #1 is standalone.
// heuristic #3 in action:
// E:\_Kaze_Levenshtein_Galadriel>Galadriel.exe 9 0,000,001_psychedelized_???? 4andabove_Gamera.tar.2.sorted
// Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 1+++, copyleft Sanmayce 2013-Jan-21.
// Galadriel: Total/Checked/Dumped xgrams: 35,116,064/31,763,627/33
// Galadriel: Performance: 1,950,892 xgrams/s
StartingPosition = 1;
while (	wrdCACHED[StartingPosition-1] && wrdCACHED[StartingPosition-1]==wrd[StartingPosition-1]) // No need of && wrd[StartingPosition-1] 
	StartingPosition++;
// The bail out 'i' value (heuristic #2) affects our cached value here, 'StartingPosition' cannot be greater than 'i':
StartingPosition = MIN(StartingPosition, i);

//	for(i=0;i<=m;i++)
//		Levenshtein[i][0] = i;
//	for(j=0;j<=n;j++)
//		Levenshtein[0][j] = j;

// Printing Matrix [
// printf("\n%s", wrd);
// Printing Matrix ]

	WordsChecked++;
	SkipHeuristic=0;
	for(i=StartingPosition;i<=m;i++) { // StartingPosition is in range 1..
// Printing Matrix [
// printf("\n");
// Printing Matrix ]
		for (j=1;j<=n;j++) {
			if(wrd[i-1] == argv[2][j-1])
				Levenshtein[i][j] = Levenshtein[i-1][j-1];
			else
#if defined(_WIN32ASM_)
				Levenshtein[i][j] = min_AF(Levenshtein[i-1][j]+1, Levenshtein[i][j-1]+1, Levenshtein[i-1][j-1]+1); // Variant 1: 237,270 xgrams/s
#else
				Levenshtein[i][j] = MIN(MIN((Levenshtein[i-1][j]+1),(Levenshtein[i][j-1]+1)),(Levenshtein[i-1][j-1]+1)); // Variant 2: This MS code is faster than above jumpless code! // 358,327 xgrams/s
				//{Levenshtein[i][j] = MIN(MIN(Levenshtein[i-1][j],Levenshtein[i][j-1]),Levenshtein[i-1][j-1]); --Levenshtein[i][j];} // Variant 3: This compound line is much slower than above inc-inc-inc code! // 237,270 xgrams/s
#endif
// Printing Matrix [
// printf("%s ", _ui64toaKAZEzerocomma(Levenshtein[i][j], llTOaDigits, 10)+(26-2) );
// Printing Matrix ]
		}

// Seeing how Variant 1 and Variant 3 perform at one level (falling flat) it is obvious that Variant 2 walks on some edge!
// Disappointing performance for inline functions, the ij loop is so critical that code really matters! (358,327-237,270)/237,270*100% = 51%, a schocking deboost!
// For both runs below, heuristic #3 was NOT in action (when in action the gain is (1,950,892-358,327)/358,327*100% = 444%):

// E:\_Kaze_Levenshtein_Galadriel>Galadriel_pureC.exe 9 0,000,001_psychedelized_???? 4andabove_Gamera.tar.2.sorted
// Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 1+++, copyleft Sanmayce 2013-Jan-19.
// Galadriel: Total/Checked/Dumped xgrams: 35,116,064/31,763,627/33
// Galadriel: Performance: 358,327 xgrams/s
// 
// E:\_Kaze_Levenshtein_Galadriel>Galadriel.exe 9 0,000,001_psychedelized_???? 4andabove_Gamera.tar.2.sorted
// Galadriel, an x-gram suggesteress using Wagner-Fischer Levenshtein Distance, revision 1+++, copyleft Sanmayce 2013-Jan-19.
// Galadriel: Total/Checked/Dumped xgrams: 35,116,064/31,763,627/33
// Galadriel: Performance: 237,270 xgrams/s
// 
// E:\_Kaze_Levenshtein_Galadriel>dir 4andabove_Gamera.tar.2.sorted
//  Volume in drive E is SSD_Sanmayce
//  Volume Serial Number is 9CF6-FEA3
// 
//  Directory of E:\_Kaze_Levenshtein_Galadriel
// 
// 01/20/2013  11:59 PM       889,537,624 4andabove_Gamera.tar.2.sorted
//                1 File(s)    889,537,624 bytes
//                0 Dir(s)  36,250,083,328 bytes free
// 
// E:\_Kaze_Levenshtein_Galadriel>type Galadriel.txt
// 0,000,061       psychedelic_drug
// 0,000,011       psychedelics_to
// 0,000,011       psychedelic_s
// 0,000,009       psychedelics_is
// 0,000,009       psychedelic_vans
// 0,000,009       psychedelic_soma
// 0,000,009       psychedelic_dreams
// 0,000,009       psychedelic_as
// 0,000,009       psychedelic_age
// 0,000,008       psychedelics_have
// 0,000,008       psychedelic_dr
// 0,000,008       psychedelic_but
// 0,000,008       psychedelic_band
// 0,000,007       psychedelic_than
// 0,000,007       psychedelic_can
// 0,000,007       psychedelic_art
// 0,000,006       psychedelics_they
// 0,000,006       psychedelics_lsd
// 0,000,006       psychedelics_do
// 0,000,006       psychedelic_with
// 0,000,006       psychedelic_pop
// 0,000,006       psychedelic_mind
// 0,000,005       psychedelics_will
// 0,000,005       psychedelics_the
// 0,000,005       psychedelics_or
// 0,000,005       psychedelic_that
// 0,000,005       psychedelic_is
// 0,000,004       psychedelics_were
// 0,000,004       psychedelics_use
// 0,000,004       psychedelics_such
// 0,000,004       psychedelics_for
// 0,000,004       psychedelic_sh*t
// 0,000,004       psychedelic_it
// 
// E:\_Kaze_Levenshtein_Galadriel>

// ; 488  : 				Levenshtein[i][j] = min_AF(Levenshtein[i-1][j]+1, Levenshtein[i][j-1]+1, Levenshtein[i-1][j-1]+1);
// 
//   00598	8b 11		 mov	 edx, DWORD PTR [ecx]
//   0059a	8b 81 04 02 00
// 	00		 mov	 eax, DWORD PTR [ecx+516]
//   005a0	42		 inc	 edx
//   005a1	89 54 24 70	 mov	 DWORD PTR $T5746[esp+33116], edx
//   005a5	8b 54 24 38	 mov	 edx, DWORD PTR tv842[esp+33116]
//   005a9	40		 inc	 eax
//   005aa	03 d6		 add	 edx, esi
//   005ac	89 44 24 74	 mov	 DWORD PTR $T5745[esp+33116], eax
//   005b0	8b 04 95 fc fd
// 	ff ff		 mov	 eax, DWORD PTR _Levenshtein[edx*4-516]
//   005b7	40		 inc	 eax
//   005b8	89 44 24 6c	 mov	 DWORD PTR $T5744[esp+33116], eax
//   005bc	8b 44 24 6c	 mov	 eax, DWORD PTR $T5744[esp+33116]
//   005c0	8b 5c 24 74	 mov	 ebx, DWORD PTR $T5745[esp+33116]
//   005c4	2b c3		 sub	 eax, ebx
//   005c6	1b d2		 sbb	 edx, edx
//   005c8	23 d0		 and	 edx, eax
//   005ca	03 da		 add	 ebx, edx
//   005cc	8b 44 24 70	 mov	 eax, DWORD PTR $T5746[esp+33116]
//   005d0	2b c3		 sub	 eax, ebx
//   005d2	1b d2		 sbb	 edx, edx
//   005d4	23 d0		 and	 edx, eax
//   005d6	03 da		 add	 ebx, edx
//   005d8	8b c3		 mov	 eax, ebx
//   005da	8b 5c 24 78	 mov	 ebx, DWORD PTR tv643[esp+33116]
// $LN83@main:

// ; 487  : 				Levenshtein[i][j] = MIN(MIN((Levenshtein[i-1][j]+1),(Levenshtein[i][j-1]+1)),(Levenshtein[i-1][j-1]+1));
// 
//   0068a	8b 4c 24 38	 mov	 ecx, DWORD PTR tv943[esp+33112]
//   0068e	03 ce		 add	 ecx, esi
//   00690	8b 04 8d fc fd
// 	ff ff		 mov	 eax, DWORD PTR _Levenshtein[ecx*4-516]
//   00697	8b 0a		 mov	 ecx, DWORD PTR [edx]
//   00699	40		 inc	 eax
//   0069a	41		 inc	 ecx
//   0069b	8b e8		 mov	 ebp, eax
//   0069d	3b c1		 cmp	 eax, ecx
//   0069f	7c 02		 jl	 SHORT $LN42@main
//   006a1	8b e9		 mov	 ebp, ecx
// $LN42@main:
//   006a3	8b ba fc fd ff
// 	ff		 mov	 edi, DWORD PTR [edx-516]
//   006a9	47		 inc	 edi
//   006aa	3b ef		 cmp	 ebp, edi
//   006ac	7d 08		 jge	 SHORT $LN45@main
//   006ae	3b c1		 cmp	 eax, ecx
//   006b0	7c 06		 jl	 SHORT $LN46@main
//   006b2	8b c1		 mov	 eax, ecx
//   006b4	eb 02		 jmp	 SHORT $LN46@main
// $LN45@main:
//   006b6	8b c7		 mov	 eax, edi
// $LN46@main:

		// A simple heuristic #2: Discontinue the nasty vertical loop (i.e. m) when the LD in cell in the last column minus the remaining vertical cycles is greater than our MAX LD:
		if ( Levenshtein[i][n] - (m-i) >= 0 && AtMostLevenshteinDistance < Levenshtein[i][n] - (m-i) ) {SkipHeuristic=1; break;} // Caution: Levenshtein[i][n] can be less than (m-i), this changes nothing the logic is the same.
		//          C  C  C  C  C  C  C  C  C  C  C  C  C
		//          o  o  o  o  o  o  o  o  o  o  o  o  o
		//          l  l  l  l  l  l  l  l  l  l  l  l  l
		//          u  u  u  u  u  u  u  u  u  u  u  u  u
		//          m  m  m  m  m  m  m  m  m  m  m  m  m
		//          n  n  n  n  n  n  n  n  n  n  n  n  n
		//          0  0  0  0  0  0  0  0  0  1  1  1  1
		//          1  2  3  4  5  6  7  8  9  0  1  2  3
		//          p  s  y  c  h  e  d  l  i  c  i  z  e
		// p Row01: 00 01 02 03 04 05 06 07 08 09 10 11 12 !Caution! Here the last cell Levenshtein[i][n] - (m-i) = 12 - (14-1) = -1
		// s Row02: 01 00 01 02 03 04 05 06 07 08 09 10 11
		// y Row03: 02 01 00 01 02 03 04 05 06 07 08 09 10
		// c Row04: 03 02 01 00 01 02 03 04 05 06 07 08 09
		// h Row05: 04 03 02 01 00 01 02 03 04 05 06 07 08
		// e Row06: 05 04 03 02 01 00 01 02 03 04 05 06 07
		// d Row07: 06 05 04 03 02 01 00 01 02 03 04 05 06
		// e Row08: 07 06 05 04 03 02 01 01 02 03 04 05 05
		// l Row09: 08 07 06 05 04 03 02 01 02 03 04 05 06
		// i Row10: 09 08 07 06 05 04 03 02 01 02 03 04 05
		// c Row11: 10 09 08 07 06 05 04 03 02 01 02 03 04
		// i Row12: 11 10 09 08 07 06 05 04 03 02 01 02 03
		// z Row13: 12 11 10 09 08 07 06 05 04 03 02 01 02
		// e Row14: 13 12 11 10 09 08 07 06 05 04 03 02 01
	}
				if (SkipHeuristic==0 && AtMostLevenshteinDistance >= Levenshtein[m][n]) {
					//printf("%s\n", wrd);
					fprintf( fp_outLINE, "%s\n", wrd); DumpedLines++;
					if ((DumpedLines & 0xff) == 0xff)
						//printf( "Dumped lines i.e. hits so far: %s\r", _ui64toaKAZEcomma(DumpedLines, llTOaDigits, 10) ); 
						fflush(fp_outLINE); // Not sure: CTRL+C doesn't flush?!
				}
	// The bail out 'i' value (heuristic #2) affects our cached value here, 'i' is the needed one:
	memcpy(wrdCACHED, wrd, wrdlen+1); // +1 because we need the ASCII 000 termination
// LD ]
}
			}
// Wildcard search ]
				wrdlen = 0;
                        }
                        else wrdlen++;
        } // k 'for'
        //~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

	fclose(fp_inLINE);
	fclose(fp_outLINE);
	(void) time(&t3);
	if (t3 <= t1) {t3 = t1; t3++;}
				
	printf( "Galadriel: Total/Checked/Dumped xgrams: %s/%s/%s\n", _ui64toaKAZEcomma(TotalLines, llTOaDigits3, 10), _ui64toaKAZEcomma(WordsChecked, llTOaDigits2, 10), _ui64toaKAZEcomma(DumpedLines, llTOaDigits, 10) );
	printf( "Galadriel: Performance: %s xgrams/s\n", _ui64toaKAZEcomma((TotalLines)/((int) t3-t1), llTOaDigits, 10) );
	}
	exit (0);
}


Hope it can be of use to your attempts.

Edit:
In case of not having memmem(), Windows lacks it, here is mine:
C++
// All Railgun variants are written by Georgi 'Kaze', they are free, however I expect the user to mention its homepage, that is: http://www.sanmayce.com/Railgun/index.html
// Author's email: sanmayce@sanmayce.com
// Caution: For better speed the case 'if (cbPattern==1)' was removed, so Pattern must be longer than 1 char.
char * Railgun_Doublet (char * pbTarget, char * pbPattern, uint32_t cbTarget, uint32_t cbPattern)
{
	char * pbTargetMax = pbTarget + cbTarget;
	register uint32_t ulHashPattern;
	uint32_t ulHashTarget, count, countSTATIC;

	if (cbPattern > cbTarget) return(NULL);

	countSTATIC = cbPattern-2;

	pbTarget = pbTarget+cbPattern;
	ulHashPattern = (*(uint16_t *)(pbPattern));

	for ( ;; ) {
		if ( ulHashPattern == (*(uint16_t *)(pbTarget-cbPattern)) ) {
			count = countSTATIC;
			while ( count && *(char *)(pbPattern+2+(countSTATIC-count)) == *(char *)(pbTarget-cbPattern+2+(countSTATIC-count)) ) {
				count--;
			}
			if ( count == 0 ) return((pbTarget-cbPattern));
		}
		pbTarget++;
		if (pbTarget > pbTargetMax) return(NULL);
	}
}


modified 23-Feb-15 14:50pm.

AnswerRe: Good but consider speed improvements Pin
Amir Mohammad Nasrollahi28-Feb-15 7:09
MemberAmir Mohammad Nasrollahi28-Feb-15 7:09 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.