Smart TextDiff Utility for Intel® 64 Platform

Arthur V. Ratz

Rate me:

4.94/5 (78 votes)

4 May 2016CPOL22 min read

54.6K

1.4K

Smart cross-platform textdiff lexicographical analyzer utility

This article demonstrates the implementation of a smart cross-platform textdiff lexicographical analyzer utility which can be used to find and locate the differences between two files.

Introduction

It’s more than 40 years since in 1970s Douglas McIlroy and James W. Hunt, developers in AT&T Bell Labs (Murrey Hill, New Jersey), introduced the first release of diff utility as part of the 5^th-gen Unix V distribution. diff is a small file comparison utility that performs the lexicographic analysis to find and locate the differences between two files as the result of changes being made to the contents of these files by a program during its execution.

Initially, the implementation of the diff utility was mainly based on using the either heuristic or Levinshtein distance probability characters-based algorithms. The main disadvantage of these algorithms is that the following algorithm actually perform a prediction of the characters to be inserted, deleted or modified at a certain position in a string based on each character’s probability computation, which makes it impossible to use the following algorithm to analyze and compare short strings since that to compute the distance between two predictable characters by using the following algorithms, we must have the information enough about the probability of each preceding character to be transformed into another one by performing those character’s insertion, deletion or modification, so that as many characters have been previously predicted with a certain probability as the most precise value of the edit-distance can be computed for the current character by using the Levinshtein distance algorithm. That’s actually why, the Levinshtein and heuristic algorithms are considered to be the least efficient to compare the short character’s sequences.

In this article, we’ll discuss about the algorithm that, unlike the Levinshtein distance and heuristic algorithms, allows to find the differences between two files based on the performing of context search to locate similar substrings persisting in both files using offset-based context search algorithm.

The following article also contains the ANSI C implementation of the alternative textdiff utility using the algorithm being discussed. The textdiff utility being introduced in the following article was developed and tested on the Windows platform, but its source code can also be used to maintain the following utility for the other platforms such as Linux or iMac. Also the following source code uses the 64-bit data types such as __int64 to provide the support for comparing and analyzing files having size of more than 1GB.

Since the development of file comparison and lexicographic analysis tools such as textdiff is still relevant and being a point of interest, there’re the large number of the similar products released by the different vendors including IDM UltraCompare, WinMerge as well as the DiffNow and DiffChecker online tools. Each of the mentioned above products has been implemented by using different algorithms providing the different results of files comparison and analysis.

The main goal of the following article is to demonstrate the algorithm which is capable of providing the optimal results of the file lexicographical comparison and analysis compared to the other existing text comparison tools and utilities.

As well as the basic algorithm discussion, in this article, we’ll perform the analysis of the source code of the textdiff utility being introduced by providing code snippets written in C to each phase of the algorithm being discussed throughout in this article.

Background

In this section, we’ll discuss more about the algorithm that allows to compare and analyze the contents of two files by performing a context search to find and compute the position of the all non-identical fragments that might persist as the result of the changes being made to the contents of these two files. The following algorithm can be used to perform the comparison of either plain-text or binary files.

Algorithm of Finding and Locating the Differences Between Two Files

The algorithm being discussed in this article is intended to lexicographically compare the contents of two files stored into the two previously allocated string buffers str1[] and str2[] at the beginning of code execution. To demonstrate the process of files comparison by using the following algorithm, we’ll use two plain-text files, each one containing a short text string that consists of the number of either identical or non-identical fragments. The non-identical fragments are typically those fragments contained in the first str1 [] buffer that are actually missing from the second buffer str2 [].

Before we begin, let’s go over some considerations that further will be used during the following algorithm discussion. Suppose we need to identify the all non-identical fragments contained in the first str1[] buffer. For that purpose, we’ll be using the second str2 [] buffer to find the fragments persisting in both str1[] and str2 [] buffers by performing the context search and actually exclude them to locate the starting and final positions of those non-identical fragments from str1 [] not contained in str2 []. Since that, we consider the second str2[] buffer to be a “pattern” of search. And vice versa, if we need to locate the same non-identical fragments from the str2[] buffer we’re treating the first str1[] buffer as the “pattern” of search.

Finding and Locating Non-Identical Fragments

Now, it’s time to take a closer look at the algorithm itself being discussed. Since we have two string buffers string [] and pattern [] which have not been previously analyzed, to find the first non-identical fragment, according to the following algorithm, we should perform loop iteration over the two string buffers, either string [] or pattern [], starting at the position index == 0 of the first character of these two buffers, lexicographically comparing each character in string [] buffer located at the current position index with a corresponding character from the pattern [] buffer until we’ve found a pair of unmatched characters located at the same position. During each loop iteration, the loop counter variable index is incremented and at the end of loop execution, the following variable is assigned to the position value of the first occurrence of a pair of characters from either string [] or pattern [] buffer that don’t actually match. The position value of index variable is actually an absolute value of the starting position for current non-identical fragment being found. Fig. 3. Illustrates the process described above:

In this case, the two characters “T” from string [] buffer and “D” from pattern [] located at the position index == 6 don’t match which indicates that the current non-identical fragment will start at the position index of the string [] buffer. The following code snippet implements the linear search algorithm that aims to find the first occurrence of a pair of characters that don’t actually match starting at the beginning of either the string [] and pattern [] buffers:

C++

// Iterate over two strings string[] and pattern[] comparing each character
// stored in string[] at the position index with a corresponding character
// in pattern[] located in the same position until we’ve found a pair of
// characters that don’t match. Note: At the end of the following loop
// execution the index loop counter variable will be assigned to the
// value of a starting position of the current fragment which differs
// for both string[] and pattern[] strings.
size_t index = 0;
while (string[index] == pattern[index] &&
    string[index] != '\0' && pattern[index] != '\0') index++;

At this point, since we’ve obtained the position value at which the current non-identical fragment starts, to locate this fragment, we also need to compute the value of its final position by performing the offset-based context search to find the starting position of the last longest occurrence of substring obtained from the pattern [] buffer starting at the position index, located at a certain position of the string [] buffer.

Let’s recall that, we basically deal with two special cases in which the non-identical fragments of these two string buffers could be found and located. In the first case, we’re performing the offset-based context search to find non-identical fragments that start at a certain position and can only be found and located in the string [] buffer. The non-identical fragments of this type are typically called – “insertions”. Similarly, we will be performing the similar offset-based context search to find non-identical “insertion” fragments contained only in the pattern [] buffer starting at the same position as in the previous case. Also, there’s another case in which a pair of two non-identical fragments starting at a certain position in either the string [] or pattern [] buffers causing each buffer to have different characters at the same positions, and at the same time, having different lengths. The fragments of this particular type are also called “overwrites” or “dual insertions”.

Now, let’s take a closer look at the algorithm being discussed. At this point, we initially start finding the final position of the first non-identical fragment by performing offset-based context search in the string [] buffer, which is quite similar to the trivial context search but with only one difference that the offset-bases context search is mainly based on performing the number of trivial context searches for each value of the offset loop counter variable.

Now it’s time to demonstrate how to find the final position of the current non-identical fragment from the previous example assuming that the following fragment is the “insertion” fragment persisting only in string [] buffer by performing the mentioned above offset-based context search.

In the following example, to demonstrate the process of the trivial context search performed by the offset-based context search algorithm for each offset value, we’ll initially assume that the final position of the first non-identical fragment can be obviously found by locating the starting position of a substring extracted from the pattern [] buffer having offset variable value equal to 0. In this particular case, using the offset-based context search is equivalent to performing the trivial context search, since we need to execute just one iteration of the outermost loop, for a single value of the variable offset == 0.

At this point, the first thing that we should do is to extract all characters from the substring in the pattern [] buffer starting at the position index == 6 and copy it into previously allocated temporary buffer. For example, if the current non-identical fragment starts at the position index == 6, then finally the temporary buffer will contain the substring “DOG JUMPS OVER A NICE FOX”.

By performing the trivial context search, we need to iterate over the string [] buffer starting at the position index and for each position pos in the string [] buffer assign the values of the current position pos in the string [] buffer and the position of the first character in the temporary buffer to the loop counter variables pos_s and pos_p respectively.

Then, we have to perform loop iteration over the string [] and the temporary buffer comparing each succeeding character in string [] buffer at the position pos_s with a corresponding character from the temporary buffer located at the position pos_p until we’ve found a character located at the position pos_s in the string [] buffer which doesn’t match a character located at the position pos_p in the temporary buffer. During the loop execution, we’re incrementing the values of pos_s and pos_p variables so that at the end of the following loop execution, the variables pos_s and pos_p will be assigned to the values of the positions of characters that don’t match in either the string [] or the temporary buffer.

During the context search described above, we don’t actually perform a search for the entire substring from the pattern [] buffer that exactly matches the substring starting at a certain position pos of the string [] buffer. Instead, we’re performing the context search by a partial match to find the position of last longest occurrence of a character’s sequence matching in both the pattern [] buffer and the string [] buffer starting at certain positions. After we’ve computed the value of longest matching substring length for each position pos in the string [] buffer, we have to determine the position value pos of the substring in string [] buffer for which the value of the substring length is the maximum.

For that purpose, iterating over the string [] buffer for each position pos for which the longest substring’s occurrence position is being found, we’re performing a check if the computed value of the current substring length is greater than the value of the maximum length so far. If so, we assign the length value of the substring at the current position to that variable used to store the currently maximum length value. Similarly, if the length of the current substring being found is greater than the maximum length value, we assign the value of the starting position pos for the current substring to max_pos variable. So that, finally at the end of loop execution, the following variable will be assigned to a value of the position pos of the last longest substring found in string [] buffer matching the substring from the pattern [] buffer starting at the position index. Fig. 4. Illustrates the process discussed above:

As an alternative, we might not copy the sequence of characters starting at the position index into a previously allocated temporary buffer. Instead, since the value of the position index has been obtained, we can treat the sequence of characters in pattern [] buffer located at the positions which are greater or equal to the position index as the substring for which we’ll be finding the occurrence in the string [] buffer. In this case, actually we need to perform the same context search as it has been already discussed above, but assuming that the first character of the temporary buffer is actually located at the position index in the pattern [] buffer. Then, we’re comparing each character in pattern [] buffer starting at the position index with a corresponding character at string [] buffer located at the position pos_s as it has already been discussed.

For example, for each character starting at the character “T” in the string [] buffer we’re performing a check if it’s equal to the character “D” in the substring contained within the pattern [] buffer. During each iteration, we’re similarly comparing the characters “T” and “D”, and then “I” and “D” until we encounter that the character “D” at the position pos_s == 8 in the string [] buffer exactly matches the same character “D” located at the position index in the pattern [] buffer. At this point, we’re comparing succeeding characters starting at the positions pos_s = 8 and pos_p = 6 in both string [] and pattern [] buffers respectively ending up at the position pos_s = 9 and pos_p = 7 since that the two characters “E” and “O” are not the equal characters. Obviously that the following substring contains only one character and has the length of 1. As it has already been discussed above since this is the first value of length, we normally are assigning the value of length of the current substring to the variable max_len = 1. Respectively, we’re assigning the value of the starting position pos of the longest occurrence of the substring to the max_pos variable so that max_pos = 8.

Now we’re proceeding to compare each next character from the string [] buffer with the character located at the position index == 6 at the pattern [] buffer. We’re normally comparing the characters “D” and “E”, “D” and “<space>” until we’ve found that the two characters at the positions pos == 11 and position index == 6 are identical. At this point, we’re similarly comparing each pair of characters starting at the positions pos_s = 10 and pos_e == 6 until we’ve found two characters “L” and “N” that don’t match at the position pos_s == 28 in the string [] buffer and pos_p == 23 in the pattern [] buffer respectively. Notice that all pairs of characters located between the position index and the pos_p in the pattern [] buffer exactly match. Similarly, as has been discussed above, at this point, we need to compute the length of the substring obtained by subtracting the value of index variable from the value of the pos_p variable: pos_p – index = 23 – 6 = 17. As we can see, the length value of the current substring is much greater than the value of the max_len variable being previously obtained. In this case, we assign the value of difference pos_p – index to the max_len variable so that max_len = 17. Also, we assign the value of the position pos of the substring being found to the max_pos variable so that max_pos = pos = 11.

After we have obtained the final position of the first non-identical fragment by locating the last longest occurrence of the substring extracted from the pattern [] buffer in the string [] buffer starting at the position index + offset, now we proceed the iterations by performing the discussed above trivial context search for each succeeding value of the offset variable.

By performing the offset-based context search, for each value of the offset variable during each loop iteration, we’re actually extracting a particular substring from the pattern [] buffer starting at the position index + offset and for each particular string, at the end of each iteration, we’re performing a check if the value of length of the last longest occurrence of the substring being extracted max_len for a specific value of the offset is greater than the length of the substring having the maximum possible length so far. If so, we’re assigning the values of max_len to the chunk_info[0].max_len variable and proceed with another substring for the next value of the offset variable, so that, finally, the chunk_info[0].max_len variable will be assigned to the maximum possible length value of the last occurrence of the substring being extracted from the pattern [] buffer, found in the string [] buffer during the performing of the trivial context search for a specific value of the offset variable. Note that, unlike finding the “insertions” discussed above, in this particular case, by performing the offset-based context search, we need to find the last longest occurrence of the substring located starting at the minimal position, having the minimal distance to the position index. That’s actually why, during each iteration, for each value of the offset variable, we also compare the value of the position for the current substring with the value of the currently minimal position so far. In the other words, if a certain substring has the greater length than the current maximum length so far and at the same time start at the position closer to the position index, the variables chunk_info[0].max_pos and chunk_info[0].max_len are assigned to the values of the position and length of the current substring for the current value of the offset variable.

Since the final position value for the current non-identical fragment has been obtained and assigned to the chunk_info[0].max_pos variable, then we’re performing the check if the current non-identical fragment is actually the “insertion” fragment by comparing the value of the position index with the value of pos_p variable. If these values are equal, then the following non-identical fragment is actually the “insertion” fragment and persists only in the first string [] buffer.

Now at this point, we have to extract the non-identical fragment being found using the values of the position index and the value of final position assigned to the chunk_info[0].max_pos variable by copying the following fragment from the string [] buffer into the separate buffer diff. In this case, the non-identical fragment being found is “TIDE”.

C++

// Allocating diff buffer and copying the current
// non-identical "insertion" fragment into diff buffer
char* diff = new char[chunk_info[0].max_pos - index + 1];
memset((void*)diff, 0x00, chunk_info[0].max_pos - index + 1);
memcpy_s((void*)diff, chunk_info[0].max_pos - index,
    (const void*)&string[index], chunk_info[0].max_pos - index);

// Allocating diff buffer and copying the current
// non-identical "insertion" fragment into diff buffer
char* diff = new char[chunk_info[1].max_pos - index + 1];
memset((void*)diff, 0x00, chunk_info[1].max_pos - index + 1);
memcpy_s((void*)diff, chunk_info[1].max_pos - index,
    (const void*)&pattern[index], chunk_info[1].max_pos - index);

Similarly, at this point, we’ll use the same trivial context algorithm to obtain the value of the final position of the current no-identical fragment contained within the pattern [] buffer starting at the same position index == 6. As we already know, for that purpose, we’ll consider that the string [] buffer is actually a pattern of search. To demonstrate the following process, let’s now use a bit modified “pattern” string that contains the character “T” at the position 28 of the pattern [] buffer.

By performing the context search in the pattern [] buffer, we’re finally locating the last longest occurrence of the substring that is starting at the position pos_p = 28 and consists of just one character “T” within. In this case, the value of the pos_p variable is assigned to the chunk_info[1].max_pos = 28 variable while the value of the substring length is assigned to the variable chunk_info[1].max_len equal to 1.

The code fragment that implements the offset-based context search algorithm discussed above that performs the location of the longest occurrence of the substring:

C++

// Iterating over pattern [] buffer incrementing position
// loop counter variable offset until we reach the end of buffer
for (size_t offset = 0; offset < strlen(pattern); offset++)
{
    long long max_pos = -1, max_len = -1;
    // Iterating over string [] buffer incrementing position
    // loop counter variable pos until we reach the end of buffer
    for (size_t pos = index; string[pos] != '\0'; pos++)
    {
        // Assigning the variable pos_s the value of the current position
        // pos in string [] buffer. Computing the value of the offset in
        // pattern [] buffer from which we're extracting a substring to
        // perform the context search in the string [] buffer.
        size_t pos_s = pos, pos_p = index + offset;
        // If the these both characters are equal we're iterating over
        // pattern [] and string [] buffers starting at positions pos_s
        // and pos_p respectively until we incur a pair of succeeding
        // characters that actually don't actually match.
        while (pos_s < size1 && pos_p < size2 &&
            string[pos_s] == pattern[pos_p] &&
            string[pos_s] != '\0' && pattern[pos_p] != '\0')
        {
            // Incrementing the values of pos_s and pos_p variables.
            pos_s++; pos_p++;
        }

        // Checking if the length of the current substring found in pattern []
        // buffer has the greater value than the maximum value of length so far.
        if (pos_s - pos >= max_len || max_len == -1)
        {
            // If so, we're assigning the value of the current length to max_len
            // as well as assigning the value of the current position to max_pos
            max_pos = pos;
            max_len = pos_s - pos;
        }
    }

    // Checking if the length of the current substring found in pattern []
    // buffer starting at the position offset + index has the greater value
    // than the maximum value of length so far for the current offset value.
    if (max_len > 1 && max_len >= chunk_info[0].max_len ||
                                 chunk_info[0].max_len == -1)
    {
        // If so, we're assigning the value of the current length
       // to chunk_info[0].max_len
        // as well as assigning the value of the current position
       // to chunk_info[0].max_pos
        chunk_info[0].max_len = max_len;
        // Also we're performing the check if the value of position
       // max_pos at this point is less than the minimum position value
        // chunk_info[2].max_pos so far for the current offset value
        if (max_pos < chunk_info[0].max_pos || chunk_info[0].max_pos == -1)
            chunk_info[0].max_pos = max_pos;
    }
}

size_t pos_p1 = index;
// Iterating over the pattern [] buffer and performing the
// linear search to find the first occurrence of the character
// which is equal to the character located at the first position
// of a substring that meets the distance-based minimization criteria
// found in the string [] buffer
while (pattern[pos_p1] != string[chunk_info[0].max_pos] &&
    pattern[pos_p1] != '\0' && chunk_info[0].max_pos > -1) pos_p1++;

Since we’ve finally found the first non-identical fragment of the string [] buffer “TIDE”, now we have to indent both string [] and pattern [] buffers to the position index at which the following fragment has been previously found. For that purpose, we have to copy the contents of the string [] buffer starting at the final position of the non-identical fragment being found to the position index. Otherwise, if the current non-identical fragment is found in the pattern [] buffer, we have to copy the contents of the pattern [] buffer starting at the final position of the non-identical fragment to the same position index in the pattern [] buffer.

C++

// Indenting the string [] buffers by copying the string []
// buffers contents starting at the positions index + <buffer length>
// to the position index
memcpy_s((void*)temp_s, size1 + 1, (const void*)&string[index + \
    strlen(diff1)], strlen(&string[index + strlen(diff1)]));

// Indenting the pattern [] buffers by copying the pattern []
// buffers contents starting at the positions index + <buffer length>
// to the position index
memcpy_s((void*)temp_p, size2 + 1, (const void*)&pattern[index + \
    strlen(diff2)], strlen(&pattern[index + strlen(diff2)]));

Let’s now spend more time to discuss another example of finding and locating the “overwrite” non-identical fragments. Suppose we’ve already found a certain non-identical fragment “TIDE” located at the position index == 6 of the string [] buffer. Obviously that, this is an “insertion” fragment. After locating the following fragment, the two string buffers, string [] and pattern [] are indented to the position 0 so that both of them will contain particular strings starting at the positions 6 and 11 respectively:

After those two buffers string [] and pattern [] have been indented, we’re actually performing the same iteration over those two string buffers similarly comparing each pair of characters contained in both string [] and pattern [] buffer until we incur the pair of characters that don’t actually match. In this case, the first pair of the unmatched characters “L” and “N” is obviously located at the position index == 0. So that during the first iteration of the loop at which the value of the offset variable equal to 0, we’re actually extracting the substring “NICE FOX” and copy it into the temporary buffer. Then, we’re performing the trivial context search to find the last longest occurrence of that substring the string [] buffer. Unfortunately, in this particular case, there’s no suitable substring contained in the string [] buffer that would match the substring contained in the temporary buffer. Then we’re proceeding to perform the next loop iteration for the value of variable offset == 1. Similarly, we’re extracting the substring “ICE FOX” from the pattern [] buffer and perform the same trivial context search. In this case, the following substring also cannot be found in the string buffer. After that, we’re proceeding with the substring “CE FOX”, “E FOX” until we incur the substring “ FOX”, for which could be found at the position pos == 5 in the string [] buffer and has the length of max_len == 3. Fig. 6 illustrates the process of finding the two non-identical “overwrite” fragment which could be found in string [] and pattern [] buffers:

Since the position of the current substring has been obtained, we’re treating this position as the position of the final character of that non-identical fragment which we’re actually searching for. In this case, the following character is “<space>”. Then we’re performing the obvious linear search of the following character in pattern [] buffer and locate it at the position pos_p == 4 assuming that the following position is the final position of another non-identical fragment found in the pattern [] buffer. Now, having the starting and final position for both non-identical “overwrite” fragments from either string [] or pattern [] buffers, let’s extract those fragments so that from the string [] buffer we would extract the substring “LARGE” and from the pattern [] buffer – the substring “NICE”.

C++

// Allocating diff1 buffer and copying the current
// non-identical "overwrite" fragment from string [] buffer into diff1 buffer
char* diff1 = new char[chunk_info[0].max_pos - index + 1];
memset((void*)diff1, 0x00, chunk_info[0].max_pos - index + 1);
memcpy_s((void*)diff1, chunk_info[0].max_pos - index,
    (const void*)&string[index], chunk_info[0].max_pos - index);

// Allocating diff2 buffer and copying the current
// non-identical "overwrite" fragment from pattern [] buffer into diff1 buffer
char* diff2 = new char[chunk_info[1].max_pos - index + 1];
memset((void*)diff2, 0x00, chunk_info[1].max_pos - index + 1);
memcpy_s((void*)diff2, chunk_info[1].max_pos - index,
    (const void*)&pattern[index], chunk_info[1].max_pos - index);

We’re performing the loop iteration, proceeding to find the next current non-identical buffers until we incur the entire contents of the string [] and pattern [] buffers become identical as the result of performing the indenting after each non-identical fragment is being found and located. At this point, we similarly start from the beginning of the string [] and pattern [] buffers. Normally, we perform the iterations of the outermost controlling loop until we incur that both string [] and pattern [] buffers are identical or empty having no characters to be compared.

C++

ullong diff_pos[2] = { 0 };
bool parsed = false; int nbuf = 0;
while (strcmp(string, pattern) != 0 && parsed == false)
{
}

Also there’re special cases at which the either string [] or pattern [] buffers doesn’t actually contain any non-identical fragments but are still considered to be different. This normally occurs when string [] contains a fragment missing from pattern [] buffer located at the position beyond of the last character in pattern [] buffer and vice versa. In this case, we should simply extract the fragment located beyond the last position in the pattern [] buffer from the string [] buffer assuming that the following fragment to be non-identical:

C++

// Check if the value of index loop counter variable obtained during
// the discussed above loop execution exceeds the value of actual size
// of the pattern[] string and either is assigned the value of the last
// character in string which means that the string[] contains a fragment
// of text missing from pattern[] located at the position beyond the
// last character in pattern[]. This normally happens due to string[]
// and pattern[] sizes mismatch.
if (index >= strlen(pattern) && pattern[index] == '\0')
{
    // Allocating a diff[] characters buffer to store the current distinct
    // fragment from the string[] buffer starting at the position index.
    char* diff = new char[strlen(string) - index + 1];
    memset((void*)diff, 0x00, strlen(string) - index + 1);
    // Copy the current distinct fragment from string[] starting
    // at the position index to the empty diff[] buffer.
    memcpy_s((void*)diff, strlen(string) - index,
        (const void*)&string[index], strlen(string) - index);

    // Setting the parsed flag to the value of "true" indicating that we've
    // reached the end of either string[] or pattern[] buffer.
    parsed = true;

    // Check if the diff[] buffer is not empty and
    // contains the fragment previously obtained.
    if (strcmp("\0", diff) < 0)
        // Invoking the callback routine passing the values of
        // the contents of the diff[] buffer, absolute position
        // of the current distinct fragment in string[], the size
        // of the diff[] buffer.
        df(diff, diff_pos[0] + index, strlen(string) - index, 0);

    // Increment the value of the current distinct string[] fragment position
    // by the value of the actual fragment's length relative to the beginning
    // of the string[] buffer.
    diff_pos[0] += strlen(string) - index;
}

// Check if the value of index loop counter variable exceeds
// the value of actual size of the string[] string and perform
// similar listed above tasks as it has been previously discussed.
else if (index >= strlen(string) && string[index] == '\0')
{
    // Allocating a diff[] characters buffer to store the current distinct
    // fragment from the pattern[] buffer starting at the position index.
    char* diff = new char[strlen(pattern) - index + 1];
    memset((void*)diff, 0x00, strlen(pattern) - index + 1);
    // Copy the current distinct fragment from pattern[] starting
    // at the position index to the empty diff[] buffer.
    memcpy_s((void*)diff, strlen(pattern) - index,
        (const void*)&pattern[index], strlen(pattern) - index);

    // Setting the parsed flag to the value of "true" indicating that we've
    // reached the end of either string[] or pattern[] buffer.
    parsed = true;

    // Check if the diff[] buffer is not empty and
    // contains the fragment previously obtained.
    if (strcmp("\0", diff) < 0)
        // Invoking the callback routine passing the values of
        // the contents of the diff[] buffer, absolute position
        // of the current distinct fragment in pattern[], the size
        // of the diff[] buffer.
        df(diff, diff_pos[1] + index, strlen(pattern) - index, 1);

    // Increment the value of the current distinct pattern[] fragment position
    // by the value of the actual fragment's length relative to the beginning
    // of the pattern[] buffer.
    diff_pos[1] += strlen(pattern) - index;
}

Another important task that we need to perform while finding and locating each non-identical fragment is the computation of the absolute position for each non-identical fragment being found. Depending on what type of non-identical fragments either the “insertion” or “overwrite” being found using different methods of the position value computation.

Let’s remind that the value of absolute position of a non-identical fragment being found is not the same for either string [] and pattern [] buffer. To compute that value, we use the array diff_pos[2] in which the first item diff_pos[0] will be assigned to the position for each non-identical fragment being found in string [] buffer and the second item diff_pos[1] will be used to store the position of a fragment from pattern []. To compute the position of each “overwrite” fragments found in string [] and pattern [], we simply add the value of the length of each fragment to either diff_pos[0] or diff_pos[1] counter variables respectively, so that the values of diff_pos[0] and diff_pos[1] will contain the position for the next non-identical fragment being found. To determine the position of non-identical “insertion” fragments, first we need to determine in which string buffer either string [] or pattern [] the non-identical “insertion” fragment has been found. If the non-identical “insertion” fragment is found in the string buffer, we increment the value of diff_pos[0] data item by the value of the position index and also the length of the chunks[0] buffer from the array chunks in which the following fragment has been copied. Then we need to increment the value of diff_pos[1] data item by the value of the position index. Similarly, if the non-identical “insertion” fragment was found in the pattern [] buffer, then we’re incrementing the value of diff_pos[1] by the value of the position index and the length of the chunks[1] buffer, and the value of diff_pos[0] is incremented by the value of the position index:

C++

// Incrementing the value of the position in pattern[] buffer by
// value of diff_pos[1] data item by the value of the position index
diff_pos[0] += index;
// Computing the value of the position of the non-identical
// fragment found in the pattern [] buffer incrementing the value
// of diff_pos[1] data item by the values of the fragment length
// and its starting position.
diff_pos[1] += strlen(diff) + index;

// Computing the value of the position of the non-identical
// fragment found in the string [] buffer incrementing the value
// of diff_pos[0] data item by the values of the fragment length
// and its starting position.
diff_pos[0] += strlen(diff1) + index;
// Computing the value of the position of the non-identical
// fragment found in the pattern [] buffer incrementing the value
// of diff_pos[1] data item by the values of the fragment length
// and its starting position.
diff_pos[1] += strlen(diff2) + index;

Using the Code

C++

// TextDiff.cpp : Defines the entry point for the console application.
//

#include "stdafx.h"

#include <stdio.h>
#include <stdlib.h>
#include <conio.h>
#include <string.h>

#define N_CHUNKS 3

#pragma warning(disable:4018)

typedef __int64 ullong;
typedef void(*diffproc)(const char* data, ullong pos, ullong len, unsigned nbuf);

static char** filenames = NULL;

typedef struct tagChunkInfo
{
    ullong max_pos;
    ullong max_len;
} CHUNK_INFO;

template<typename ctype>
void swap(ctype& str1, ctype& str2)
 { ctype temp = str1; str1 = str2; str2 = temp; }

bool loadfromfile(const char* filename, char*& buffer)
{
    FILE* fp = NULL;
    if (fopen_s(&fp, filename, "rb+") != 0)
        return 0;

    fseek(fp, 0, SEEK_END);
    size_t fsize = ftell(fp);
    fseek(fp, 0, SEEK_SET);

    if (buffer == NULL)
    {
        buffer = new char[fsize + sizeof(char)];
        memset((void*)buffer, 0x00, fsize + sizeof(char));
    }

    char ch = '\0'; ullong index = 0;
    while ((ch = fgetc(fp)) != EOF)
        buffer[index++] = ch;

    buffer[index] = '\0';

    fclose(fp);

    return 1;
}

void textdiff(const char* str1, const char* str2, diffproc df)
{
    size_t size1 = strlen(str1);
    size_t size2 = strlen(str2);

    char* string = const_cast<char*>(str1);
    char* pattern = const_cast<char*>(str2);

    ullong diff_pos[2] = { 0 };
    bool parsed = false; int nbuf = 0;
    while (strcmp(string, pattern) != 0 && parsed == false)
    {
        // Iterate over two strings string[] and pattern[] comparing each character
        // stored in string[] at the position index with a corresponding character
        // in pattern[] located in the same position until we’ve found a pair of
        // characters that don’t match. Note: At the end of the following loop
        // execution the index loop counter variable will be assigned to the
        // value of a starting position of the current fragment which differs
        // for both string[] and pattern[] strings.
        size_t index = 0;
        while (string[index] == pattern[index] &&
            string[index] != '\0' && pattern[index] != '\0') index++;

        // Check if the position index obtained is the last position
        // in the string[] and pattern[] buffers
        if (index == strlen(string) - 1 && index == strlen(pattern) - 1)
        {
            // Allocating diff1 buffer and copying the current
            // non-identical "overwrite" fragment from string [] buffer into diff1 buffer
            char* diff1 = new char[sizeof(char) + 1];
            memset((void*)diff1, 0x00, sizeof(char) + 1);

            // Allocating diff2 buffer and copying the current
            // non-identical "overwrite" fragment from pattern [] buffer into diff1 buffer
            char* diff2 = new char[sizeof(char) + 1];
            memset((void*)diff2, 0x00, sizeof(char) + 1);

            parsed = true;

            diff1[0] = string[index]; diff2[0] = pattern[index];

            // Invoking the callback function df passing the parameters of
            // the non-identical fragment contents, the current position of the
            // fragment in string [] buffer and its length. The current position is
            // computed as the previous position stored in diff_pos[0] data item
            // incremented by the value of the starting position index.
            df(diff1, diff_pos[0] + index, strlen(diff1), 0);

            // Invoking the callback function df passing the parameters of
            // the non-identical fragment contents, the current position of the
            // fragment in pattern [] buffer and its length. The current position is
            // computed as the previous position stored in diff_pos[1] data item
            // incremented by the value of the starting position index.
            df(diff2, diff_pos[1] + index, strlen(diff2), 1);

            delete[] diff1; diff1 = NULL;
            delete[] diff2; diff2 = NULL;
        }

        // Check if the value of index loop counter variable obtained during
        // the discussed above loop execution exceeds the value of actual size
        // of the pattern[] string and either is assigned the value of the last
        // character in string which means that the string[] contains a fragment
        // of text missing from pattern[] located at the position beyond the
        // last character in pattern[]. This normally happens due to string[]
        // and pattern[] sizes mismatch.
        else if (index >= strlen(pattern) && pattern[index] == '\0')
        {
            // Allocating a diff[] characters buffer to store the current distinct
            // fragment from the string[] buffer starting at the position index.
            char* diff = new char[strlen(string) - index + 1];
            memset((void*)diff, 0x00, strlen(string) - index + 1);
            // Copy the current distinct fragment from string[] starting
            // at the position index to the empty diff[] buffer.
            memcpy_s((void*)diff, strlen(string) - index,
                (const void*)&string[index], strlen(string) - index);

            // Setting the parsed flag to the value of "true" indicating that we've
            // reached the end of either string[] or pattern[] buffer.
            parsed = true;

            // Check if the diff[] buffer is not empty and
            // contains the fragment previously obtained.
            if (strcmp("\0", diff) < 0)
                // Invoking the callback routine passing the values of
                // the contents of the diff[] buffer, absolute position
                // of the current distinct fragment in string[], the size
                // of the diff[] buffer.
                df(diff, diff_pos[0] + index, strlen(string) - index, 0);

            // Increment the value of the current distinct string[] fragment position
            // by the value of the actual fragment's length relative to the beginning
            // of the string[] buffer.
            diff_pos[0] += strlen(string) - index;

            delete[] diff; diff = NULL;
        }

        // Check if the value of index loop counter variable exceeds
        // the value of actual size of the string[] string and perform
        // similar listed above tasks as it has been previously discussed.
        else if (index >= strlen(string) && string[index] == '\0')
        {
            // Allocating a diff[] characters buffer to store the current distinct
            // fragment from the pattern[] buffer starting at the position index.
            char* diff = new char[strlen(pattern) - index + 1];
            memset((void*)diff, 0x00, strlen(pattern) - index + 1);
            // Copy the current distinct fragment from pattern[] starting
            // at the position index to the empty diff[] buffer.
            memcpy_s((void*)diff, strlen(pattern) - index,
                (const void*)&pattern[index], strlen(pattern) - index);

            // Setting the parsed flag to the value of "true" indicating that we've
            // reached the end of either string[] or pattern[] buffer.
            parsed = true;

            // Check if the diff[] buffer is not empty and
            // contains the fragment previously obtained.
            if (strcmp("\0", diff) < 0)
                // Invoking the callback routine passing the values of
                // the contents of the diff[] buffer, absolute position
                // of the current distinct fragment in pattern[], the size
                // of the diff[] buffer.
                df(diff, diff_pos[1] + index, strlen(pattern) - index, 1);

            // Increment the value of the current distinct pattern[] fragment position
            // by the value of the actual fragment's length relative to the beginning
            // of the pattern[] buffer.
            diff_pos[1] += strlen(pattern) - index;

            delete[] diff; diff = NULL;
        }

        // Otherwise, if the index loop counter variable is assigned to a value
        // that doesn't exceed the values of the final position in string[],
        // pattern[] and the current distinct fragment is located within string[]
        // and pattern[] buffers.
        else
        {
            // Initializing CHUNK_INFO[] buffer to store position and length values
            // of distinct fragments obtained while processing string[] and pattern[] buffers.
            CHUNK_INFO* chunk_info = new CHUNK_INFO[N_CHUNKS];
            memset((void*)chunk_info, 0x00, sizeof(CHUNK_INFO) * N_CHUNKS);

            for (int ci = 0; ci <= N_CHUNKS - 1; ci++)
            {
                chunk_info[ci].max_pos = -1;
                chunk_info[ci].max_len = chunk_info[ci].max_pos;
            }

            // Iterating over pattern [] buffer incrementing position
            // loop counter variable offset until we reach the end of buffer
            for (size_t offset = 0; offset < strlen(pattern); offset++)
            {
                long long max_pos = -1, max_len = -1;
                // Iterating over string [] buffer incrementing position
                // loop counter variable pos until we reach the end of buffer
                for (size_t pos = index; string[pos] != '\0'; pos++)
                {
                    // Assigning the variable pos_s the value of the current position
                    // pos in string [] buffer. Computing the value of the offset in
                    // pattern [] buffer from which we're extracting a substring to
                    // perform the context search in the string [] buffer.
                    size_t pos_s = pos, pos_p = index + offset;
                    // If the these both characters are equal we're iterating over
                    // pattern [] and string [] buffers starting at positions pos_s
                    // and pos_p respectively until we incur a pair of succeeding
                    // characters that don't actually match.
                    while (pos_s < size1 && pos_p < size2 &&
                        string[pos_s] == pattern[pos_p] &&
                        string[pos_s] != '\0' && pattern[pos_p] != '\0')
                    {
                        // Incrementing the values of pos_s and pos_p variables.
                        pos_s++; pos_p++;
                    }

                    // Checking if the length of the current substring found in pattern []
                    // buffer has the greater value than the maximum value of length so far.
                    if (pos_s - pos >= max_len || max_len == -1)
                    {
                        // If so, we're assigning the value of the current length to max_len
                        // as well as assigning the value of the current position to max_pos
                        max_pos = pos;
                        max_len = pos_s - pos;
                    }
                }

                // Checking if the length of the current substring found in pattern []
                // buffer starting at the position offset + index has the greater value
                // than the maximum value of length so far for the current offset value.
                if (max_len > 1 && max_len >= chunk_info[0].max_len || 
                                             chunk_info[0].max_len == -1)
                {
                    // If so, we're assigning the value of the current length 
                   // to chunk_info[0].max_len
                    // as well as assigning the value of the current position 
                   // to chunk_info[0].max_pos
                    chunk_info[0].max_len = max_len;
                    // Also we're performing the check if the value of 
                   // position max_pos at this point is
                    // less than the minimum position value chunk_info[2].max_pos 
                   // so far for the current offset value
                    if (max_pos < chunk_info[0].max_pos || chunk_info[0].max_pos == -1)
                        chunk_info[0].max_pos = max_pos;
                }
            }

            size_t pos_p1 = index;
            // Iterating over the pattern [] buffer and performing the
            // linear search to find the first occurrence of the character
            // which is equal to the character located at the first position
            // of a substring that meets the distance-based minimization criteria
            // found in the string [] buffer
            while (pattern[pos_p1] != string[chunk_info[0].max_pos] &&
                pattern[pos_p1] != '\0' && chunk_info[0].max_pos > -1) pos_p1++;

            // Iterating over pattern [] buffer incrementing position
            // loop counter variable offset until we reach the end of buffer
            for (size_t offset = 0; offset < strlen(string); offset++)
            {
                long long max_pos = -1, max_len = -1;
                // Iterating over string [] buffer incrementing position
                // loop counter variable pos until we reach the end of buffer
                for (size_t pos = index; pattern[pos] != '\0'; pos++)
                {
                    // Assigning the variable pos_s the value of the current position
                    // pos in string [] buffer. Computing the value of the offset in
                    // pattern [] buffer from which we're extracting a substring to
                    // perform the context search in the string [] buffer.
                    size_t pos_s = pos, pos_p = index + offset;
                    while (pos_s < size2 && pos_p < size1 &&
                        pattern[pos_s] == string[pos_p] &&
                        pattern[pos_s] != '\0' && string[pos_p] != '\0')
                    {
                        // Incrementing the values of pos_s and pos_p variables.
                        pos_s++; pos_p++;
                    }

                    // Checking if the length of the current substring found in pattern []
                    // buffer has the greater value than the maximum value of length so far.
                    if (pos_s - pos >= max_len || max_len == -1)
                    {
                        // If so, we're assigning the value of the current length to max_len
                        // as well as assigning the value of the current position to max_pos
                        max_pos = pos;
                        max_len = pos_s - pos;
                    }
                }

                // Checking if the length of the current substring found in pattern []
                // buffer starting at the position offset + index has the greater value
                // than the maximum value of length so far for the current offset value.
                if (max_len > 1 && max_len >= chunk_info[1].max_len || 
                                             chunk_info[1].max_len == -1)
                {
                    // If so, we're assigning the value of the current length to 
                   // chunk_info[0].max_len
                    // as well as assigning the value of the current position to 
                   // chunk_info[0].max_pos
                    chunk_info[1].max_len = max_len;
                    // Also we're performing the check if the value of 
                   // position max_pos at this point is less than the minimum position
                    // value chunk_info[2].max_pos so far for the current offset value
                    if (max_pos < chunk_info[1].max_pos || chunk_info[1].max_pos == -1)
                        chunk_info[1].max_pos = max_pos;
                }
            }

            size_t pos_p2 = index;
            // Iterating over the pattern [] buffer and performing the
            // linear search to find the first occurrence of the character
            // which is equal to the character located at the first position
            // of a substring that meets the distance-based minimization criteria
            // found in the string [] buffer
            while (string[pos_p2] != pattern[chunk_info[1].max_pos] &&
                pattern[pos_p2] != '\0' && chunk_info[1].max_pos > -1) pos_p2++;

            if (pos_p1 == index)
            {
                // Allocating diff buffer and copying the current
                // non-identical "insertion" fragment into diff buffer
                char* diff = new char[chunk_info[0].max_pos - index + 1];
                memset((void*)diff, 0x00, chunk_info[0].max_pos - index + 1);
                memcpy_s((void*)diff, chunk_info[0].max_pos - index,
                    (const void*)&string[index], chunk_info[0].max_pos - index);

                // Invoking the callback function df passing the parameters of
                // the non-identical fragment contents, the current position of the
                // fragment in string [] buffer and its length. The current position is
                // computed as the previous position stored in diff_pos[0] data item
                // incremented by the value of the starting position index.
                df(diff, diff_pos[0] + index, strlen(diff), 0);

                // Allocating temporary buffer temp_s to store the indented string [] buffer
                char* temp_s = new char[size1 + 1];
                memset((void*)temp_s, 0x00, size1 + 1);

                // Indenting the string [] buffers by copying the string []
                // buffers contents starting at the positions index + <buffer length>
                // to the position index
                memcpy_s((void*)temp_s, size1 + 1, (const void*)&string[index + \
                    (chunk_info[0].max_pos - index)], 
                   strlen(&string[index + (chunk_info[0].max_pos - index)]));

                // Allocating temporary buffer temp_p to store the indented pattern [] buffer
                char* temp_p = new char[size2 + 1];
                memset((void*)temp_p, 0x00, size2 + 1);

                // Indenting the pattern [] buffers by copying the pattern []
                // buffers contents starting at the positions index + <buffer length>
                // to the position index
                memcpy_s((void*)temp_p, size2 + 1, (const void*)&pattern[index + \
                    (pos_p1 - index)], strlen(&pattern[index + (pos_p1 - index)]));

                // Checking if the temp_s and temp_p buffers were initialized
                if (temp_s != NULL && temp_p != NULL)
                {
                    // If so, copying the temp_s [] buffer into the string [] buffer
                    strcpy_s(string, size1 + 1, temp_s);
                    // If so, copying the temp_p [] buffer into the pattern [] buffer
                    strcpy_s(pattern, size2 + 1, temp_p);

                    delete[] temp_s; temp_s = NULL;
                    delete[] temp_p; temp_p = NULL;
                }

                // Computing the value of the position of the non-identical
                // fragment found in the string [] buffer incrementing the value
                // of diff_pos[0] data item by the values of the fragment length
                // and its starting position.
                diff_pos[0] += strlen(diff) + index;
                // Incrementing the value of the position in pattern [] buffer by
                // value of diff_pos[1] data item by the value of the position index
                diff_pos[1] += index;

                delete[] diff; diff = NULL;
            }

            else if (pos_p2 == index)
            {
                // Allocating diff buffer and copying the current
                // non-identical "insertion" fragment into diff buffer
                char* diff = new char[chunk_info[1].max_pos - index + 1];
                memset((void*)diff, 0x00, chunk_info[1].max_pos - index + 1);
                memcpy_s((void*)diff, chunk_info[1].max_pos - index,
                    (const void*)&pattern[index], chunk_info[1].max_pos - index);

                // Invoking the callback function df passing the parameters of
                // the non-identical fragment contents, the current position of the
                // fragment in string [] buffer and its length. The current position is
                // computed as the previous position stored in diff_pos[0] data item
                // incremented by the value of the starting position index.
                df(diff, diff_pos[1] + index, strlen(diff), 1);

                // Allocating temporary buffer temp_s to store the indented string [] buffer
                char* temp_s = new char[size1 + 1];
                memset((void*)temp_s, 0x00, size1 + 1);
                
                // Indenting the string [] buffers by copying the string []
                // buffers contents starting at the positions index + <buffer length>
                // to the position index
                memcpy_s((void*)temp_s, size1 + 1, (const void*)&string[index + \
                    (pos_p2 - index)], strlen(&string[index + (pos_p2 - index)]));

                // Allocating temporary buffer temp_p to store the indented pattern [] buffer
                char* temp_p = new char[size2 + 1];
                memset((void*)temp_p, 0x00, size2 + 1);

                // Indenting the pattern [] buffers by copying the pattern []
                // buffers contents starting at the positions index + <buffer length>
                // to the position index
                memcpy_s((void*)temp_p, size2 + 1, (const void*)&pattern[index + \
                    (chunk_info[1].max_pos - index)], 
                    strlen(&pattern[index + (chunk_info[1].max_pos - index)]));
                
                // Checking if the temp_s and temp_p buffers were initialized
                if (temp_s != NULL && temp_p != NULL)
                {
                    // If so, copying the temp_s [] buffer into the string [] buffer
                    strcpy_s(string, size1 + 1, temp_s);
                    // If so, copying the temp_p [] buffer into the pattern [] buffer
                    strcpy_s(pattern, size2 + 1, temp_p);

                    delete[] temp_s; temp_s = NULL;
                    delete[] temp_p; temp_p = NULL;
                }

                // Incrementing the value of the position in pattern[] buffer by
                // value of diff_pos[1] data item by the value of the position index
                diff_pos[0] += index;
                // Computing the value of the position of the non-identical
                // fragment found in the pattern [] buffer incrementing the value
                // of diff_pos[1] data item by the values of the fragment length
                // and its starting position.
                diff_pos[1] += strlen(diff) + index;

                delete[] diff; diff = NULL;
            }

            else
            {
                // Allocating diff1 buffer and copying the current
                // non-identical "overwrite" fragment from string [] buffer into diff1 buffer
                char* diff1 = new char[chunk_info[0].max_pos - index + 1];
                memset((void*)diff1, 0x00, chunk_info[0].max_pos - index + 1);
                memcpy_s((void*)diff1, chunk_info[0].max_pos - index,
                    (const void*)&string[index], chunk_info[0].max_pos - index);

                // Invoking the callback function df passing the parameters of
                // the non-identical fragment contents, the current position of the
                // fragment in string [] buffer and its length. The current position is
                // computed as the previous position stored in diff_pos[0] data item
                // incremented by the value of the starting position index.
                df(diff1, diff_pos[0] + index, strlen(diff1), 0);

                // Allocating diff2 buffer and copying the current
                // non-identical "overwrite" fragment from pattern [] buffer into diff1 buffer
                char* diff2 = new char[chunk_info[1].max_pos - index + 1];
                memset((void*)diff2, 0x00, chunk_info[1].max_pos - index + 1);
                memcpy_s((void*)diff2, chunk_info[1].max_pos - index,
                    (const void*)&pattern[index], chunk_info[1].max_pos - index);

                // Invoking the callback function df passing the parameters of
                // the non-identical fragment contents, the current position of the
                // fragment in pattern [] buffer and its length. The current position is
                // computed as the previous position stored in diff_pos[1] data item
                // incremented by the value of the starting position index.
                df(diff2, diff_pos[1] + index, strlen(diff2), 1);

                // Allocating temporary buffer temp_s to store the indented string [] buffer
                char* temp_s = new char[size1 + 1];
                memset((void*)temp_s, 0x00, size1 + 1);

                // Indenting the string [] buffers by copying the string []
                // buffers contents starting at the positions index + <buffer length>
                // to the position index
                memcpy_s((void*)temp_s, size1 + 1, (const void*)&string[index + \
                    strlen(diff1)], strlen(&string[index + strlen(diff1)]));

                // Allocating temporary buffer temp_p to store the indented pattern [] buffer
                char* temp_p = new char[size2 + 1];
                memset((void*)temp_p, 0x00, size2 + 1);

                // Indenting the pattern [] buffers by copying the pattern []
                // buffers contents starting at the positions index + <buffer length>
                // to the position index
                memcpy_s((void*)temp_p, size2 + 1, (const void*)&pattern[index + \
                    strlen(diff2)], strlen(&pattern[index + strlen(diff2)]));

                // Checking if the temp_s and temp_p buffers were initialized
                if (temp_s != NULL && temp_p != NULL)
                {
                    // If so, copying the temp_s [] buffer into the string [] buffer
                    strcpy_s(string, size1 + 1, temp_s);
                    // If so, copying the temp_p [] buffer into the pattern [] buffer
                    strcpy_s(pattern, size2 + 1, temp_p);

                    delete[] temp_s; temp_s = NULL;
                    delete[] temp_p; temp_p = NULL;
                }

                // Computing the value of the position of the non-identical
                // fragment found in the string [] buffer incrementing the value
                // of diff_pos[0] data item by the values of the fragment length
                // and its starting position.
                diff_pos[0] += strlen(diff1) + index;
                // Computing the value of the position of the non-identical
                // fragment found in the pattern [] buffer incrementing the value
                // of diff_pos[1] data item by the values of the fragment length
                // and its starting position.
                diff_pos[1] += strlen(diff2) + index;

                delete[] diff1; diff1 = NULL;
                delete[] diff2; diff2 = NULL;
            }

            delete[] chunk_info; chunk_info = NULL;
        }
    }
}

void df(const char* data, ullong pos, ullong len, unsigned nbuf)
{
    printf("\nfile: %s pos = %llu len = %llu\n", filenames[nbuf], pos, len);
    printf("=======================================================================\n");
    printf("%s\n", data);
}

int main(int argc, char* argv[])
{
    if ((argc < 3) || (*argv[1] == '\0') || (*argv[2] == '\0'))
    {
        printf("Usage: textdiff <inp_file1> <inp_file2>\n");
        return 0;
    }

    if (filenames == NULL)
    {
        filenames = new char*[2];
        memset((void*)filenames, 0x00, sizeof(char*) * 2);
    }

    int arg_i = 1;
    for (int fi = 0; fi < 2; fi++)
    {
        filenames[fi] = new char[266];
        strcpy_s(filenames[fi], 266, argv[arg_i++]);
    }

    char *string = NULL, *pattern = NULL;
    if (!loadfromfile(argv[1], string))
    {
        printf("Unable to read from file %s\n", "");
        return -1;
    }

    if (!loadfromfile(argv[2], pattern))
    {
        printf("Unable to read from file %s\n", "");
        return -1;
    }

    textdiff(string, pattern, df);

    return 0;
}

Points of Interest

The smart textdiff utility algorithm being discussed can be an alternative of the linux/unix diff utility for Windows platform.

History

30^th April, 2016 – First version of article published

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Arthur V. Ratz

Software Developer (Senior) EpsilonDev

Ukraine

I’m software developer, system analyst and network engineer, with over 20 years experience, graduated from L’viv State Polytechnic University and earned my computer science and information technology master’s degree in January 2004. My professional career began as a financial and accounting software developer in EpsilonDev company, located at L’viv, Ukraine. My favorite programming languages - C/C++, C#.NET, Java, ASP.NET, Node.js/JavaScript, PHP, Perl, Python, SQL, HTML5, etc. While developing applications, I basically use various of IDE’s and development tools, including Microsoft Visual Studio/Code, Eclipse IDE for Linux, IntelliJ/IDEA for writing code in Java. My professional interests basically include data processing and analysis algorithms, artificial intelligence and data mining, system analysis, modern high-performance computing (HPC), development of client-server web-applications using various of libraries, frameworks and tools. I’m also interested in cloud-computing, system security audit, IoT, networking architecture design, hardware engineering, technical writing, etc. Besides of software development, I also admire to write and compose technical articles, walkthroughs and reviews about the new IT- technological trends and industrial content. I published my first article at CodeProject in June 2015.

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.