A Fast Diff Algorithm in Visual Basic .NET

akjohnston

4.24/5 (12 votes)

Feb 9, 2005

13 min read

149348

1345

An implementation of a Diff algorithm in VB.NET, with various techniques to improve performance, while keeping the code simple.

Download source - 43.5 Kb

Sample Image

Introduction

This article describes a Visual Basic .NET implementation of a diff algorithm, used to compare two text streams or similar lists of objects, and produce an "edit script" which explains how to get from the old to the current by deleting, inserting, or copying lines. Instead of just re-implementing the classic C algorithm, this version exploits the power of VB.NET to keep the code simple, and uses a couple of novel tricks to reduce the amount of processing required, improving the execution speed.

Background

A while ago, I was looking for an algorithm for a "diff" type program, and found: A Generic - Reusable Diff Algorithm in C#, by Aprenot. I wanted to understand the underlying algorithm, as most such programs are written in C or its variants, and I wanted to use VB. Aprenot's article is a model of clarity, focused on what the code was doing rather than how, and I recommend it to anyone trying to understand what diff algorithms actually do.

In a follow-up article: A Generic, Reusable Diff Algorithm in C# - II, Michael Potter highlighted some concerns about Aprenot's algorithm, and in particular, its need to build a table of the results of comparing all pairs of previous and current items, which could be very costly in both processing time and memory. I took Michael's concerns to heart, but developed a different approach, explained here.

My version of this algorithm calculates a hash code for each line of text. At each stage, the algorithm looks for candidate matches using the hash tables, checks these for an actual match, and then finds the length of the diagonal sequence. Each diagonal sequence is held in a temporary collection so that the longest can then be easily identified. This way, the number of actual matches performed is vastly reduced, and it's never necessary to build a complete table of matches.

As I was only concerned about comparing text files, my implementation is less "generic" than the others. However, it would be a relatively simple task to create a generic interface similar to my "text line" class, and change the "text line" class to Implement this with an IComparable-type interface for comparisons.

Overview of the Algorithm

For a very clear description of how the core diff algorithm works, see Aprenot's previous article. This is the "potted" version...

The following figure shows two arrays of objects, in this case, characters, but they could be complete text lines or other objects. They are laid out along the axes of a matrix, with the "previous" (original) text on the top and the "current" (new) text on the side. The black squares indicate cases where "previous" and "current" items are the same.

We want to generate an "edit script" which tells us what actions to perform on the "previous" text to transform it to the "current" text. This will effectively be a set of instructions on how to navigate from the top left box to the bottom right box. At each stage, we can do one of three things:

Delete one or more items from the previous text. As we do so, we can move our "working position" the appropriate number of positions to the right.
Insert one or more items of the current text. As we do so, we can move our "working position" the appropriate number of positions down.
Copy one or more items of the previous text across unchanged. As we do so, we navigate diagonally down and to the right, following a diagonal line of black squares.

The most efficient (i.e., shortest) edit script will use diagonal (copy) sequences as much as possible. The way the algorithm works is that from a given starting position, we scan right and down, looking for the beginning of a diagonal sequence. If we find one or more, we move to its starting position (by deleting existing items or inserting current ones as appropriate), and then issue an appropriate "copy" command. If there's no diagonal sequence to the right or below our current position, we just delete the previous item for our column, insert the current one for our row, move diagonally down one place, and start again.

When we reach the right hand or bottom edge, we do one final command to get to the bottom right position.

Performance Improvements in the VB Version

Aprenot's version is a very straightforward implementation of this algorithm. Having built two arrays of objects, it then builds a large in-memory table representing all the comparisons, and navigates it to generate the edit script.

Michael Potter correctly recognised that this would be memory- and processor- intensive, and refined Aprenot's implementation using a "sliding window" technique which reduced the number of comparisons and the space required for the comparison table.

However, I felt that this still required too many comparisons, and came up with a different strategy:

As each item is loaded into the array, we derive a "hash code", a number representing in some way (the details are not important) the contents of the item. The only requirement for the hash code is that it should have the same value for identical items, without too many "false positives", and should be cheap to calculate.
We create a "hash table" which allows us to look up a given hash value and find pointers to all the matching "previous" and "current" items.
From a given starting position in the matrix, we look up the hash code of the current column, and use the hash table to find any potential matching rows in the same column (i.e., possible black squares at or below our position). We check any potential matches by doing actual comparisons, and when these succeed, we follow the path diagonally down doing real comparisons until the line of matches ends. We record the position and length of each diagonal sequence in a temporary table.
From the same starting position, we repeat this process to find any match sequences starting on the same row.
We scan the list of matches we've built to find the longest, and follow it. If there are no matches, we move diagonally down.
We throw away the table of matches and go again.

The hash table means we never do any initial comparison which doesn't have a good chance of succeeding. Following diagonal sequences from good starting points means we are really only "walking the black lines". In real-world cases where large text files have some long identical sections, the number of comparisons used in this strategy is a tiny fraction of the possible total.

My code makes substantial use of VB collections. They are a convenient way of building indexed arrays of objects, and can be destroyed just by removing all references, which reduces the amount of code. Also, I hold each version of the text in a CText class, so to do repeated comparisons on successive versions of the same text, I can simply point the "previous" text to the "current" CText object from last time, create a new "current" CText class, and start again.

Following the code

The example is compiled as a simple EXE file, with a simple test harness form, but it's very easy to convert to an ActiveX DLL - just get rid of the forms, change the project type, and make sure CTextProcessor is public. The following sections describe the forms and classes one at a time in order of execution.

frmDiffTest

This is a simple form which shows the current version of some text, the previous version, the changes (edit script), and the "reconstituted" text which you get if you apply the edit script to the previous version - this should, of course, be identical to the current version if things are working properly.

The code is almost trivial - an object of type CTextProcessor is instantiated when the form is opened, and the cmdProcess_Click method calls its public properties and methods to set the "previous" text, get the edit script required to transform to the "current" text, and combine this with the previous text to check things are working.

CTextProcessor

This is the controller / facade class. It exposes public variables for the two versions of the text and the difference script. These map to three private CText objects which manage and parse the text. A neater version of the code would use property procedures for this, but the existing version is adequate for most purposes. A private collection called Matches holds details of matches found when we're looking for diagonal paths (see above). A set of string constants controls the format of the edit script, and should be fairly self-explanatory.

The class has two public members:

ProcessText is used to generate the edit script. It first sets the "previous" text and CText class to the old "current" values, then creates a new CText object as moCurrentText, and adds the supplied new text to it, which causes the CText object to split the text into lines and build the hash table. The edit script objects are initialised, and the code checks for the trivial case that current and previous text are identical. If not, control passes to the private function ProcessChanges.
ReconstituteText applies a supplied edit script to the "previous" text, which may optionally be overwritten using a second argument. After some initialisation, including splitting the script into lines using a CText object, it first checks the first line of the script for two trivial cases. If these are not found, it then applies each remaining line of the script in turn, moving lines from moPreviousText to a temporary CText object, copying, skipping, or inserting new lines as instructed. Finally, it uses CText.GetFullText to return the reconstituted text.

ProcessChanges() does most of the real work of implementing the algorithm. Given the data structures set up by the other classes, it's a fairly straightforward implementation of the algorithm (see Overview of the Algorithm above):

After checking for the trivial case where there's no previous text (so the change script is just the new text), it then enters a loop which exits when the last lines of both the current and previous text are reached.
The loop first checks for two end cases, where we've got some current text left but no previous text, or vice-versa. These are handled by copying or deleting sufficient lines respectively.
Assuming we are still within both sets of text, we then scan the current row and column for matches. The Matches collection is re-initialised, and the scan is done by two inner loops, one using the current line of the "current" text, and one the current line of the "previous" text. The CText.GetNextLine() method is used to find candidate matches using the hash table. For each candidate match, the code then calls ProcessMatch which walks the diagonal line, recording the length over which the two sets of text strings are actually identical, and adds the match to the Matches collection.
A call to FindBestMatch simply scans the Matches collection and returns the best (longest) match.
Given the best match, the code then navigates to the correct starting point by deleting lines from the previous text or inserting lines from the current text as appropriate, and then adds an instruction to the difference script to copy the appropriate number of previous lines corresponding to the match.
The difference script is built up by adding lines to the moDifferenceScript CText object, and when complete, this is converted to text in the DifferenceScript public variable, using the GetFullText method.

The code of the remaining two private methods, ProcessMatch and FindBestMatch, should be self-explanatory. If you want to make the comparison method more generic, the main changes would be to ProcessMatch.

CText and CTextLine

CText represents input text as a collection of CTextLine objects.

CTextLine is fairly simple, exposing the line number and hash value for the line text as public variables, and the line text itself via a pair of property procedures. The only significant processing is that when the line text is set, it calls a private routine (CalculateHash) to calculate the hash value. The details of CalculateHash are not important, as long as it always produces the same value for the same text, there are a reasonable number of different output values, and it's quick/cheap in processing cycles. I decided to use a string hash value (although a numeric value would work just as well) derived from the string's length and the first and last significant characters (ignoring any NL or CR at the end).

CText exposes the collection of text lines, and four methods:

AddText optionally resets the internal variables, and then adds the specified text. It does this by breaking the text into lines at each carriage return, and then calling AddLine.
AddLine creates a new text line object, sets its text (which will also calculate its hash value), and adds it to the TextLines collection. It then sets the line number for the line, and adds the line to a private hash table collection (a CHashTable object).
GetFullText simply appends the text of all lines in TextLines together in order, to return the text as a single string.
GetNextLine calls the GetNextLine method of the private CHashTable object, returning the next text line which matches a target hash value (if there is one).

CMatch

This is a very simple "data only" class which holds the detail of a match (diagonal sequence) in terms of its starting position in the current and previous text, and its length.

CHashTable

This implements the "hash table" for each CText object. It is actually a "collection of collections of CTextLine objects" keyed first by the hash value, then by line number.

When CText calls the AddTextLine, the latter first checks to see if an item exists in the parent collection (moHashTable) with a key equal to the new line's hash value. If so, then that item is actually a collection of CTextLine objects which will be manipulated. If not, a new collection is created and added to the parent collection. The code then adds the incoming CTextLine object to the collection, with a key equal to the line number, as a string. Since lines are added in the order they occur in the original text, each collection of lines will automatically be in line number order.

GetNextLine takes a hash value and starting line number as input values. It looks in moHashTable to find the collection of matching lines (if any), and then scans it to find the first entry with a line number greater than or equal to the starting line number. If a match is found, it is returned as a line number to the calling code.

Both routines make use of an internal routine DoesItemExist which simplifies the task of finding matches in a collection without unhandled errors.

Error Handling

Most code just passes errors back up the stack. The top-level routine(s) have a Catch section which calls FormErrorHandler which provides for error logging and then displays error details using frmErrorDetails. frmErrorDetails and FormErrorHandler are not required if the code is re-hosted inside a component which passes errors back to its caller.

History

Original version - 4^th Feb 2005.
Please note that this was developed very quickly for a proof of concept prototype. It is certain there will be a few bugs, but hopefully not too many!

Finally

Enjoy, and don't hesitate to feedback any comments.