Assuming that you are only interested in similar words, then you can modify
LevenshteinDistance
so that you will stop comparing 2 strings as soon as the maximum similarity level has been reached.
public static int? LevenshteinDistance(this string s, string t,
int similarityLimit)
{
if (cost > similarityLevel)
{
return null;
}
}
And in calling code:
int limit = 10;
int? similarity = LevenshteinDistance(row1, row2, limit);
bool isSimilar = similarity.HasValue && similarity.Value <= limit;
Assuming that a lot of pairs have a cost much higher than the limit, it will greatly reduce the execution time.
Then since you are only interested to know if
NAME
ar similar when
NUM
are equal, then you might be able to reverse those 2 tests since the second one (a string comparison) should be much faster. Thus you can eliminate most calls to
LevenshteinDistance
assuming that
NUM
are not all equals.
Thus your inner loop might look like this:
// First do the "fast" check
if (dt1.Rows[i]["NUM"].ToString() == dt1.Rows[i]["NUM"].ToString())
{
// Then do the "slow" check...
if (dt1.Rows[i]["Name"].ToString().LevenshteinDistance(
dt1.Rows[j]["Name"].ToString()) <= 10)
{
Logging.Write(...);
}
}
Then assuming that you have more than a few percents of
NUM
that are equals, you can sort items by their
NUM
. One way to do it would be to sort the data returned by the reader using a query like:
SELECT * FROM myTable ORDER BY NUM
If the database is indexed, it might be the faster option. Otherwise, it might be done by using something like
Dictionary<string, List<int>>
where each distinct
NUM
are added to the dictionary and the list is filled with the indexes in the table.
var numDictionary = new Dictionary<string, List<int>>;
for (int i = 0; i < dt1.Rows.Count; i++)
{
var rowNum = dt1.Rows[i]["NUM"].ToString();
List<int> listIndexes;
if (!numDictionary.TryGetValue(rowNum, out listIndexes))
{
list = new List<int>;
numDictionary.Add(rowNum, list);
}
list.Add(i);
}
Then you can compute the
LevenshteinDistance
for each list in the dictionary using nested loop as you have done in the original algorithm.
Note that the internal loop can stop at the index of the external loop. That is, stop just before
j
is equal to
i
.
If the order does not matters, then you can write items in the log directly. If not, fill some structures with the information you wil need to output and sort that structure before outputting the results.