Identifying Code Similarities (Duplicate Code) Using Levenshtein Distance

Prajnan Das

Rate me:

5.00/5 (7 votes)

25 Jan 2016CPOL4 min read

21.6K

277

Levenshtein distance can be an effective tool to identify code similarity (or duplicate code)

Download source - 3.2 MB

Introduction

Tools to detect duplicate code can add up to code quality effectively. In the context of C#; various tools and options are available to achieve this. Visual Studio’s (Ultimate or Premium) Code Clone Detection and Atomiq’s Code Similarity Finder are few notable options. This tip discusses how to build your own code similarity finder using the principle of Levenshtein distance. Between two strings, it is the measure of single character edits required to change one string to the other. Lesser the value, more similar are the two strings and vice versa. The tip also employs Roslyn, the .NET complier platform to parse the target code. The tool presented here builds a “similarity matrix” for a target C# class, which is essentially a square matrix, who’s each row is the Levenshtein distances between the codes of a method and all methods in the class. The diagonal of the matrix will have 0 entries as it would be distance between the same codes of the same method. The following table depicts the similarity matrix:

Code Similarity Matrix
	Method 1	Method 2	…	Method n
Method 1	0	10		2
Method 2	10	0		12
…
Method n	2	12		0

Here, entries like [1,2]; [2,1], … [1,n];[n,1] essentially have the same values as they are representing the distance between the same pair of strings. This matrix gives a snapshot of the code similarity in a C# class with the entries of lesser values indicating code similarity or duplicate code. The tool discussed here writes this matrix into an Excel file for convenience.

Background

The aim of this tip is not to discuss the implementation of the Levenshtein distance algorithm, but to show how it can effectively be applied for finding code similarities. A sample implementation from the dotnetperls.com is picked and the code is re-used. It also not the intent of the tip to discuss the performance implications of a certain implementation of the algorithm.

Using the Code

The first requirement of the tool is to provide the ability to parse the code of the target class to find all methods. The MethodsAnalyzer class provides this infrastructure. The method Parse of this class takes the string of the code of the target class and uses the Roslyn’s Syntax tree to get all the methods. It builds two dictionaries: indexMap and methodsCache which indexes the method name and method code respectively. These dictionaries are used to build the Code Similarity matrix later.

public void Parse(string code)
{
    //// parse the syntax tree
    var tree = CSharpSyntaxTree.ParseText(code);
    var syntaxRoot = tree.GetRoot();

    //// get all the methods in the syntax tree
    var methods = syntaxRoot.DescendantNodes().OfType<MethodDeclarationSyntax>();
    var index = 0;

    //// loop through all the method and store the method name and code
    foreach (var method in methods)
    {
        var methodName = method.Identifier.ToString();
        var body = method.Body.ToString();

        //// name
        indexMap[index] = methodName;

        //// code
        methodsCache[index++] = body;
    }
}

The Consolidate method loops through the methodsCache to build Code Similarity matrix similarityMatrix. It uses the LevenshteinDistance class’s Compute method which takes the strings of the method bodies of two methods and computes the distance. The Consolidate method does this for each method versus all methods in the class to build the Code Similarity matrix.

public void Consolidate()
{
    //// compute the LevenshteinDistance(s) a method's code vs all other methods'
    similarityMatrix = new int[methodsCache.Count, methodsCache.Count];
    foreach (var index in methodsCache.Keys)
    {
        var referenceCode = methodsCache[index];
        var allCodes = methodsCache.Values.ToList();
        for (var j = 0; j < allCodes.Count; j++)
        {
            similarityMatrix[index, j] = LevenshteinDistance.Compute(referenceCode, allCodes[j]);
        }
    }
}

The Driver project uses the CodeAnalyzer project to build the Code Similarity matrix for a target class and then writes it to an Excel file. The sample code is written in the Program class of the Driver project. The Main method first calls the GenerateSimilarityMatrix method which in turn uses the MethodsAnalyzer’s Parse and Consolidate methods to build the Code Similarity matrix. Then the Main method calls the GenerateReport method to write the matrix into an Excel file. The details of the GenerateSimilarityMatrix and GenerateReport are omitted in the discussion and left in the source code with detailed documentation for the reader.

static void Main(string[] args)
{
    var program = new Program();

    //// create an instance of the MethodsAnalyzer
    var methodsAnalyzer = new MethodsAnalyzer();

    //// generate the similarity matrix
    program.GenerateSimilarityMatrix(args[0], methodsAnalyzer);

    //// generate the report
    program.GenerateReport(args[1], methodsAnalyzer);

    Console.WriteLine("... Analysis completed. Press any key to exit");
    Console.ReadKey();
}

The source code also includes an example target class ContactsProcessor for which the analysis is performed. The target class is hosted in a separated project called Target. Following is a sample out of the tool run:

	Add	Remove	GetAddresses	SendMails	SendMail
`Add`	0	35	177	160	215
`Remove`	35	0	176	154	211
`GetAddresses`	177	176	0	69	122
`SendMails`	160	154	69	0	144
`SendMail`	215	211	122	144	0

The analysis shows that methods Add; Remove and GetAddresses; SendMails have relatively low values of the computed Levenshtein distance and thus are indicative of having similar code.

Points of Interest

The tool just handles a single class. The future work can be to extend it for a project and then for the whole solution. Also, the computed distances can be scaled relative to the maximum in the matrix, which would provide better readability and quick identification of similarity between methods if the matrix is sufficiently large.

History

25^th January, 2016: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Written By

Prajnan Das

Architect Cognizant Technology Solutions US Corp.

United States

This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.