Click here to Skip to main content
15,881,852 members
Articles / Programming Languages / C#
Tip/Trick

Identifying Code Similarities (Duplicate Code) Using Levenshtein Distance

Rate me:
Please Sign up or sign in to vote.
5.00/5 (7 votes)
25 Jan 2016CPOL4 min read 21.6K   277   6   4
Levenshtein distance can be an effective tool to identify code similarity (or duplicate code)

Introduction

Tools to detect duplicate code can add up to code quality effectively. In the context of C#; various tools and options are available to achieve this. Visual Studio’s (Ultimate or Premium) Code Clone Detection and Atomiq’s Code Similarity Finder are few notable options. This tip discusses how to build your own code similarity finder using the principle of Levenshtein distance. Between two strings, it is the measure of single character edits required to change one string to the other. Lesser the value, more similar are the two strings and vice versa. The tip also employs Roslyn, the .NET complier platform to parse the target code. The tool presented here builds a “similarity matrix” for a target C# class, which is essentially a square matrix, who’s each row is the Levenshtein distances between the codes of a method and all methods in the class. The diagonal of the matrix will have 0 entries as it would be distance between the same codes of the same method. The following table depicts the similarity matrix:

Code Similarity Matrix
 

Method 1

Method 2

…  

Method n

Method 1

0

10

 

2

Method 2

10

0

 

12

       

Method n

2

12

 

0

Here, entries like [1,2]; [2,1], … [1,n];[n,1] essentially have the same values as they are representing the distance between the same pair of strings. This matrix gives a snapshot of the code similarity in a C# class with the entries of lesser values indicating code similarity or duplicate code. The tool discussed here writes this matrix into an Excel file for convenience.

Background

The aim of this tip is not to discuss the implementation of the Levenshtein distance algorithm, but to show how it can effectively be applied for finding code similarities. A sample implementation from the dotnetperls.com is picked and the code is re-used. It also not the intent of the tip to discuss the performance implications of a certain implementation of the algorithm.

Using the Code

The first requirement of the tool is to provide the ability to parse the code of the target class to find all methods. The MethodsAnalyzer class provides this infrastructure. The method Parse of this class takes the string of the code of the target class and uses the Roslyn’s Syntax tree to get all the methods. It builds two dictionaries: indexMap and methodsCache which indexes the method name and method code respectively. These dictionaries are used to build the Code Similarity matrix later.

C#
public void Parse(string code)
{
    //// parse the syntax tree
    var tree = CSharpSyntaxTree.ParseText(code);
    var syntaxRoot = tree.GetRoot();

    //// get all the methods in the syntax tree
    var methods = syntaxRoot.DescendantNodes().OfType<MethodDeclarationSyntax>();
    var index = 0;

    //// loop through all the method and store the method name and code
    foreach (var method in methods)
    {
        var methodName = method.Identifier.ToString();
        var body = method.Body.ToString();

        //// name
        indexMap[index] = methodName;

        //// code
        methodsCache[index++] = body;
    }
}    

The Consolidate method loops through the methodsCache to build Code Similarity matrix similarityMatrix. It uses the LevenshteinDistance class’s Compute method which takes the strings of the method bodies of two methods and computes the distance. The Consolidate method does this for each method versus all methods in the class to build the Code Similarity matrix.

C#
public void Consolidate()
{
    //// compute the LevenshteinDistance(s) a method's code vs all other methods'
    similarityMatrix = new int[methodsCache.Count, methodsCache.Count];
    foreach (var index in methodsCache.Keys)
    {
        var referenceCode = methodsCache[index];
        var allCodes = methodsCache.Values.ToList();
        for (var j = 0; j < allCodes.Count; j++)
        {
            similarityMatrix[index, j] = LevenshteinDistance.Compute(referenceCode, allCodes[j]);
        }
    }
}

The Driver project uses the CodeAnalyzer project to build the Code Similarity matrix for a target class and then writes it to an Excel file. The sample code is written in the Program class of the Driver project. The Main method first calls the GenerateSimilarityMatrix method which in turn uses the MethodsAnalyzer’s Parse and Consolidate methods to build the Code Similarity matrix. Then the Main method calls the GenerateReport method to write the matrix into an Excel file. The details of the GenerateSimilarityMatrix and GenerateReport are omitted in the discussion and left in the source code with detailed documentation for the reader.

C#
static void Main(string[] args)
{
    var program = new Program();

    //// create an instance of the MethodsAnalyzer
    var methodsAnalyzer = new MethodsAnalyzer();

    //// generate the similarity matrix
    program.GenerateSimilarityMatrix(args[0], methodsAnalyzer);

    //// generate the report
    program.GenerateReport(args[1], methodsAnalyzer);

    Console.WriteLine("... Analysis completed. Press any key to exit");
    Console.ReadKey();
}

The source code also includes an example target class ContactsProcessor for which the analysis is performed. The target class is hosted in a separated project called Target. Following is a sample out of the tool run:

  Add Remove GetAddresses SendMails SendMail
Add 0 35 177 160 215
Remove 35 0 176 154 211
GetAddresses 177 176 0 69 122
SendMails 160 154 69 0 144
SendMail 215 211 122 144 0

The analysis shows that methods Add; Remove and GetAddresses; SendMails have relatively low values of the computed Levenshtein distance and thus are indicative of having similar code.

Points of Interest

The tool just handles a single class. The future work can be to extend it for a project and then for the whole solution. Also, the computed distances can be scaled relative to the maximum in the matrix, which would provide better readability and quick identification of similarity between methods if the matrix is sufficiently large.

History

  • 25th January, 2016: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Architect Cognizant Technology Solutions US Corp.
United States United States
This member has not yet provided a Biography. Assume it's interesting and varied, and probably something to do with programming.

Comments and Discussions

 
PraiseInteresting. Pin
Leigh Pointer27-Jan-16 22:20
professionalLeigh Pointer27-Jan-16 22:20 
QuestionJaro-Winkler distance Pin
Jesús Utrera26-Jan-16 1:54
Jesús Utrera26-Jan-16 1:54 
Hi.
I think you can use another methods like Jaro-Winkler distance that's independent of the length of the strings. I have an article (Using Logistic Regression To Compare Strings) of using logistic regression to train the machine using both Levenshtein and Jaro-Winkler that can be useful to you.
GeneralPlease post the code Pin
John Brett25-Jan-16 20:25
John Brett25-Jan-16 20:25 
GeneralRe: Please post the code Pin
Prajnan Das26-Jan-16 5:19
professionalPrajnan Das26-Jan-16 5:19 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.