Click here to Skip to main content
14,971,215 members
Articles / Artificial Intelligence / Machine Learning
Article
Posted 27 Aug 2016

Stats

28.7K views
3.3K downloads
13 bookmarked

Finding groups in data with C# - Agglomerative Clustering

Rate me:
Please Sign up or sign in to vote.
4.96/5 (17 votes)
28 Aug 2016CPOL10 min read
This article presents how to implement a well-known agglomerative clustering algorithm in C#.

Introduction

Finding a data clustering in a data set is a challenging task since algorithms usually depend on the adopted inter-cluster distance as well as the employed definition of cluster diameter.

This article describes how to use and implement a well-known agglomerative clustering algorithm named AGNES using the C# programming language. Such algorithm and all the code presented in this article was incorporated in this project: https://github.com/rodrigocostacamargos/SharpCluster.NET

 

Image 1
Figure 1. AggloCluster

AggloCluster (Figure 1) is a Windows Form application that enables users to execute clustering algorithms provided by the SharpCluster.NET library. We will be using this application in order to execute the agglomerative clusering algorithm described in this article. Its source code can be found in: https://github.com/rodrigocostacamargos/AggloCluster.NET

IMPORTANT: Only the code (and its dependencies) described in this article is avaliable to download (SharpCluster.zip). If you want to explore all AggloCluster functionalities you have to download it from AggloCluster/Cluster.NET project on Github.

Background

Machine Learning (ML) is a subfield of Artificial Intelligence (AI) focused mainly on investigations and proposals of new formalisms and computational algorithms, aimed at proving theoretical support as well as implementing automatic learning by computers. Over the past few decades, many ideas on how to enable automatic learning have been presented, discussed and implemented. Among the many ways ML can be implemented, the so called clustering algorithms are a particular group of (unsupervised) algorithms which aims at organizing sets of data points into groups, having data exploratory processes in sight. In the literature one can find numerous works proposing new clustering algorithms as well as different clustering taxonomies. Although taxonomies aim at organizing such algorithms into categories, due to the fact that usually they adopt different criteria for grouping them, they are not necessarily, compatible to each other.

Efficient clustering techniques are considered a challenge, mainly due to the fact that there is no external supervision, which implies knowing nothing about the internal structure of point sets (such as spatial distribution, volume, density, geometric shapes of clusters, etc.). In such a scenery automatic learning becomes an exploratory task, aiming at identifying which are the groups of data points that are statistically separable (or not), which are the most obvious clusters and how they relate to what is aimed at to discriminate, in an attempt to expose the underlying structure of the data, based only on their descriptions, generally given as a vector of attribute values.

In a simplistic way, clustering algorithms try to find groups of data points such that the data points in a group will be similar (or related) to one another and different from (or unrelated to) the data points in other groups. The similarity concept adopted by the clustering algorithm described in this article is a proximity measure defined by the Euclidian distance between two data points (Pi and Pj).

Image 2

So the main goal of a clustering algorithm is to minimize intra-cluster distances and maximize inter-cluster distances, as shown in Figure 2.  

Image 3
Figure 2. Distance between data points in the same cluster and distance between data points in different clusters. Image extracted from Tan, Steinbach and Kumar slides.

 

In order to get a better intuition what clustering really means and what a clustering algorithm does, the section Using the Code explains how to use AggloCluster and SharpCluster.NET library to perform a clustering task. You will be expected to get some clustering results, as shown in figures 3 and 4, while understanding the code behind it. 
 

Image 4
Figure 3 - Clustering result on Aggregation dataset (Average Linkage UPGMA clustering strategy).

Image 5
Figure 4 - Clustering result on Smile dataset (Single Linkage clustering strategy).

 

Figures 3 and 4 show the clustering results on data sets Aggregation.arff and Smile.arff, respectively. Those data set files are available in the AggloCluster.zip file. The clustering strategies (Single Linkage, Complete Linkage and Average Linkage) used by the agglomerative algorithm to produce that results is described in the next section.

 

The AGNES (AGglomerative NESting) Algorithm

For a given a data set containing N data points to be clustered, agglomerative hierarchical clustering algorithms usually start with N clusters (each single data point is a cluster of its own); the algorithm goes on by merging two individual clusters into a larger cluster, until a single cluster, containing all the N data points, is obtained. Obviously, the algorithm can have another stopping criteria, such as that of ending when a clustering containing a user-defined number of clusters (k) is obtained.

Figure 5 presents a high level pseudocode of the AGNES algorithm which, at each iteration, chooses two clusters to be merged, based on the shortest Euclidean distance between the clusters formed so far. The many clustering agglomerative algorithms found in the literature can be organized taking into account the way the inter-cluster distance is defined i.e., what definition is used to compute the shortest distance between all pairs of clusters. Among the various ways, three are particularly popular and are defined next. Given two clusters X and Y let d(x,y) denotes the distance between two data points.

(1) Single Linkage, the distance between two clusters X and Y is the shortest distance between two data points, x belongs to X and y belongs to Y, formally represented by Eq. (1).

Image 6

(2) Complete Linkage, the distance between two clusters X and Y is the farthest distance between two data points, x belongs to X and y belongs to Y, formally represented by Eq. (2).

Image 7

(3) Average Linkage, where the distance between two clusters is the mean distance between data points of each cluster in one cluster to every point in the other cluster, formally represented by Eq. (3).

Image 8

 

Image 9
Figure 5. A customized version of AGNES pseudocode

 

To find the closest pair of clusters (i.e., min distance), AGNES looks for the lowest distance value in the dissimilarity matrix. The dissimilarity matrix (DM) is aways updated because the algorithm, at each iteration, merges the closest pair of clusters into a new cluster and such cluster is a new entry of the matrix. More formally, at each level, t, when two clusters are merged into one, the size of the dissimilarity matrix DM becomes (N - t) x (N - t). DMt follows from DMt-1 by (a) deleting the two rows and columns that correspond to the merged clusters and (b) adding a new row and a new column that contain the distance between newly formed clusters and the old clusters.

The matrix update procedure described before may sound confusing, but understand the code that implement it is not that hard as you will see in the next section. So to clear your mind about how a dissimilarity matrix look like, see tables 1 and 2.

Image 10
Table 1. Data set matrix

Image 11

In order to get the values shown in Table 2, AGNES calculates the Euclidian distance for all data points presented in Table 1.

Image 12
Table 2 - Dissimilarity matrix

Looking at the dissimilarity matrix, you will see that the lowest distance value is 0.41 (not considering the diagonals elements) and AGNES  would start merging data points 2 and 4, forming a new cluster.    

 

Using the code

In this section, we will be using AggloCluster to execute the AGNES algorithm implementation that is provided by the SharpCluster.NET library. So first download SharpCluster.zip, extract the files and compile the source code (it will generate new SharpCluster.dll and SharpCluster.pdb files). Second download AggloCluster.zip, extract the files and replace SharpCluster.dll and SharpCluster.pdb with the new ones. In sequence, execute AggloCluster.exe and do the following steps:

  1. On the Preprocess tab, click in Open file and select any of the dataset files that came with AggloCluster.zip file;
  2. Next, click on the Validation tab and then click on the AGNES tab;  
  3. In sequence, select one of the four clustering strategies from the drop-down list;
  4. Enter the number of clusters (COP.arff has 3 clusters, Aggregation.arff has 7 clusters and  Simle.arff has 4 clusters); 
  5. Finally, click the Start clustering button. Clicking the Start clustering button will execute the ExecuteClustering method. (In case you want debugging, open SharpCluster.csproj in Visual Studio and attach the AggloCluster process to it).  

As mentioned before, the first method to be called by the AggloCluster is the ExecuteClustering located in the Agnes class of the SharpCluster.csproj. 

C#
public Clusters ExecuteClustering(ClusterDistance.Strategy strategy, int k)
 {

     // Step 1
     // build a clustering only with singleton clusters
     this._BuildSingletonCluster();

     // Step2
     //build the dissimilarity matrix
     this._BuildDissimilarityMatrixParallel();

     // Step 3
     // build the hierarchical clustering
     this.BuildHierarchicalClustering(_clusters.Count(), strategy, k);

     return _clusters; //represents the clustering data structure
 }

After creating the initial clusteing formed by singleton clusters (Step 1), the dissimilarity matrix, which contains all the distances between all pair of clusters, is built (Step 2) by performing all the computation needed in parallel. Such computation could be performed sequencially; however, I found feasible to implemented it concurrently in order to improve the overall algorithm's performance.   

C#
private void _BuildDissimilarityMatrixParallel()
{
    double distanceBetweenTwoClusters;
    _dissimilarityMatrix = new DissimilarityMatrix();

    Parallel.ForEach(_ClusterPairCollection(), clusterPair =>
    {
        distanceBetweenTwoClusters = ClusterDistance.ComputeDistance(clusterPair.Cluster1, clusterPair.Cluster2);
        _dissimilarityMatrix.AddClusterPairAndDistance(clusterPair, distanceBetweenTwoClusters);
    });
}

The method _ClusterPairCollection just build the the collection of cluster pair.  The pair of clusters and the distance are the entries of the dissimilarity matrix.

C#
private IEnumerable<ClusterPair> _ClusterPairCollection()
{
    ClusterPair clusterPair;

    for (int i = 0; i < _clusters.Count(); i++)
    {
        for (int j = i + 1; j < _clusters.Count(); j++)
        {
            clusterPair = new ClusterPair();
            clusterPair.Cluster1 = _clusters.GetCluster(i);
            clusterPair.Cluster2 = _clusters.GetCluster(j);

            yield return clusterPair;
        }
    }

}

Before we dive into the main method of the Agnes class it is important to understand how some methods of the DissimilarityMatrix class works. First lets see the method GetClusterPairDistance

C#
 public double GetClusterPairDistance(ClusterPair clusterPair)
 {
    double clusterPairDistance = Double.MaxValue;
    if (_distanceMatrix.ContainsKey(clusterPair))
            clusterPairDistance = _distanceMatrix[clusterPair];
        else
            clusterPairDistance = _distanceMatrix[new ClusterPair(clusterPair.Cluster2, clusterPair.Cluster1)];

        return clusterPairDistance;
}

The _distanceMatrix is a ConcurrentDictionary and ClusterPair is its key. ClusterPair is formed by two Cluster objects (cluster1 and cluster2) and each Cluster object has it own id. So ClusterPair has two ids, therefore the dictionary has two keys. One of the issues here is that ClusterPair can be formed by cluster1.id = 5 and cluster2.id = 6 OR cluster1.id = 6 and cluster2.id = 5. That why we have the if else statement. The other issue is how to implement a dictionary with two keys? The answer that I found wat to override Equals and GetHashCode methods.

C#
public class EqualityComparer : IEqualityComparer<ClusterPair>
  {

      public bool Equals(ClusterPair x, ClusterPair y)
      {
          return x._cluster1.Id == y._cluster1.Id && x._cluster2.Id == y._cluster2.Id;
      }

      public int GetHashCode(ClusterPair x)
      {
          return x._cluster1.Id ^ x._cluster2.Id;
      }
  }

 

The GetClosestPair method of the DissimilarityMatrix class is pretty straightforward code, it just look for the lowest distance value in the matrix. Instead of copying the dictionary values to a List, sorting it and then get the first value or do something like that I prefered using the old fashion way to find a min value in an unsorted array. I have to be honest with you that my first line of this code was a one line lambda expression ignoring its O(n log n) operation (average case). Of course there is a better way to do this like to store the min value when adding/deleting the distance value so the min value could be found in constant time O(1), but I just didn't.

C#
public ClusterPair GetClosestClusterPair()
{
    double minDistance = double.MaxValue;
    ClusterPair closestClusterPair = new ClusterPair();

    foreach (var item in _distanceMatrix)
    {
        if(item.Value < minDistance)
        {
            minDistance = item.Value;
            closestClusterPair = item.Key;
        }
    }

    return closestClusterPair;
}

 

Now lets have a look at the ComputeDistance method of the ClusterDistance class. The trick part of this method is related to the Average Linkage UPGMA distance. We have to keep track of the total quantity of data points in a cluster (diving into the entire cluster hierarchy) and I had to implement a recursive method to do this (there is another opportunity to improve performance here).

C#
public static double ComputeDistance(Cluster cluster1, Cluster cluster2, DissimilarityMatrix dissimilarityMatrix, Strategy strategy)
{
    double distance1, distance2, distance = 0;
    //get the distance between cluster1 and subcluster0 of the cluster2
    distance1 = dissimilarityMatrix.GetClusterPairDistance(new ClusterPair(cluster1, cluster2.GetSubCluster(0)));
    //get the distance between cluster1 and subcluster1 of the cluster cluster2
    distance2 = dissimilarityMatrix.GetClusterPairDistance(new ClusterPair(cluster1, cluster2.GetSubCluster(1)));

    switch (strategy)
    {
        case Strategy.SingleLinkage: distance = _MinValue(distance1, distance2); break;
        case Strategy.CompleteLinkage: distance = _MaxValue(distance1, distance2); break;
        case Strategy.AverageLinkageWPGMA: distance = (distance1 + distance2) / 2; break;
        case Strategy.AverageLinkageUPGMA:
            distance = ((cluster2.GetSubCluster(0).TotalQuantityOfPatterns * distance1) / cluster2.TotalQuantityOfPatterns) + ((cluster2.GetSubCluster(1).TotalQuantityOfPatterns * distance2) / cluster2.TotalQuantityOfPatterns);
            break;
        default: break;
    }

    return distance;

}

 

This next method is responsible for updating the dissimilarity matrix as described at the end of the previous section. It is important to make sure that old distance values as well as old clusters (i.e., merged clusters) is removed.

C#
private void _UpdateDissimilarityMatrix(Cluster newCluster, ClusterDistance.Strategy strategie)
{
    double distanceBetweenClusters;
    for (int i = 0; i < _clusters.Count(); i++)
    {
        // compute the distance between old clusters to the new cluster
        distanceBetweenClusters = ClusterDistance.ComputeDistance(_clusters.GetCluster(i), newCluster, _dissimilarityMatrix, strategie);
        // insert the new cluster's distance
        _dissimilarityMatrix.AddClusterPairAndDistance(new ClusterPair(newCluster, _clusters.GetCluster(i)), distanceBetweenClusters);
        //remove all old distance values of the old clusters (subclusters of the newcluster)
        _dissimilarityMatrix.RemoveClusterPair(new ClusterPair(newCluster.GetSubCluster(0), _clusters.GetCluster(i)));
        _dissimilarityMatrix.RemoveClusterPair(new ClusterPair(newCluster.GetSubCluster(1), _clusters.GetCluster(i)));
    }

    // finally, remove the distance of the old cluster pair
    _dissimilarityMatrix.RemoveClusterPair(new ClusterPair(newCluster.GetSubCluster(0), newCluster.GetSubCluster(1)));
}

 

Finally, the BuildHierarchicalClustering recursive method just call the methods discussed previously and stops when the number of clusters is equals to k (i.e., user-defined number of clusters).   

C#
private void BuildHierarchicalClustering(int indexNewCluster, ClusterDistance.Strategy strategy, int k)
{
    ClusterPair closestClusterPair = this._GetClosestClusterPairInDissimilarityMatrix();
    Cluster newCluster = new Cluster();
    newCluster.AddSubCluster(closestClusterPair.Cluster1);
    newCluster.AddSubCluster(closestClusterPair.Cluster2);
    newCluster.Id = indexNewCluster;
    newCluster.UpdateTotalQuantityOfPatterns();

    _clusters.RemoveClusterPair(closestClusterPair);
    _UpdateDissimilarityMatrix(newCluster, strategy);
    _clusters.AddCluster(newCluster);
    closestClusterPair = null;

    if (_clusters.Count() > k)
        this.BuildHierarchicalClustering(indexNewCluster + 1, strategy, k);

}

 

Actually, before AggloCluster output the clustering results as displayed in figures 3 and 4, one more step need to be done; it is necessary to transform the hierarchical clustering structure (AGNES is a hierarchical clustering algorithm) into flat clusters with k partitions. 

C#
public Cluster[] BuildFlatClustersFromHierarchicalClustering(Clusters clusters, int k)
{
    Cluster[] flatClusters = new Cluster[k];
    for (int i = 0; i < k; i++)
    {
        flatClusters[i] = new Cluster();
        flatClusters[i].Id = i;
        foreach (Pattern pattern in clusters.GetCluster(i).GetAllPatterns())
            flatClusters[i].AddPattern(pattern);
    }

    return flatClusters;
}

 

Points of Interest

Machine learning is a very interesting subject to learn, however, it can be difficult to understand if you don't have a strong background in math/statistics . So to overcome this, I found a way to understand many of the fancy math notation and concepts described in pattern recognition books by doing some code and translating into a language that I could understand. As a result I can say that I have a more intuitive idea on how some unsupervised machine learning algorithms works and I hope, after reading this article, you can have that same intuition too.

If you got interested in the SharpCluster.NET/AggloCluster project, please feel free to join me on Github. There is a lot work to do there from improving the algorithm's performance to build a more friendly UI.    

 

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

Share

About the Author

Rodrigo Costa Camargos
Software Developer (Senior)
Brazil Brazil
I started doing some code using VB/ASP back in 1996 and in 2001 I moved to the .Net environment. I am interested in backend development and everything with C#, algorithms and data structures.

Comments and Discussions

 
QuestionThank you for sharing ! Pin
Member 1413180518-May-21 20:11
MemberMember 1413180518-May-21 20:11 
QuestionExcellent work !. One question... Pin
alfonso palma moreno22-Jan-20 15:04
Memberalfonso palma moreno22-Jan-20 15:04 
AnswerRegrding AggloClustering implementaion Program Pin
Member 1388502124-Jun-18 8:06
MemberMember 1388502124-Jun-18 8:06 
GeneralRe: Regrding AggloClustering implementaion Program Pin
Rodrigo Costa Camargos20-Aug-18 11:56
MemberRodrigo Costa Camargos20-Aug-18 11:56 
QuestionIs this just for numerical data points? Pin
Member 129998714-May-18 22:13
MemberMember 129998714-May-18 22:13 
AnswerRe: Is this just for numerical data points? Pin
Rodrigo Costa Camargos20-Aug-18 11:46
MemberRodrigo Costa Camargos20-Aug-18 11:46 
QuestionUse different distance Calculation such as Haversine Pin
Member 831247920-Mar-17 22:13
MemberMember 831247920-Mar-17 22:13 
AnswerRe: Use different distance Calculation such as Haversine Pin
Rodrigo Costa Camargos7-Aug-17 10:55
MemberRodrigo Costa Camargos7-Aug-17 10:55 
GeneralMy vote of 5 Pin
Dmitriy Gakh9-Sep-16 19:55
professionalDmitriy Gakh9-Sep-16 19:55 
PraiseMore article like this are welcomed :) Pin
wmjordan29-Aug-16 2:39
professionalwmjordan29-Aug-16 2:39 
GeneralRe: More article like this are welcomed :) Pin
Rodrigo Costa Camargos29-Aug-16 9:35
MemberRodrigo Costa Camargos29-Aug-16 9:35 
GeneralRe: More article like this are welcomed :) Pin
wmjordan29-Aug-16 14:38
professionalwmjordan29-Aug-16 14:38 
QuestionMy vote of 5 Pin
Erik Ha28-Aug-16 22:12
MemberErik Ha28-Aug-16 22:12 
AnswerRe: My vote of 5 Pin
Rodrigo Costa Camargos29-Aug-16 9:39
MemberRodrigo Costa Camargos29-Aug-16 9:39 

General General    News News    Suggestion Suggestion    Question Question    Bug Bug    Answer Answer    Joke Joke    Praise Praise    Rant Rant    Admin Admin   

Use Ctrl+Left/Right to switch messages, Ctrl+Up/Down to switch threads, Ctrl+Shift+Left/Right to switch pages.