Step-by-Step Guide to Implement Machine Learning X - KMeans

Ryukkkk

4.33/5 (5 votes)

Jun 3, 2019

CPOL

2 min read

5745

Easy to implement machine learning

Introduction

KMeans is a simple clustering algorithm, which calculates the distance between samples and centroid to generate clusters. K is the number of clusters given by user. At the initial time, the K clusters are chosen randomly, they are adjusted at each iteration to get the optimal results. The centroid is the mean of samples in its corresponding cluster, thus, the algorithm is called K"Means".

KMeans Model

KMeans

The normal KMeans algorithm is really simple. Denote the K clusters as $\mu_{1},\mu_{2},...\mu_{k}$ , and the number of samples in each cluster as $N_{1},N_{2},...,N_{k}$ . The loss function of KMeans is:

$J\left(\mu_{1},\mu_{2},...\mu_{k}\right)=\frac{1}{2}\sum_{j=1}^{K}\sum_{i=1}^{N_{j}}\left(x_{i}-\mu_{j}\right)^{2}$ .

Calculate the derivative of loss function:

$\frac{\partial J}{\partial\mu_{j}}=-\sum_{i=1}^{N_{j}}\left(x_{i}-\mu_{j}\right)$

Then, make the derivative equal to 0, we can obtain:

$\mu_{j}=\frac{1}{N_{j}}\sum_{i=1}^{N_{j}}x_{i}$

namely, the centroid is the means of samples in its corresponding cluster. The code of KMeans is shown below:

    def kmeans(self, train_data, k):
        sample_num = len(train_data)
        distances = np.zeros([sample_num, 2])                      # (index, distance)
        centers = self.createCenter(train_data)
        centers, distances = self.adjustCluster(centers, distances, train_data, self.k)
        return centers, distances

where adjustCluster() is to adjust centroid after determining the initial centroids, which aims to minimize the loss function. The code of adjustCluster is:

    def adjustCluster(self, centers, distances, train_data, k):
        sample_num = len(train_data)
        flag = True  # If True, update cluster_center
        while flag:
            flag = False
            d = np.zeros([sample_num, len(centers)])
            for i in range(len(centers)):
                # calculate the distance between each sample and each cluster center
                d[:, i] = self.calculateDistance(train_data, centers[i])

            # find the minimum distance between each sample and each cluster center
            old_label = distances[:, 0].copy()
            distances[:, 0] = np.argmin(d, axis=1)
            distances[:, 1] = np.min(d, axis=1)
            if np.sum(old_label - distances[:, 0]) != 0:
                flag = True
                # update cluster_center by calculating the mean of each cluster
                for j in range(k):
                    current_cluster = 
                          train_data[distances[:, 0] == j]  # find the samples belong 
                                                            # to the j-th cluster center
                    if len(current_cluster) != 0:
                        centers[j, :] = np.mean(current_cluster, axis=0)
        return centers, distances

Bisecting KMeans

Because KMeans may get a local optimation result, to solve the problem, we introduce another KMeans algorithm called bisecting KMeans. In bisecting KMeans, all the samples are regarded as a cluster at initial, and the cluster is divided into two parts. Then, choose one part of the divided cluster to bisect again and again till the number of cluster is K. We bisect the cluster according to minimum Sum of Squared Error(SSE). Denote the current n clusters as:

$C=\left\{c_{1},c_{2},...,c_{n}\right\},n<k$

We choose a cluster $c_{i}$ in $C$ and bisect it into two parts $c_{i1},c_{i2}$ using normal KMeans. The SSE is:

$SSE_{i}=SSE\left(c_{i1},c_{i2}\right)+SSE\left(C-c_{i}\right)$

We choose the $c_{i}$ which can get a minimum of SSE as the cluster to be bisected, namely:

$index =arg\min SSE_{i}$

Repeat the above processes till the number of clusters is K.

    def biKmeans(self, train_data):
        sample_num = len(train_data)
        distances = np.zeros([sample_num, 2])         # (index, distance)
        initial_center = np.mean(train_data, axis=0)  # initial cluster #shape (1, feature_dim)
        centers = [initial_center]                    # cluster list

        # clustering with the initial cluster center
        distances[:, 1] = np.power(self.calculateDistance(train_data, initial_center), 2)

        # generate cluster centers
        while len(centers) < self.k:
            # print(len(centers))
            min_SSE  = np.inf
            best_index = None                                                    
            best_centers = None                                                  
            best_distances = None                                                

            # find the best split
            for j in range(len(centers)):
                centerj_data = train_data[distances[:, 0] == j]   # find the samples belonging
                                                                  # to the j-th center
                split_centers, split_distances = self.kmeans(centerj_data, 2)    
                split_SSE = np.sum(split_distances[:, 1]) ** 2    # calculate the distance 
                                                                  # for after clustering
                other_distances = distances[distances[:, 0] != j] # the samples don't belong 
                                                                  # to j-th center
                other_SSE = np.sum(other_distances[:, 1]) ** 2    # calculate the distance 
                                                                  # don't belong to j-th center

                # save the best split result
                if (split_SSE + other_SSE) < min_SSE:
                    best_index = j                                               
                    best_centers = split_centers                                 
                    best_distances = split_distances                             
                    min_SSE = split_SSE + other_SSE

            # save the spilt data
            best_distances[best_distances[:, 0] == 1, 0] = len(centers)         
            best_distances[best_distances[:, 0] == 0, 0] = best_index           

            centers[best_index] = best_centers[0, :]                           
            centers.append(best_centers[1, :])                                  
            distances[distances[:, 0] == best_index, :] = best_distances        
        centers = np.array(centers)   # transform form list to array
        return centers, distances

KMeans++

Because the initial centroid has great effect on the performance of KMeans, to solve the problem, we introduce another KMeans algorithm called KMeans++. Denote the current n clusters as:

$C=\left\{c_{1},c_{2},...,c_{n}\right\},n<k$

When we choose the (n+1)-th centroid, the farther from the existing centroids the sample is, the more probable it will be chosen as the new centroid. First, we calculate the minimum distance between each sample and the existing centroids:

$D\left(x_{i}\right)=\min\limits_{c_{j}\in C}\left(x_{i}-c_{j}\right)$

Then, calculate the probability of each sample to be chosen as the next centroid:

$p_{i}=\frac{D\left(x_{i}\right)^{2}}{\sum_{x_{i}\in X}D\left(x_{i}\right)^{2}}$

Then, we use roulette wheel selection to get the next centroid. After determining K clusters, run adjustCluster() to adjust the result.

    def kmeansplusplus(self,train_data):
        sample_num = len(train_data)
        distances = np.zeros([sample_num, 2])       # (index, distance)

        # randomly select a sample as the initial cluster
        initial_center = train_data[np.random.randint(0, sample_num-1)]
        centers = [initial_center]

        while len(centers) < self.k:
            d = np.zeros([sample_num, len(centers)])
            for i in range(len(centers)):
                # calculate the distance between each sample and each cluster center
                d[:, i] = self.calculateDistance(train_data, centers[i])

            # find the minimum distance between each sample and each cluster center
            distances[:, 0] = np.argmin(d, axis=1)
            distances[:, 1] = np.min(d, axis=1)

            # Roulette Wheel Selection
            prob = np.power(distances[:, 1], 2)/np.sum(np.power(distances[:, 1], 2))
            index = self.rouletteWheelSelection(prob, sample_num)
            new_center = train_data[index, :]
            centers.append(new_center)

        # adjust cluster
        centers = np.array(centers)   # transform form list to array
        centers, distances = self.adjustCluster(centers, distances, train_data, self.k)
        return centers, distances

Conclusion and Analysis

Indeed, it is necessary to adjust clusters after determining how to choose the parameter 'K', there exist some algorithms. Finally, let's compare the performances of three kinds of clustering algorithms.

It can be found that KMeans++ has the best performance.

The related code and dataset in this article can be found in MachineLearning.

History

3^rd June, 2019: Initial version