## Introduction

KMeans is a simple clustering algorithm, which calculates the distance between samples and centroid to generate clusters. K is the number of clusters given by user. At the initial time, the K clusters are chosen randomly, they are adjusted at each iteration to get the optimal results. The centroid is the mean of samples in its corresponding cluster, thus, the algorithm is called K"Means".

## KMeans Model

### KMeans

The normal KMeans algorithm is really simple. Denote the K clusters as , and the number of samples in each cluster as . The loss function of KMeans is:

.

Calculate the derivative of loss function:

Then, make the derivative equal to 0, we can obtain:

namely, the centroid is the means of samples in its corresponding cluster. The code of KMeans is shown below:

def kmeans(self, train_data, k): sample_num = len(train_data) distances = np.zeros([sample_num, 2]) # (index, distance) centers = self.createCenter(train_data) centers, distances = self.adjustCluster(centers, distances, train_data, self.k) return centers, distances

where `adjustCluster()`

is to adjust centroid after determining the initial centroids, which aims to minimize the loss function. The code of `adjustCluster`

is:

def adjustCluster(self, centers, distances, train_data, k): sample_num = len(train_data) flag = True # If True, update cluster_center while flag: flag = False d = np.zeros([sample_num, len(centers)]) for i in range(len(centers)): # calculate the distance between each sample and each cluster center d[:, i] = self.calculateDistance(train_data, centers[i]) # find the minimum distance between each sample and each cluster center old_label = distances[:, 0].copy() distances[:, 0] = np.argmin(d, axis=1) distances[:, 1] = np.min(d, axis=1) if np.sum(old_label - distances[:, 0]) != 0: flag = True # update cluster_center by calculating the mean of each cluster for j in range(k): current_cluster = train_data[distances[:, 0] == j] # find the samples belong # to the j-th cluster center if len(current_cluster) != 0: centers[j, :] = np.mean(current_cluster, axis=0) return centers, distances

### Bisecting KMeans

Because KMeans may get a local optimation result, to solve the problem, we introduce another KMeans algorithm called bisecting KMeans. In bisecting KMeans, all the samples are regarded as a cluster at initial, and the cluster is divided into two parts. Then, choose one part of the divided cluster to bisect again and again till the number of cluster is K. We bisect the cluster according to minimum **Sum of Squared Error**(SSE). Denote the current * n* clusters as:

We choose a cluster in and bisect it into two parts using normal KMeans. The SSE is:

We choose the which can get a minimum of SSE as the cluster to be bisected, namely:

Repeat the above processes till the number of clusters is K.

def biKmeans(self, train_data): sample_num = len(train_data) distances = np.zeros([sample_num, 2]) # (index, distance) initial_center = np.mean(train_data, axis=0) # initial cluster #shape (1, feature_dim) centers = [initial_center] # cluster list # clustering with the initial cluster center distances[:, 1] = np.power(self.calculateDistance(train_data, initial_center), 2) # generate cluster centers while len(centers) < self.k: # print(len(centers)) min_SSE = np.inf best_index = None best_centers = None best_distances = None # find the best split for j in range(len(centers)): centerj_data = train_data[distances[:, 0] == j] # find the samples belonging # to the j-th center split_centers, split_distances = self.kmeans(centerj_data, 2) split_SSE = np.sum(split_distances[:, 1]) ** 2 # calculate the distance # for after clustering other_distances = distances[distances[:, 0] != j] # the samples don't belong # to j-th center other_SSE = np.sum(other_distances[:, 1]) ** 2 # calculate the distance # don't belong to j-th center # save the best split result if (split_SSE + other_SSE) < min_SSE: best_index = j best_centers = split_centers best_distances = split_distances min_SSE = split_SSE + other_SSE # save the spilt data best_distances[best_distances[:, 0] == 1, 0] = len(centers) best_distances[best_distances[:, 0] == 0, 0] = best_index centers[best_index] = best_centers[0, :] centers.append(best_centers[1, :]) distances[distances[:, 0] == best_index, :] = best_distances centers = np.array(centers) # transform form list to array return centers, distances

### KMeans++

Because the initial centroid has great effect on the performance of KMeans, to solve the problem, we introduce another KMeans algorithm called KMeans++. Denote the current * n *clusters as:

When we choose the (n+1)-th centroid, the farther from the existing centroids the sample is, the more probable it will be chosen as the new centroid. First, we calculate the minimum distance between each sample and the existing centroids:

Then, calculate the probability of each sample to be chosen as the next centroid:

Then, we use roulette wheel selection to get the next centroid. After determining K clusters, run `adjustCluster()`

to adjust the result.

def kmeansplusplus(self,train_data): sample_num = len(train_data) distances = np.zeros([sample_num, 2]) # (index, distance) # randomly select a sample as the initial cluster initial_center = train_data[np.random.randint(0, sample_num-1)] centers = [initial_center] while len(centers) < self.k: d = np.zeros([sample_num, len(centers)]) for i in range(len(centers)): # calculate the distance between each sample and each cluster center d[:, i] = self.calculateDistance(train_data, centers[i]) # find the minimum distance between each sample and each cluster center distances[:, 0] = np.argmin(d, axis=1) distances[:, 1] = np.min(d, axis=1) # Roulette Wheel Selection prob = np.power(distances[:, 1], 2)/np.sum(np.power(distances[:, 1], 2)) index = self.rouletteWheelSelection(prob, sample_num) new_center = train_data[index, :] centers.append(new_center) # adjust cluster centers = np.array(centers) # transform form list to array centers, distances = self.adjustCluster(centers, distances, train_data, self.k) return centers, distances

## Conclusion and Analysis

Indeed, it is necessary to adjust clusters after determining how to choose the parameter 'K', there exist some algorithms. Finally, let's compare the performances of three kinds of clustering algorithms.

It can be found that KMeans++ has the best performance.

The related code and dataset in this article can be found in MachineLearning.

## History

- 3
^{rd}June, 2019: Initial version