14,970,219 members
Articles / Artificial Intelligence / Machine Learning
Article
Posted 12 May 2019

4.5K views
10 bookmarked

# Step-by-Step Guide To Implement Machine Learning III - Naive Bayes

Rate me:
Easy to implement machine learning

## Introduction

Naive Bayes is a kind of classification based on Bayesian decision theory and feature conditional independence, which calculates the probability distribution based on conditional independence on training set as the detection model. For a given test object, the label of the maximum of the posterior probability is the prediction of the test object. Maximize the posterior probability means minimizing the expected risk. Then another question is why call it "Naive" Bayes? This is because Naive Bayes follow such a naive hypothesis: all the features for classification are independent when the label is definitized, which is given by:

where x(j) is the i-th feature, ck is the k-th label. Then, the Bayes classifier can be defined as:

So why maximize the posterior probability means minimizing the expected risk ? Let the loss is 0-1 loss function is

where f(x) is the decision function. Then, the expected risk is

which is calculated from joint distribution P(X,Y). Thus the conditional expectation is:

To minimize the expected risk, it needs to minimize each X = x, namely:

## Naive Bayes Model

Naive Bayes model consists of parameters estimation and classify.

### Parameters Estimation

In the training process, learning means estimate the prior probability and conditional probability. Maximum likelihood estimation (MLE) is a general method to get the above parameters. The MLE of  prior probability is given by:

Denote the j-th feature set is {aj1,aj2,...,ajsi}.Then, the MLE of conditional probability is given by:

In the Naive Bayes training process, the prior probability and conditional probability is calculated. However, if a value of a feature has never occurred in the training set, it's probability is equal to zero, which will effect the result of posterior probability. To solve the problem, we introduce Laplace smoothing: add an integer to the frequency of each random variable.

Then, the Bayesian estimation of prior probability is:

where N is the number of unique labels, the K is the number of samples. The code of prior probability is shown below:

Python
prior_probability = {}
for key in range(len(label_value)):
prior_probability[label_value[key][0]] =
(label_value[key][1] + self.laplace) / (N + K * self.laplace)  # laplace smooth
self.prior_probability = prior_probability

where  label_value  is the tuple of  (label, label_num).

Similarly, the Bayesian estimation of conditional probability is:

The code of conditional probability is shown below. A matrix is applied to save the conditional probability and S[j] is the number of unique labels of the j-th feature.

Python
# calculate the conditional probability
prob = []
# calculate the count (x = a & y = c)
for j in range(feature_dim):
count = np.zeros([S[j], len(label_count)])  # the range of label start with 1
feature_temp = train_data[:, j]
feature_value_temp = feature_value[j]
for i in range(len(feature_temp)):
for k in range(len(feature_value_temp)):
for t in range(len(label_count)):
if feature_temp[i] == feature_value_temp[k]
and train_label[i] == label_value[t][0]:
count[k][t] += 1             # x = value and y = label
# calculate the conditional probability
for m in range(len(label_value)):
count[:, m] = (count[:, m] + self.laplace) /
(label_value[m][1] + self.laplace*S[j])  # laplace smoothing
# print(count)
prob.append(count)
self.conditional_probability = prob


### Classify

After calculating the prior probability and conditional probability, the Bayesian classification model is:

The classification code is shown below. The predict is a dictionary which includes the probability of each label. Then we just need to sort the predict and the prediction is the first element in the sorted dictionary.

Python
def classify(self, sample):
predict = {}
for m in range(len(self.label_value)):
temp = self.prior_probability
[self.label_value[m][0]]  # get the prior_probability of m-th label in label_value
for n in range(len(sample)):
if sample[n] in self.feature_value[n]:
# print(m, n)
index = np.where(self.feature_value[n] == sample[n])[0][0]
temp = temp * self.conditional_probability[n][index][m]
else:
temp = self.laplace /
(self.S[n] * self.laplace)  # if the value of feature is
# not in training set, return the laplace smoothing
predict[self.label_value[m][0]] = temp
return predict


## Conclusion and Analysis

The Bayesian model is this article is Berniulli Bayesian model. Except that, there are other Bayesian model such as Guassian Bayesian model, Polynomial Bayesian model. Finally, let's compare our Bayesian model with the Bayes model in Sklearn and the detection performance is displayed below:

It is found that both methods achieve poor detection results. Moreover, our Bayesian model takes a longer runtime, which may be that the algorithm of conditional probability contains too many loops.

The related code and dataset in this article can be found in MachineLearning.

## Share

 Engineer Germany
Ryuk is interested in Machine Learning/Signal Processing/VoIP.