## Introduction

Naive Bayes is a kind of classification based on Bayesian decision theory and feature conditional independence, which calculates the probability distribution based on conditional independence on training set as the detection model. For a given test object, the label of the maximum of the posterior probability is the prediction of the test object. Maximize the posterior probability means minimizing the expected risk. Then another question is why call it "**Naive**" Bayes? This is because Naive Bayes follow such a naive hypothesis:** all the features for classification are independent when the label is definitized**, which is given by:

where *x*^{(j)} is the i-th feature, *c*_{k} is the *k*-th

label. Then, the Bayes classifier can be defined as:

So why maximize the posterior probability means minimizing the expected risk ? Let the loss is 0-1 loss function is

where `f(x)`

is the decision function. Then, the expected risk is

which is calculated from joint distribution `P(X,Y)`

. Thus the conditional expectation is:

To minimize the expected risk, it needs to minimize each *X = x*

, namely:

## Naive Bayes Model

Naive Bayes model consists of parameters estimation and classify.

### Parameters Estimation

In the training process, learning means estimate the prior probability and conditional probability. **Maximum likelihood estimation** (MLE) is a general method to get the above parameters. The MLE of prior probability is given by:

Denote the` `

`j-th`

feature set is` `

{*a*_{j1},*a*_{j2},...,*a*_{jsi}}.Then, the MLE of conditional probability is given by:

In the Naive Bayes training process, the prior probability and conditional probability is calculated. However, if a value of a feature has never occurred in the training set, it's probability is equal to zero, which will effect the result of posterior probability. To solve the problem, we introduce **Laplace smoothing**: **add an integer to the frequency of each random variable**.

Then, the Bayesian estimation of prior probability is:

where *N*

is the number of unique labels, the *K *

is the number of samples. The code of prior probability is shown below:

prior_probability = {}
for key in range(len(label_value)):
prior_probability[label_value[key][0]] =
(label_value[key][1] + self.laplace) / (N + K * self.laplace)
self.prior_probability = prior_probability

where ` label_value `

is the tuple of ` (label, label_num)`

.

Similarly, the Bayesian estimation of conditional probability is:

The code of conditional probability is shown below. A matrix is applied to save the conditional probability and` S[j]`

is the number of unique labels of the `j-th`

feature.

prob = []
for j in range(feature_dim):
count = np.zeros([S[j], len(label_count)])
feature_temp = train_data[:, j]
feature_value_temp = feature_value[j]
for i in range(len(feature_temp)):
for k in range(len(feature_value_temp)):
for t in range(len(label_count)):
if feature_temp[i] == feature_value_temp[k]
and train_label[i] == label_value[t][0]:
count[k][t] += 1
for m in range(len(label_value)):
count[:, m] = (count[:, m] + self.laplace) /
(label_value[m][1] + self.laplace*S[j])
prob.append(count)
self.conditional_probability = prob

### Classify

After calculating the prior probability and conditional probability, the Bayesian classification model is:

The classification code is shown below. The `predict`

is a dictionary which includes the probability of each label. Then we just need to sort the `predict `

and the prediction is the first element in the sorted dictionary.

def classify(self, sample):
predict = {}
for m in range(len(self.label_value)):
temp = self.prior_probability
[self.label_value[m][0]]
for n in range(len(sample)):
if sample[n] in self.feature_value[n]:
index = np.where(self.feature_value[n] == sample[n])[0][0]
temp = temp * self.conditional_probability[n][index][m]
else:
temp = self.laplace /
(self.S[n] * self.laplace)
predict[self.label_value[m][0]] = temp
return predict

## Conclusion and Analysis

The Bayesian model is this article is Berniulli Bayesian model. Except that, there are other Bayesian model such as Guassian Bayesian model, Polynomial Bayesian model. Finally, let's compare our Bayesian model with the Bayes model in Sklearn and the detection performance is displayed below:

It is found that both methods achieve poor detection results. Moreover, our Bayesian model takes a longer runtime, which may be that the algorithm of conditional probability contains too many loops.

The related code and dataset in this article can be found in MachineLearning.