Introduction

Naive Bayes is a kind of classification based on Bayesian decision theory and feature conditional independence, which calculates the probability distribution based on conditional independence on training set as the detection model. For a given test object, the label of the maximum of the posterior probability is the prediction of the test object. Maximize the posterior probability means minimizing the expected risk. Then another question is why call it "Naive" Bayes? This is because Naive Bayes follow such a naive hypothesis: all the features for classification are independent when the label is definitized, which is given by:

$P\left(X = x| Y = c_{k}\right)=P\left(X^{\left(1\right)}=x^{\left(1\right)},...,X^{\left(n\right)}|Y= c_{k}\right)=\prod_{j=1}^{n}P\left( X^{\left(j\right)}=x^{\left(j\right)}|Y=c_{k}\right)$

where x^(j) is the i-th feature, c_k is the k-th label. Then, the Bayes classifier can be defined as:

$y =arg\max \limits_{c_{k}}P\left(Y=c_{k}\right)\prod_{j}P\left(X^{\left(j\right)}=x^{\left(j\right)}|Y=c_{k}\right)$

So why maximize the posterior probability means minimizing the expected risk ? Let the loss is 0-1 loss function is

$L\left(Y,f\left(X\right)\right)=\left\{\begin{aligned} 0,Y\ne f\left(X\right)\\ 1,Y=f\left(X\right) \end{aligned}\right.$

where f(x) is the decision function. Then, the expected risk is

$R_{exp}\left(f\right)=E\left[L\left(Y,f\left(X\right)\right)\right]$

which is calculated from joint distribution P(X,Y). Thus the conditional expectation is:

$R_{exp}\left(f\right)=E_x\sum_{k=1}^{K}\left[L\left(c_{k},f\left(X\right)\right)\right]P\left(c_{k}|X\right)$

To minimize the expected risk, it needs to minimize each X = x, namely:

$f\left(x\right) =arg\min\limits_{y\in Y}\sum_{k=1}^{K}L\left(c_{k},y\right)P\left(c_{k}|X=x\right)\\ =arg\min\limits_{y\in Y} \sum_{k=1}^{K}P\left(y \ne c_{k}|X=x\right)\\ =arg\min\limits_{y\in Y}\left(1-P\left(y = c_{k}|X=x\right)\right)\\ =arg\min\limits_{y\in Y}P\left(y = c_{k}|X=x\right)$

Naive Bayes Model

Naive Bayes model consists of parameters estimation and classify.

Parameters Estimation

In the training process, learning means estimate the prior probability and conditional probability. Maximum likelihood estimation (MLE) is a general method to get the above parameters. The MLE of prior probability is given by:

$P\left( Y=c_{k}\right)=\frac{\sum_{i=1}^{N}{I\left(y_{i}=c_{k}\right)}}{N}$

Denote the j-th feature set is {a_j1,a_j2,...,a_jsi}.Then, the MLE of conditional probability is given by:

$P_{\lambda}\left(X^{(j)}=a_{jl}|Y=c_{k}\right)=\frac{\sum_{i=1}^{N}I\left(x_{i}^{(j)}=a_{jl},y_{i}=c_{k}\right)}{\sum_{i=1}^{N}I\left(y_{i}=c_{k}\right)}$

In the Naive Bayes training process, the prior probability and conditional probability is calculated. However, if a value of a feature has never occurred in the training set, it's probability is equal to zero, which will effect the result of posterior probability. To solve the problem, we introduce Laplace smoothing: add an integer $\lambda$ to the frequency of each random variable.

Then, the Bayesian estimation of prior probability is:

$P_{\lambda}\left( Y=c_{k}\right)=\frac{\sum_{i=1}^{N}{I\left(y_{i}=c_{k}\right)+\lambda}}{N+K\lambda}$

where N is the number of unique labels, the K is the number of samples. The code of prior probability is shown below:

prior_probability = {}
for key in range(len(label_value)):
  prior_probability[label_value[key][0]] = 
    (label_value[key][1] + self.laplace) / (N + K * self.laplace)  # laplace smooth
self.prior_probability = prior_probability

where label_value is the tuple of (label, label_num).

Similarly, the Bayesian estimation of conditional probability is:

$P_{\lambda}\left(X^{(j)}=a_{jl}|Y=c_{k}\right)=\frac{\sum_{i=1}^{N}I\left(x_{i}^{(j)}=a_{jl},y_{i}=c_{k}\right)+\lambda}{\sum_{i=1}^{N}I\left(y_{i}=c_{k}\right)+S_{j}\lambda}$

The code of conditional probability is shown below. A matrix is applied to save the conditional probability and S[j] is the number of unique labels of the j-th feature.

 # calculate the conditional probability
 prob = []
 # calculate the count (x = a & y = c)
 for j in range(feature_dim):
     count = np.zeros([S[j], len(label_count)])  # the range of label start with 1
     feature_temp = train_data[:, j]
     feature_value_temp = feature_value[j]
     for i in range(len(feature_temp)):
         for k in range(len(feature_value_temp)):
             for t in range(len(label_count)):
                 if feature_temp[i] == feature_value_temp[k] 
                         and train_label[i] == label_value[t][0]:
                    count[k][t] += 1             # x = value and y = label
      # calculate the conditional probability
      for m in range(len(label_value)):
          count[:, m] = (count[:, m] + self.laplace) / 
                  (label_value[m][1] + self.laplace*S[j])  # laplace smoothing
          # print(count)
     prob.append(count)
 self.conditional_probability = prob

Classify

After calculating the prior probability and conditional probability, the Bayesian classification model is:

$y =arg\max \limits_{c_{k}}P\left(Y=c_{k}\right)\prod_{j}P\left(X^{\left(j\right)}=x^{\left(j\right)}|Y=c_{k}\right)$

The classification code is shown below. The predict is a dictionary which includes the probability of each label. Then we just need to sort the predict and the prediction is the first element in the sorted dictionary.

    def classify(self, sample):
        predict = {}
        for m in range(len(self.label_value)):
            temp = self.prior_probability
              [self.label_value[m][0]]  # get the prior_probability of m-th label in label_value
            for n in range(len(sample)):
                if sample[n] in self.feature_value[n]:
                    # print(m, n)
                    index = np.where(self.feature_value[n] == sample[n])[0][0]
                    temp = temp * self.conditional_probability[n][index][m]
                else:
                    temp = self.laplace / 
                         (self.S[n] * self.laplace)  # if the value of feature is 
                                        # not in training set, return the laplace smoothing
            predict[self.label_value[m][0]] = temp
        return predict

Conclusion and Analysis

The Bayesian model is this article is Berniulli Bayesian model. Except that, there are other Bayesian model such as Guassian Bayesian model, Polynomial Bayesian model. Finally, let's compare our Bayesian model with the Bayes model in Sklearn and the detection performance is displayed below:

It is found that both methods achieve poor detection results. Moreover, our Bayesian model takes a longer runtime, which may be that the algorithm of conditional probability contains too many loops.

The related code and dataset in this article can be found in MachineLearning.