Posted 12 May 2019



Step-by-Step Guide To Implement Machine Learning III - Naive Bayes

Easy to implement machine learning

This article is an entry in our Machine Learning and Artificial Intelligence Challenge.


Naive Bayes is a classification method based on Bayesian decision theory and the assumption of conditional independence between features. It learns the prior probability of each label and the conditional probability of each feature from the training set and uses them as the detection model. For a given test object, the prediction is the label with the maximum posterior probability. Maximizing the posterior probability is equivalent to minimizing the expected risk. So why is it called "Naive" Bayes? Because it follows a naive hypothesis: all the features used for classification are independent of each other once the label is given, that is:

P\left(X = x| Y = c_{k}\right)=P\left(X^{\left(1\right)}=x^{\left(1\right)},...,X^{\left(n\right)}=x^{\left(n\right)}|Y= c_{k}\right)=\prod_{j=1}^{n}P\left( X^{\left(j\right)}=x^{\left(j\right)}|Y=c_{k}\right)

where x^{(j)} is the value of the j-th feature and c_{k} is the k-th label. Then, the Bayes classifier can be defined as:

y =arg\max \limits_{c_{k}}P\left(Y=c_{k}\right)\prod_{j}P\left(X^{\left(j\right)}=x^{\left(j\right)}|Y=c_{k}\right)
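As a minimal sketch of this decision rule, the snippet below scores each label by its prior multiplied by the per-feature conditional probabilities and returns the label with the highest score. The probability tables and the names `priors`, `conditionals` and `predict` are made up for illustration and are not part of the article's code.

```python
from math import prod

# Hypothetical, hand-written probability tables for a toy problem
# with two labels and two categorical features.
priors = {"spam": 0.4, "ham": 0.6}
conditionals = {
    # conditionals[label][j][value] = P(X(j) = value | Y = label)
    "spam": [{"yes": 0.8, "no": 0.2}, {"short": 0.7, "long": 0.3}],
    "ham": [{"yes": 0.1, "no": 0.9}, {"short": 0.4, "long": 0.6}],
}


def predict(sample):
    """Return the label maximizing P(Y=c) * prod_j P(X(j)=x(j) | Y=c)."""
    scores = {
        label: priors[label]
        * prod(conditionals[label][j][value] for j, value in enumerate(sample))
        for label in priors
    }
    return max(scores, key=scores.get), scores


label, scores = predict(["yes", "short"])
print(label, scores)
```

The scores are not normalized, so they do not sum to 1, but the normalizing constant P(X = x) is the same for every label and does not change which label wins.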

So why does maximizing the posterior probability minimize the expected risk? Let the loss function be the 0-1 loss:

L\left(Y,f\left(X\right)\right)=\left\{\begin{aligned} 1,Y\ne f\left(X\right)\\ 0,Y=f\left(X\right) \end{aligned}\right.

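where f(X) is the decision function. Then, the expected risk is

R_{exp}\left(f\right)=E\left[L\left(Y,f\left(X\right)\right)\right]

which is calculated from the joint distribution P(X,Y). Thus the conditional expectation is:

R_{exp}\left(f\right)=E_{X}\sum_{k=1}^{K}L\left(c_{k},f\left(X\right)\right)P\left(c_{k}|X\right)

To minimize the expected risk, it is enough to minimize the inner sum for each X = x, namely: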

f\left(x\right) =arg\min\limits_{y\in Y}\sum_{k=1}^{K}L\left(c_{k},y\right)P\left(c_{k}|X=x\right)\\ =arg\min\limits_{y\in Y} \sum_{k=1}^{K}P\left(y \ne c_{k}|X=x\right)\\ =arg\min\limits_{y\in Y}\left(1-P\left(y = c_{k}|X=x\right)\right)\\ =arg\max\limits_{y\in Y}P\left(y = c_{k}|X=x\right)
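The equivalence can be checked numerically. The sketch below assumes a made-up posterior distribution for one fixed input x and the 0-1 loss (1 for a wrong prediction, 0 for a correct one). It computes the expected loss of every candidate prediction and compares the minimizer with the label of maximum posterior probability. All names are hypothetical.

```python
# Hypothetical posterior P(c_k | X = x) for one fixed input x.
posterior = {"a": 0.2, "b": 0.5, "c": 0.3}


def zero_one_loss(true_label, predicted_label):
    """0-1 loss: 1 for a wrong prediction, 0 for a correct one."""
    return 0 if true_label == predicted_label else 1


def conditional_risk(predicted_label):
    """Expected 0-1 loss of predicting `predicted_label` at this x."""
    return sum(
        zero_one_loss(c_k, predicted_label) * p for c_k, p in posterior.items()
    )


risks = {y: conditional_risk(y) for y in posterior}
best_by_risk = min(risks, key=risks.get)
best_by_posterior = max(posterior, key=posterior.get)
print(risks)
print(best_by_risk, best_by_posterior)
```

Each risk equals one minus the posterior probability of the predicted label, so the label with the smallest risk is the label with the largest posterior.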

Naive Bayes Model

The Naive Bayes model consists of two parts: parameter estimation and classification.

Parameters Estimation

In the training process, learning means estimating the prior probability and the conditional probabilities. Maximum likelihood estimation (MLE) is a general method to obtain these parameters. The MLE of the prior probability is given by:

P\left( Y=c_{k}\right)=\frac{\sum_{i=1}^{N}{I\left(y_{i}=c_{k}\right)}}{N}

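Denote the set of possible values of the j-th feature by {a_{j1},a_{j2},...,a_{jS_{j}}}. Then, the MLE of the conditional probability is given by:

P\left(X^{\left(j\right)}=a_{jl}|Y=c_{k}\right)=\frac{\sum_{i=1}^{N}I\left(x_{i}^{\left(j\right)}=a_{jl},y_{i}=c_{k}\right)}{\sum_{i=1}^{N}I\left(y_{i}=c_{k}\right)}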

In the Naive Bayes training process, the prior probability and the conditional probabilities are calculated. However, if a value of a feature never occurs together with a label in the training set, its estimated conditional probability is zero, which forces the whole posterior product for that label to zero. To solve the problem, we introduce Laplace smoothing: add a positive constant \lambda (usually 1) to the frequency of each value of the random variable.

Then, the Bayesian estimation of prior probability is:

P_{\lambda}\left( Y=c_{k}\right)=\frac{\sum_{i=1}^{N}{I\left(y_{i}=c_{k}\right)+\lambda}}{N+K\lambda}

where N is the number of samples and K is the number of unique labels. The code for the prior probability is shown below:

prior_probability = {}
for key in range(len(label_value)):
    # Laplace-smoothed prior: (label count + lambda) / (N + K * lambda)
    prior_probability[label_value[key][0]] = (
        (label_value[key][1] + self.laplace) / (N + K * self.laplace))
self.prior_probability = prior_probability

where label_value is a list of (label, label_num) tuples, one per unique label.
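For readers who want to try the formula on its own, here is a self-contained sketch of the smoothed prior. It is not the article's class method: the function name `smoothed_priors` and the toy labels are assumptions, and the counts come from `collections.Counter` instead of `label_value`.

```python
from collections import Counter


def smoothed_priors(labels, laplace=1.0):
    """Estimate P_lambda(Y = c_k) = (count_k + lambda) / (N + K * lambda)."""
    counts = Counter(labels)
    n_samples = len(labels)  # N
    n_labels = len(counts)   # K
    return {
        label: (count + laplace) / (n_samples + n_labels * laplace)
        for label, count in counts.items()
    }


toy_labels = [1, 1, 1, -1]
print(smoothed_priors(toy_labels, laplace=0))  # plain MLE
print(smoothed_priors(toy_labels, laplace=1))  # Laplace smoothing
```

With laplace=0 the function reduces to the MLE, and for any laplace value the returned probabilities still sum to 1.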

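Similarly, the Bayesian estimation of conditional probability is:

P_{\lambda}\left(X^{\left(j\right)}=a_{jl}|Y=c_{k}\right)=\frac{\sum_{i=1}^{N}I\left(x_{i}^{\left(j\right)}=a_{jl},y_{i}=c_{k}\right)+\lambda}{\sum_{i=1}^{N}I\left(y_{i}=c_{k}\right)+S_{j}\lambda}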

The code for the conditional probability is shown below. For each feature, a matrix stores the conditional probabilities, and S[j] is the number of unique values of the j-th feature.

# calculate the conditional probability
prob = []
for j in range(feature_dim):
    # count[k][t] = number of samples with x(j) = k-th value and y = t-th label
    count = np.zeros([S[j], len(label_count)])
    feature_temp = train_data[:, j]
    feature_value_temp = feature_value[j]
    for i in range(len(feature_temp)):
        for k in range(len(feature_value_temp)):
            for t in range(len(label_count)):
                if (feature_temp[i] == feature_value_temp[k]
                        and train_label[i] == label_value[t][0]):
                    count[k][t] += 1  # x = value and y = label
    # turn the counts into Laplace-smoothed conditional probabilities
    for m in range(len(label_value)):
        count[:, m] = (count[:, m] + self.laplace) / (
            label_value[m][1] + self.laplace * S[j])
    prob.append(count)  # keep the table of the j-th feature
self.conditional_probability = prob
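The same smoothed tables can also be built in a single pass over the data. The following is a sketch under stated assumptions, not the article's implementation: it uses plain Python lists and `collections.Counter` instead of NumPy arrays, and the names `conditional_tables`, `toy_rows` and `toy_labels` are hypothetical.

```python
from collections import Counter, defaultdict


def conditional_tables(rows, labels, laplace=1.0):
    """Return tables[j][(value, label)] = P_lambda(X(j) = value | Y = label)."""
    label_counts = Counter(labels)
    n_features = len(rows[0])
    # One pass over the data: count every (feature index, value, label) triple.
    joint = Counter()
    values = defaultdict(set)
    for row, label in zip(rows, labels):
        for j, value in enumerate(row):
            joint[(j, value, label)] += 1
            values[j].add(value)
    tables = []
    for j in range(n_features):
        s_j = len(values[j])  # number of distinct values of feature j
        tables.append({
            (value, label): (joint[(j, value, label)] + laplace)
            / (label_counts[label] + s_j * laplace)
            for value in values[j]
            for label in label_counts
        })
    return tables


toy_rows = [["sunny", "hot"], ["sunny", "mild"], ["rainy", "mild"], ["rainy", "hot"]]
toy_labels = ["no", "no", "yes", "no"]
tables = conditional_tables(toy_rows, toy_labels, laplace=1.0)
print(tables[0][("sunny", "yes")])
```

In the toy data, "sunny" never occurs with the label "yes", yet its smoothed probability is (0 + 1) / (1 + 2) instead of zero.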


After calculating the prior probability and conditional probability, the Bayesian classification model is:

y =arg\max \limits_{c_{k}}P\left(Y=c_{k}\right)\prod_{j}P\left(X^{\left(j\right)}=x^{\left(j\right)}|Y=c_{k}\right)

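The classification code is shown below. predict is a dictionary that maps each label to its score, i.e. the prior probability multiplied by the conditional probabilities of the sample's feature values. The prediction is the label with the highest score, which can be obtained by sorting predict in descending order of score and taking the first element.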

def classify(self, sample):
    predict = {}
    for m in range(len(self.label_value)):
        # start from the prior probability of the m-th label in label_value
        temp = self.prior_probability[self.label_value[m][0]]
        for n in range(len(sample)):
            if sample[n] in self.feature_value[n]:
                index = np.where(self.feature_value[n] == sample[n])[0][0]
                temp = temp * self.conditional_probability[n][index][m]
            else:
                # the value of the feature is not in the training set:
                # multiply by the fallback laplace / (S[n] * laplace)
                temp = temp * self.laplace / (self.S[n] * self.laplace)
        predict[self.label_value[m][0]] = temp
    return predict
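Two follow-up steps are worth sketching. The first picks the final label from the dictionary that classify() returns. The second computes the score in log space, since a product of many small probabilities can underflow to 0.0 in floating point. The helper names `best_label` and `log_score` and the numbers are hypothetical, and the log variant assumes every probability is positive, which holds when \lambda > 0.

```python
import math


def best_label(predict):
    """Pick the label with the highest score from a {label: score} dict."""
    return max(predict, key=predict.get)


def log_score(prior, conditional_probs):
    """Sum of logs; equals the log of prior * product of conditionals."""
    return math.log(prior) + sum(math.log(p) for p in conditional_probs)


# Hypothetical scores in the same shape that classify() returns.
predict = {1: 0.024, -1: 0.224}
print(best_label(predict))

# With many features the plain product underflows, the log sum does not.
many = [1e-5] * 100
print(math.prod(many), log_score(0.5, many))
```

Because the logarithm is monotonically increasing, comparing log scores selects the same label as comparing the original products.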

Conclusion and Analysis

The Bayesian model in this article is the Bernoulli Bayesian model. Besides it, there are other Bayesian models such as the Gaussian Bayesian model and the multinomial Bayesian model. Finally, let's compare our Bayesian model with the Bayes model in Sklearn. The detection performance is displayed below:

[Image 13: detection performance of our Bayesian model and the Bayes model in Sklearn]

It is found that both methods achieve poor detection results. Moreover, our Bayesian model takes a longer runtime, possibly because the computation of the conditional probability contains too many nested loops.

The related code and dataset in this article can be found in MachineLearning.


This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


About the Author

Germany
Ryuk is interested in Machine Learning/Signal Processing/VoIP.
