You've Got Spam

Peter Leow

4.77/5 (10 votes)

Mar 2, 2018

CPOL

14 min read

17988

692

Design and implement a simple AI agent that can learn and fight the relentless spam plague.

Spam Plague

Since the advent of the Internet, spam have been prevalent and elusive in cyberspace. According to Oxford Dictionaries, the term Spam has its innocent origin as the trademark of

Quote:

A tinned meat product made mainly from ham.

as shown in Figure 1:

Figure 1: The Innocent Can of Spam

In the cyberspace, unfortunately, it becomes the infernal nuisance for all

Quote:

Irrelevant or unsolicited messages sent over the Internet, typically to a large number of users, for the purposes of advertising, phishing, spreading malware, etc.

Thanks to Oxford Dictionaries, I shall, metaphorically, refer to all legitimate messages as ham in this article.

Unless you are the spammer, the general sentiment is that people prefer ham to spam.

Even with the help of spam filtering software, some spam still manage to evade capture and arrive on our machine occasionally. Sieving out and clearing spam remains a chore of many digital-age mortals of today. With no way to stop the spammers, we can only place our hope on better and more intelligent spam filtering software to ward off the incessant spam. Taking on such a challenge will require equipping the traditional spam filtering software with better algorithms and the ability to learn autonomously.

In this article, we will embark on a project to explore the design and implementation of a simple AI agent that can learn and update the knowledge to determine whether a message is spam, given the words contained in that message. Let's call this agent Spam Doctor.

Problem Statement

Our learning journey will start with saying aloud the problem statement as follow:

Quote:

How do you determine that a message is probably a spam?

More often than not, you draw your conclusion from scanning the words contained in the message. Right? Based on experience, spam, though vary in topics and types, appear to share some common patterns in their choice of words. Based on a study by Symantec, it is observed that in spam:

Quote:

the popular words are fairly generic but all seem to be geared towards encouraging an immediate reaction, trying to get some sense of urgency.

Accordingly, we can reasonably hypothesize that "some words occur more frequently in spam than ham". It seems that we can leverage this hypothesis to distinguishing spam from ham. We can now rephrase the problem statement to read as:

Quote:

How probable that a message is a spam given that it contains certain words?

That sounds a lot clearer and more manageable. We now have a direction, but we are still unsure of what lies ahead, we are in need of a torch to show us the way. The torch that we are going to use here is Bayes' theorem.

A Quick Take on Bayes' Theorem

Life is full of uncertainties. Uncertainties beget anxiety as they prevent people from making decisions confidently. Be they trivial matters such as coin flipping and die rolling or life-and-death situations such as national security or medical diagnoses, people want to be assured of "how probable an event will or will not occur". To mitigate uncertainties, people may make inferences based on prior knowledge obtained from related empirical evidence, statistics, or a combination of personal opinion, experiences, and assessment.

For example, according to lung cancer statistics provided by Cancer Research UK,

Quote:

1 in 13 men and 1 in 17 women will be diagnosed with lung cancer during their lifetime.

In probability, the above statement can be expressed as follows:

$\begin{aligned} P("A\; man\; contracting\; lung\; cancer") = \frac{1}{13}\\ \\ P("A\; woman\; contracting\; lung\; cancer") = \frac{1}{17} \end{aligned}$

This type of probability, based solely on prior knowledge of the sample distribution, is sometimes called the prior probability. Without other supporting evidence, prior probability alone simply expresses a belief in a very general way, i.e. every man has an equal chance of contracting lung cancer and so does every woman, which we know it is hardly the case.

The lung cancer statistics provided by Cancer Research UK also found that:

Quote:

A person’s risk of developing cancer depends on many factors, including age, genetics, and exposure to risk factors (including some potentially avoidable lifestyle factors).

So apart from gender, this statement makes it clear that the chance of contracting lung cancer varies from person to person, depending on other evidence, such as age, genetics, smoking, diet, etc., presented in each person. We are more certain of the outcome if we can incorporate such evidence in the probability. Simply put it, we are looking for a way to find the probability of a person contracting lung cancer taking into account the present of any related risk factors such as:

$P("A\; person\; contracting\; lung\; cancer"|"Male,\; aged\; 75,\; smoker")$

where the $|$ denotes "given".

This leads us to Bayes' theorem...

Bayes' theorem describes a way to derive and update the probability of a hypothesis (a belief that an event will occur) taking into account the evidence for and against the hypothesis. This type of probability is sometimes called posterior probability as its value is derived from updating the prior probability taking into account the evidence for and against the hypothesis. The following formula shows the Bayes' rule in dealing with multiple hypotheses $H_{k}$ and multiple evidence $E_{n}$

$P( H_{i} |E_{1}E_{2}\cdot \cdot \cdot E_{n})= \frac{P(E_{1}|H_{i}) \times P(E_{2}|H_{i})\cdot \cdot \cdot \times P(E_{n}|H_{i})}{\sum_{k=1}^{m}P(E_{1}|H_{k}) \times P(E_{2}|H_{k})\cdot \cdot \cdot \times P(E_{n}|H_{k})\times P( H_{k})} \times P( H_{i})$

where

$P(H_{i})$ is the prior probability of hypothesis $i$.
$P(E_{n}|H_{i})$ is the conditional probability of evidence $n$ given that hypothesis is true.
$P(H_{i} |E_{1}E_{2}\cdot \cdot \cdot E_{n})$ is the posterior probability after taking into account evidence for and against hypothesis $i$.
The hypotheses must be mutually exclusive and exhaustive, i.e. $H_{j}\cap H_{j}= \oslash\; for\; i\neq j$, and $\sum_{k=1}^{m}P( H_{k})$ $= 1$.
The evidence are independent, i.e. $ P(E_{i}|E_{j})$ $= P(E_{i})$ $for\; i \neq j$.

The Bayes' rule relates the prior probability of hypothesis before getting the evidence, i.e. $P( H_{i})$, to the posterior probability of the hypothesis after getting the evidence, i.e. $P( H_{i} |E_{1}E_{2}\cdot \cdot \cdot E_{n})$. The factor that relates the two is sometimes called the likelihood ratio, i.e. $\frac{P(E_{1}|H_{i}) \times P(E_{2}|H_{i})\cdot \cdot \cdot \times P(E_{n}|H_{i})}{\sum_{k=1}^{m}P(E_{1}|H_{k}) \times P(E_{2}|H_{k})\cdot \cdot \cdot \times P(E_{n}|H_{k})\times P( H_{k})}$.

We will use Bayes' rule as the algorithm to compute the posterior probability of the hypothesis that "a message is spam" given the words contained in the message as evidence.

The Algorithm

We have found the algorithm, i.e. the Bayes' rule, for Spam Doctor. Let's implant this algorithm to Spam Doctor.

The Hypotheses

A message is either spam or ham, that gives us two mutually exclusive and exhaustive hypotheses as shown:

$H_{spam} : The\; message\; is\; spam\\ \\ H_{ham} : The\; message\; is\; ham\\ \\ P(H_{spam}) + P(H_{ham}) = 1$

The Evidence

The set of words, minus noises like stop words, punctuation, and HTML tags, contained in a message is the evidence. The evidence is defined as a set of words which means every word in it appears only once even it appears multiple times in the message. For example,

$if$

$Set\; of\; Words = \left \{ "pomp", "vexed", "hellas", "lurked" \right \}$

$then$

$E_{1} = "pomp"$

$E_{2} = "vexed"$

$E_{3} = "hellas"$

$E_{4} = "lurked"$

A Smarter Spam Doctor

Our Spam Doctor has become more intelligent with the implant of Bayes' rule in the following form:

$P( H_{spam} |E_{1}E_{2}\cdot \cdot \cdot E_{n})= \frac{P(E_{1}|H_{spam}) \times P(E_{2}|H_{spam})\cdot \cdot \cdot \times P(E_{n}|H_{spam})}{\sum_{k=ham}^{spam}P(E_{1}|H_{k}) \times P(E_{2}|H_{k})\cdot \cdot \cdot \times P(E_{n}|H_{k})\times P( H_{k})} \times P( H_{spam})$

Using this formula, the Spam Doctor will be able to compute the posterior probability that "a message is spam given the set of words that it contains", i.e. (P( H_{spam} |E_{1}E_{2}\cdot \cdot \cdot E_{n})).

Having said that, our Spam Doctor still lacks certain essential knowledge for it to function. Such knowledge includes:

The prior probability that a message is spam, i.e. $P( H_{spam} )$.
The prior probability that a message is ham, i.e. $P( H_{ham} )$.
The conditional probability of a word appearing in spam, i.e. $P(E_{n}|H_{spam})$.
The conditional probability of a word appearing in ham, i.e. $P(E_{n}|H_{ham})$.

Machine Learning

To acquire knowledge, we learn. The same goes for our Spam Doctor. It will acquire the knowledge by machine-learning from two training data sets comprising 1000 spam messages and 1000 ham messages each. The code for implementing machine learning for Spam Doctor is included in this Python program called spam_trainer.py.

Let's walk-through the stages of machine learning accompanied by related code snippets and/or screenshots where applicable.

Pre-process Training Data

Pre-process the training data by:

Removing noises like stop words, punctuation, and HTML tags from each message in the two training data sets; then
Tokenizing each message into a set of words where each word appears only once.

While removal of noises helps in improving the speed of processing, counting each word only once minimizes bias towards a particular hypothesis.

The function for pre-processing the training data is shown below:

def tokenize(dataset):

    # Convert all letters to lowercase
    temp1 = [ message.lower() for message in dataset ]
    # print(temp1[-1], end='\n\n')

    # Relegate each unwanted word to a whitespace
    temp2 = [ message.replace('<p>', ' ').replace('</p>', ' ').replace('<a href="https://', ' ').replace('">', ' ').replace('</a>', ' ') for message in temp1 ]

    # Break each message into tokens of word
    temp3 = [ word_tokenize(message) for message in temp2 ]

    # Remove duplicate tokens in each message
    temp4 = [ set(element) for element in temp3 ]
    # print(temp4[-1], end='\n\n')

    # Remove tokens of stop words and punctuation
    stopWords = set(stopwords.words('english'))
    stopWords.update(string.punctuation)

    finalDataset = []

    for tokenList in temp4:
        temp5 = []
        for token in tokenList:
            if token not in stopWords:
                temp5.append(token)
        finalDataset.append(temp5)
    
    return finalDataset

The following is one of the messages included in the spam training data set:

<p>But could then once pomp to nor that glee glorious of deigned. The vexed times childe none native. To he vast now in to sore nor flow and most fabled. The few tis to loved vexed and all yet yea childe. Fulness consecrate of it before his a a a that.</p><p>Mirthful and and pangs wrong. Objects isle with partings ancient made was are. Childe and gild of all had to and ofttimes made soon from to long youth way condole sore.</p>

After pre-processing, the above message is being tokenized into a set of words as shown below:

'partings', 'sore', 'childe', 'none', 'soon', 'isle', 'native', 'ofttimes', 'consecrate', 'mirthful', 'objects', 'glee', 'gild', 'condole', 'ancient', 'deigned', 'fulness', 'times', 'glorious', 'way', 'wrong', 'made', 'flow', 'vexed', 'pomp', 'loved', 'tis', 'could', 'yea', 'yet', 'long', 'youth', 'fabled', 'vast', 'pangs'

Compute Conditional Probabilities of Tokenized Words in Spam and Ham

Compute the conditional probability of each tokenized word occurring in spam training data set and ham training data set respectively:

$\begin{aligned} P(E_{n}|H_{spam}) = \frac{P(E_{n})\cap P(H_{spam} )}{P(H_{spam})}\\ \\ P(E_{n}|H_{ham}) = \frac{P(E_{n})\cap P(H_{ham} )}{P(H_{ham})} \end{aligned}$

The function for computing conditional probabilities of tokenized words contained in a training data set is shown below:

def tokenConditionalProbability(dataset):

    # Number of samples in dataset
    sampleSize = len(dataset)

    # Dictionary of token-probability pairs
    conditionalProbabilities = {}

    # Count probability of occurence of each token
    flatten = []
    flatten[len(flatten):] = [ token for sample in dataset for token in sample ]
    tokenCount = Counter(flatten)
    conditionalProbabilities = { key : value / sampleSize for key, value in tokenCount.items()}        

    return conditionalProbabilities

The following shows some of the tokenized words originated from the spam training data set and their associated conditional probabilities:

'objects': 0.26, 'made': 0.471, 'ancient': 0.263, 'sore': 0.466, 'fulness': 0.29, 'glorious': 0.265, 'native': 0.478, 'could': 0.481, 'partings': 0.287, 'consecrate': 0.287, 'loved': 0.616, 'tis': 0.471, 'deigned': 0.262, 'pangs': 0.27, 'vast': 0.276, 'times': 0.292, 'long': 0.621, 'mirthful': 0.297, 'youth': 0.286, 'wrong': 0.295, 'fabled': 0.289, 'condole': 0.286, 'soon': 0.311, 'isle': 0.293, 'pomp': 0.281, 'gild': 0.293, 'vexed': 0.253, 'ofttimes': 0.275, 'glee': 0.316, 'childe': 0.865, 'none': 0.69, 'flow': 0.252, 'way': 0.27, 'yea': 0.277, 'yet': 0.714, 'saw': 0.263, 'change': 0.274, 'would': 0.731, 'deem': 0.468, 'talethis': 0.285, 'old': 0.274, 'dome': 0.249, 'atonement': 0.267, 'spoiled': 0.276, 'things': 0.283, 'holy': 0.279, 'cell': 0.278, 'suffice': 0.284, 'mote': 0.497, 'vaunted': 0.309, 'noontide': 0.279, 'break': 0.28, 'days': 0.261, 'basked': 0.279, 'breast': 0.503, 'found': 0.282, 'adieu': 0.289, 'adversity': 0.282, 'love': 0.464, 'men': 0.301, 'prose': 0.274, 'strange': 0.269, 'said': 0.283

Compute Prior Probabilities of Hypotheses

With a total of 2000 messages in the sample space, when randomly selecting a message from this sample space, the prior probability of picking a spam message is the same as that of picking a ham message as shown:

$\begin{aligned} P(H_{spam})=\frac{1000}{2000}=0.5 \\ \\ P(H_{ham})=\frac{1000}{2000}=0.5 \end{aligned}$

Set a Threshold for Discerning between Spam and Ham

A message is deemed spam if the posterior probability for the spam hypothesis exceeds a threshold value, say 0.8, as expressed below:

$P( H_{spam} |E_{1}E_{2}\cdot \cdot \cdot E_{n}) > 0.8 \Rightarrow The\; message\; is\; spam$

It is advisable to set the threshold higher so as to minimize "false positive" where ham is being falsely identified as spam. The optimal threshold varies from person to person and may be arrived at by trial and error.

Saving the Learned Knowledge

At the end of the training, our Spam Doctor will have acquired the knowledge necessary to implement the Bayes' rule algorithm. You should save the learned knowledge, i.e. conditional probabilities of words in spam, conditional probabilities of words in ham, prior probabilities of hypotheses, and the threshold for distinguishing between spam and ham, to some computer storage for reuse in testing and production. They form the knowledge base of the Spam Doctor.

Testing 1, 2, 3

Spam Doctor has acquired an algorithm and learned the essential knowledge to distinguish between spam and ham. However, we do not know how well it can do the work? Let's put it through a test.

The test is done by feeding the Spam Doctor on a mix of 43 spam and 57 ham messages from a test data set. The stages of conducting the test is as follows:

Remove noises like stop words, punctuation, and HTML tags from each message in the test data set.
Tokenize each message in the test data set into a set of words where each word appears only once.
Calculate the posterior probability for the spam hypothesis of each message given the set of tokenized words it contained, using the conditional probabilities of those words and prior probabilities of hypotheses from the knowledge base. Any tokenized words not found in the knowledge base is given a conditional probability of 0.01 instead of 0. This is to minimize bias towards a particular hypothesis.
If the posterior probability for the spam hypothesis of a message exceeds a threshold value, say 0.8, it is diagnosed as spam, or else ham.
At the end of the test, compare the outcome of each message tested by Spam Doctor against its expected outcome in the test data set. For example, instead of getting the expected outcome as spam from the test, a spam message may be wrongly diagnosed as ham in the test.

The code for implementing the test is included in spam_trainer.py.

Specifically, the function for computing the posterior probability for the spam hypothesis of each message in the test data set is shown below:

def spamPosteriorProbability(tokenList):
    spamTokenConditionalProbability = 1
    hamTokenConditionalProbability = 1
    for token in tokenList:
            
        if token not in spamTokensConditionalProbabilities:
            spamTokenConditionalProbability *= 0.01 # To minimize false positive
        else:
            spamTokenConditionalProbability *= spamTokensConditionalProbabilities[token]
            
        if token not in hamTokensConditionalProbabilities:
            hamTokenConditionalProbability *= 0.01 # To mininize false negative
        else:
            hamTokenConditionalProbability *= hamTokensConditionalProbabilities[token]    

    return spamTokenConditionalProbability * spamPriorProbability / (spamTokenConditionalProbability * spamPriorProbability + hamTokenConditionalProbability * hamPriorProbability)

We can assess the performance of Spam Doctor in the test using a confusion matrix as shown in Figure 2:

Figure 2: Confusion Matrix

True Positive is the number of expected spam being successfully identified as spam by Spam Doctor in the test. It is the intersection between Expected Spam and Tested Spam in Figure 2. It shows 43 as a result of the test.
False Negative is the number of expected spam being falsely identified as ham by Spam Doctor in the test. It is the intersection between Expected Spam and Tested Ham in Figure 2. It shows 0 as a result of the test.
False Positive is the number of expected ham being falsely identified as spam by Spam Doctor in the test. It is the intersection between Expected Ham and Tested Spam in Figure 2. It shows 0 as a result of the test.
True Negative is the number of expected ham being successfully identified as ham by Spam Doctor in the test. It is the intersection between Expected Ham and Tested Ham in Figure 2. It shows 57 as a result of the test.
Accuracy refers to the overall correctness of the test and is expressed as:
$(True\;Positive + True\;Negative) \div Total = (43 + 57) \div 100=1$

Precision refers to the correctness of a test outcome and is expressed as:
$True\;Positive \div (True\;Positive + False\;Positive) = 43 \div (43 + 0) = 1$

The function to compute the various measurements of the confusion matrix is shown below:

for data, spamPosteriorProbability in zip(testSet, spamPosteriorProbability):
    expected = data.split(',')[0]
    if expected == 'Spam':
        if spamPosteriorProbability > threshold:
            truePositive += 1
        else:
            falseNegative += 1
    elif expected == 'Ham':
        if spamPosteriorProbability > threshold:
            falsePositive += 1
        else:
            trueNegative += 1

The confusion matrix shows that our Spam Doctor has passed the test with flying color! :thumbsup:

Putting to Good Use

Now that we have a competent AI agent the Spam Doctor, why not put it to good use? As an example, I have written a simple GUI application called spam_doctor.py in Python that incorporates the intelligence and knowledge of the Spam Doctor. When launched, this application presents a text box for users to paste a message into it and a button to submit the message to the Spam Doctor engine for diagnosis. The result of the diagnosis is then announced in a popup box. Check it out yourself using messages from the test data set or catch a glimpse of the Spam Doctor in action as animated in Figure 3.

Figure 3: Spam Doctor at Work

Conclusion

Using spam filtering project as a pretext, this article has walked you through the fundamentals of design and implementation of a simple AI agent and machine learning.

Like human beings, AI agents learn the knowledge of problem solving by applying appropriate algorithms over historical data. A good AI agent should then be able to generalize from the learned knowledge to solve new problems that it has not encountered during the training phase.

Owing to the trade-off from generalization, an AI agent, like human beings, is unlikely to achieve perfect scores for accuracy and precision like those achieved by our Spam Doctor. To really prepare the Spam Doctor to face the real world, you must train it with plenty of real-world spam and ham messages, then test it with plenty of real-world spam and ham messages that it has not encountered during training. After each test phase, use the confusion matrix or some other appropriate tools to measure its performance based on the test. Tune some parameters such as the threshold value, re-train, and re-test it over and over again until a favorable performance is achieved. For that, I shall leave it in your good hands.

Last but not least, even if an AI agent makes a mistake, it should learn from the mistake by including the failed experience in its knowledge base so that it will not repeat the same mistake in the future. Isn't this applicable to us human beings as well?