
The Not-so-classic Spam Filter

5 Mar 2018 · CPOL · 9 min read
The Machine Learning and Artificial Intelligence Challenge

Introduction

Artificial Intelligence is a popular and growing subject on the palette of the developer community. People have been talking about it since the 14th century (at least), but modern computers and their ever-growing power gave the subject a real boost. Most of us (a majority, I would say) think it is a subject we do not need in every-day work, and therefore never look into it. We also have a lot of misleading definitions of what AI is and what it is good for, which adds to the rejection. I will try to clear some clouds and create a very simple spam filter solution, to show how AI can be part of even the most basic software.

Background

A small overview of different spam filtering methods we can see today...

Human Selection Based

This category contains filtering methods that involve some human intervention to decide if something is spam or ham...

It includes community based filtering, where the responses of the surrounding community may or may not mark something as spam. We can see it here on CP... When a certain number of members vote something or someone as spam, it (or they) will be removed by the site...

The other possibility is to block spam before delivery by forcing the sender to complete a challenge that only humans can solve. This is useful because most spam systems are machine based to enable mass spreading, but at the same time it can be annoying, as it slows down delivery... The best-known examples of such systems are CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart) and OTP (One Time Password).

List Based

Black/White list - a list that holds certain senders (based on names, IP addresses, domains and more) that should be blocked/allowed. For instance, Gmail's 'Report spam' button actually puts the sender on a black-list.

Gray listing is a modern approach that tries to build the black/white list on the fly. The idea is that a spam machine will try to send a mail only once and will not trouble itself to check responses from the recipient. In the gray-list technique, the server automatically blocks incoming mail from an unknown source (known sources are defined on a white-list) and sends back a temporary failure message. Most legitimate mail servers will automatically try to re-send the message (mostly several times, to exclude the possibility of network failure); if that happens, the filter extends the original white-list with the new sender. A minimal sketch of this idea is shown below.
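The article itself contains no code for gray listing, but assuming an in-memory white-list (the GrayListFilter class and its ShouldDefer method are hypothetical names of my own), the idea might be sketched like this:

C#
// Hypothetical sketch of gray listing - not part of the article's attached code.
// A first attempt from an unknown sender is deferred; a retry after a minimal
// delay promotes the sender to the white-list.
public class GrayListFilter
{
    private readonly HashSet<string> _WhiteList = new HashSet<string>( );
    private readonly Dictionary<string, DateTime> _FirstSeen = new Dictionary<string, DateTime>( );
    private readonly TimeSpan _MinRetryDelay = TimeSpan.FromMinutes( 5 );

    // Returns true when the mail should be deferred (temporary failure sent back)
    public bool ShouldDefer ( string Sender )
    {
        if ( _WhiteList.Contains( Sender ) )
            return ( false );

        DateTime nFirst;
        if ( !_FirstSeen.TryGetValue( Sender, out nFirst ) )
        {
            _FirstSeen[Sender] = DateTime.UtcNow;    // first attempt - defer it
            return ( true );
        }

        if ( DateTime.UtcNow - nFirst < _MinRetryDelay )
            return ( true );                         // retried too fast - defer again

        _WhiteList.Add( Sender );                    // legitimate retry - accept from now on
        return ( false );
    }
}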

Content Based

This kind of filter works on every message on its own - in contrast with the list based approach. The idea is to check the content (text and attachments) and compare it with previously known-to-be-spam mails. To make the identification more intelligent, the comparison can be based on an evolving collection of mails that gets updated with every identification. Another option to improve content based spam filters is to use more sophisticated text analyzing methods, namely moving from words to phrases and from text analytics to text analysis (from counting to understanding)...

Of Artificial Intelligence

I think it is very important to understand the different types of artificial intelligence just before we try and build some code for it... While there can be a lot of ways to categorize AI, I would like to present my four steps, which I find very easy to understand...

1. Strong Machine (But Not Really Intelligent)

This kind of software is not really AI (IMHO), as it does not fulfill the key requirement: using the outcome of past decisions. The most interesting examples of this kind are the modern chess engines. They are so powerful that they can beat any human, but they are not really intelligent, as they cannot reuse the history of the game they are currently playing, only adapt to pre-defined rule sets (opening and closing books, and statistics from previous games). So it has the ability to adapt, but not to evolve; still, we cannot ignore the power it represents...

2. The Machine that has a Past

This kind differs from the previous one, as it can refine decisions based not only on pre-defined data, but also on its 'personal' experience. For instance, a self driving car can 'know' from pre-defined data that the speed limit on the current road is 80 km/h, but will refine that via observation of the surrounding cars and their speed, and may drive at only 65 km/h.

The spam filter we try to build here is another example of learning-on-the-go, as it should be able to redefine what spam is based on incoming data...

3. The Mindful Machine

This kind of machine (which at this point is largely hypothetical) is not only able to respond to external properties of the surrounding world, but can consider much more complicated social behaviors, and adapt to them...

The three-law robots created by Asimov, with their overprotective nature, are a good example of this kind of machine...

4. Skynet

We all talk about this and fear it, but fear no more - this is a step even further from the previous - and still hypothetical - kind, and we may still stop it somewhere between 3 and 4...

This kind of machine is actually totally human. It can override any predefined behavior for its own good, just in the same way someone kills...

Solution

The solution has two parts:

  • Learning - This part can take an already classified (spam or ham) message and add its data (after analyzing it) to the fact-base. This is the training part.
  • Classifying - This part can take a new message and compare its data to the fact-base. After classifying the message, it will be passed to the learning part to improve the fact-base.

Both parts share some functionality that takes some text and breaks it apart to create the data to add/compare to the fact-base. This analysis is based on words.

The main idea is to give every word a weight according to its occurrences in spam and ham messages. The current weight of each word is stored in the fact-base and updated with every classification done; the sketch below shows what the stored record might look like.
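The BaseFact record and the MessageType enumeration are used by all the methods below, but ship only with the attached sources; a minimal sketch consistent with their usage in this article (the attachment may differ in details) could be:

C#
// Sketch of the supporting types, reconstructed from their usage in the
// article; the attached sources may differ in details.
public enum MessageType { Ham, Spam }

public struct BaseFact
{
    public decimal Spam;    // occurrences of the word in spam messages
    public decimal Ham;     // occurrences of the word in ham messages

    // A word is 'transparent' (insignificant) when its spam and ham counts
    // are nearly balanced: |Spam - Ham| / (Spam + Ham) < 0.5
    public bool IsTransparent
    {
        get
        {
            decimal nTotal = Spam + Ham;
            return ( nTotal > 0 && Math.Abs( Spam - Ham ) / nTotal < 0.5m );
        }
    }
}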

The Core - Text Analyzing

C#
private string[ ] Message2Words ( string Message )
{
    string szMessage = Message.ToLower( );
    // Collect the punctuation characters actually present in this message
    List<char> oPunctuations = szMessage.Where( Char.IsPunctuation ).Distinct( ).ToList( );

    // '<' and '>' are symbols, not punctuation, so add them (and space) explicitly
    oPunctuations.AddRange( new char[ ] { ' ', '<', '>' } );

    // Split on all separators and drop tokens of two characters or less
    string[ ] oWords = szMessage.Split( oPunctuations.ToArray( ) ).Where
                                ( szWord => szWord.Length > 2 ).ToArray( );

    // Drop the insignificant words with a simple filter; Except( ) would be
    // shorter, but as a set operation it would also remove the duplicates
    // we explicitly want to keep
    return ( oWords.Where( szWord => !_TransparentList.Contains( szWord ) ).ToArray( ) );
}

This method is responsible for splitting the message into words and removing those words that are found to be insignificant for classifying a message. The list of insignificant words (_TransparentList) is initialized with only a few words (and that is one point for improvement), and is updated at a constant frequency or manually.

Other things to mention:

  • Adding '<' and '>' to the punctuation chars to break around HTML tags, but without removing content like link addresses, which definitely can be used to identify spam.
  • Removing words shorter than three characters - it is mostly a shortcut to avoid blowing up the list of insignificant words.
  • I do not remove duplicate words. The number of occurrences is very important for the classification later, so we should count the words exactly as they appear. Take for instance the word 'love': it can be part of a spam and of a ham message too, however actual counting of the training data shows that it is much more frequent in spam than in ham. A distinct counting may give the wrong identification for a message with such a word. A short trace of the method is shown below.
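To illustrate, here is a short trace of the method (assuming, for the sake of the example, that the transparent list is still empty):

C#
// Illustration only - assumes an empty transparent list.
// Input : "Free money!!! Get your free money now."
//   ToLower( )  -> "free money!!! get your free money now."
//   Punctuation found in the text: '!' and '.'
//   Split on { '!', '.', ' ', '<', '>' } and keep words longer than two chars:
//   -> { "free", "money", "get", "your", "free", "money", "now" }
// Note that the duplicates ("free" and "money") are kept on purpose.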

The Learn Method

C#
public void Learn ( string Message, MessageType Type, string[ ] List = null )
{
    // Reuse a pre-computed word list if the caller supplies one
    string[ ] szList = List ?? Message2Words( Message );

    foreach ( string szWord in szList )
    {
        // Fetch the existing counters for the word, or start fresh ones
        BaseFact oValue = _BaseFacts.ContainsKey( szWord ) ? _BaseFacts[szWord] : new BaseFact( );

        // Advance the counter matching the message type
        if ( Type == MessageType.Spam )
            oValue.Spam++;
        else
            oValue.Ham++;

        // Store the copy back (necessary if BaseFact is a value type, as sketched above)
        _BaseFacts[szWord] = oValue;
    }
}

This method is fairly simple - all it does is record the number of occurrences of a specific word in messages of the specified type... In other words, if a word occurs in a message flagged as spam, its spam counter advances; if it occurs in a ham message, its ham counter advances.

The Classify Method

C#
public MessageType Classify ( string Message )
{
    string[ ] szList = Message2Words( Message );

    // Pick the known facts for the words that appear in this message
    // (each distinct word votes once, with its accumulated counters)
    Dictionary<string, BaseFact> oList = _BaseFacts.Where
    ( oFact => szList.Contains( oFact.Key ) ).ToDictionary( oItem => oItem.Key, oItem => oItem.Value );

    // Sum the spam and ham counters of those words; more spam votes means spam
    MessageType eType = oList.Sum( oItem => oItem.Value.Spam ) > oList.Sum
                 ( oItem => oItem.Value.Ham ) ? MessageType.Spam : MessageType.Ham;

    // Feed the decision back into the fact-base (learning-on-the-go)
    Learn( Message, eType, szList );

    // Periodically re-compute the list of insignificant words
    if ( ++_Trigger > _Max )
    {
        _Trigger = 0;
        Transparent( );
    }

    return ( eType );
}

This method summarizes the spam and ham votes for a new message, based on the counters of its words, and decides by the result how to classify it. Just before returning the decision to the caller, it updates the fact-base with the data from the new message. It also initiates an update of the list of insignificant words at a pre-defined frequency (currently after every 1,000 new messages). A small usage sketch is shown below.
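Putting the pieces together, training and classification could be driven roughly like this (the SpamFilter class name and the sample messages are hypothetical; the actual feeding code is in the attachment):

C#
// Hypothetical driver - the real feeding/testing code is in the attachment.
SpamFilter oFilter = new SpamFilter( );

// Training: feed already classified messages
oFilter.Learn( "Win free money now, click here", MessageType.Spam );
oFilter.Learn( "Meeting moved to Tuesday, please confirm", MessageType.Ham );

// Classifying: the decision is returned and also learned from
MessageType eType = oFilter.Classify( "Click here for free money" );
Console.WriteLine( eType );    // prints: Spam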

Transparent List

C#
public void Transparent ( )
{
    // Collect the words whose spam/ham counts are nearly balanced...
    _TransparentList.AddRange( _BaseFacts.Where( oItem => oItem.Value.IsTransparent )
                                         .Select( oItem => oItem.Key ) );
    // ...and drop them from the fact-base
    _BaseFacts = _BaseFacts.Where( oItem => !oItem.Value.IsTransparent ).ToDictionary
                                 ( oItem => oItem.Key, oItem => oItem.Value );
}

When called, this method moves the insignificant words from the fact-base to the transparent list. The significance of a word is computed as \(\frac{\left|Spam - Ham\right|}{Spam + Ham}\); if the result is less than a pre-defined significance value (currently 0.5), the word is moved. For example, a word counted 10 times in spam and 8 times in ham scores \(\frac{|10 - 8|}{18} \approx 0.11\) and is moved, while a word counted 10 times in spam and once in ham scores \(\frac{9}{11} \approx 0.82\) and stays.

Supporting Code

There is, of course, some supporting code (see the attachment) that is used to feed the learning data to the spam filter and later to check the test data.

Does It Work?

Yes! No! Depends!

Yes! It does pass the test with the supplied test data (after training with the training data), with 100% success.

No! It mostly fails with real life spam mails, but mostly succeeds with real life ham mails... The reason is actually very clear: the training data supplied (probably on purpose) is some gibberish and does not even resemble spam (not as I experience spam, at least), but it is consistent with the test data, so the code learned to identify one specific kind of spam only.

Depends! It totally depends on the training data. The larger and broader that data is, the wiser the software becomes. The interesting part is that feeding enough training data of a different kind can totally change/extend the way the software classifies spam, and that is exactly what we want from advanced AI.

Summary

I'm sure that most of you are saying right now: 'That's all?!', and by all means you are right. In my solution, I do not even introduce any sophisticated statistical computation (like the well known Naive Bayes filter, which is very popular in spam filters), but use a low-grade counting algorithm.
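For contrast only - this is not part of the solution above - a Naive Bayes style score could be computed over the same counters roughly like this (a simplified sketch; the TotalSpam and TotalHam parameters, meaning the total word counts seen per class, and the add-one smoothing are my assumptions):

C#
// Contrast sketch: a (very) simplified Naive Bayes score over the same
// counters. Not part of the article's solution; the +1 terms are add-one
// smoothing to avoid taking the logarithm of zero.
private MessageType ClassifyBayes ( string[ ] Words, decimal TotalSpam, decimal TotalHam )
{
    double nSpamScore = 0, nHamScore = 0;

    foreach ( string szWord in Words )
    {
        BaseFact oFact = _BaseFacts.ContainsKey( szWord ) ? _BaseFacts[szWord] : new BaseFact( );

        // Sum log-probabilities instead of multiplying raw probabilities
        nSpamScore += Math.Log( (double)( ( oFact.Spam + 1 ) / ( TotalSpam + 1 ) ) );
        nHamScore  += Math.Log( (double)( ( oFact.Ham + 1 ) / ( TotalHam + 1 ) ) );
    }

    return ( nSpamScore > nHamScore ? MessageType.Spam : MessageType.Ham );
}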

The reason is that I had a single purpose: to show you how easily one can get AI to work for one's benefit. To demystify the term AI. To show that not only talking humanoid robots are AI. That it does not even start with self-driving cars. That there is a whole new world even before that, and you can use it to your own advantage.

Today we have locks, toasters, hearing aids and so on that represent the lower (and practical) end of AI usage - try and be part of it!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)

