Brute Force Finds Spam

raddevus


Feb 6, 2018

CPOL


Entry in the Artificial Intelligence and Machine Learning Contest. Here's how I learned / guessed how to find spam.

Download project (C# Console App) - 1.2 MB

Introduction

I know nothing about AI or Machine Learning. However, when I saw the data file associated with the contest, I became very interested in how I might determine which of the lines of test data were spam.

I decided to forgo all research on the subject and just brute-force my way into the thing for learning's sake.

Fewer Than 100 Lines of C#

I was quite excited to see that I could defeat the problem with fewer than 100 lines of C# code.

One Class to Rule Them All

I have wrapped all of the code to do the work in one small class which I have named LanguageLearner.

From the Contest

I've named it that because it will learn which words should indicate Spam (bad) and which words should indicate Ham (good).

If you've examined the data file provided by CP's contest (The Machine Learning and Artificial Intelligence Challenge), you'll know exactly what that means. I've also included that data file (SpamDetectionData.txt) in the project download at the top of this article.

Here's all the code for my class. All of the work is done in this class. I explain how to use it and specifically what it does below.

using System;
using System.Collections.Generic;

class LanguageLearner {
    public HashSet<String> SpamWords {get; private set;}
    public HashSet<String> HamWords {get; private set;}
    private string filePath;
    private String currentLine;
    private System.IO.StreamReader spamDataFile;
    private bool displayLines;
    public List<String> AllTestData {get; private set;}
    private string LastLineIndicatorText;
    public int hamCounter {get; private set;}
    public int spamCounter {get; private set;}
    
    public LanguageLearner(string filePath, bool displayLines = false){
        this.filePath = filePath;
        this.displayLines = displayLines;
        spamDataFile = new System.IO.StreamReader(filePath);
        SpamWords = new HashSet<String>();
        HamWords = new HashSet<String>();
        LastLineIndicatorText = "# Ham training data";
        Learn(true);
        LastLineIndicatorText = "# Test data";
        Learn(false);
        LoadTestData();
    }
    
    private void Learn(bool isLearningSpam){
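        // I know the first line of the file is garbage, so I read it and throw it away.
        // On the second call, the section marker has already been consumed by the
        // previous loop, so the line discarded here is the one that follows the marker.
        currentLine = spamDataFile.ReadLine();
        currentLine = spamDataFile.ReadLine();
        
        while (currentLine != null && currentLine != LastLineIndicatorText){
            // Trim(char[]) strips any of the given characters from both ends of the
            // line, not the literal "Spam," or "Ham," prefix, so leading or trailing
            // letters such as 'a', 'm' or 'p' can be removed as well.
            var localS = currentLine.Trim("Spam,".ToCharArray()).Trim("Ham,".ToCharArray());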
            var words = localS.Split(new char[]{' '});
            foreach (String word in words){
                if (isLearningSpam){
                    SpamWords.Add(word.TrimEnd('.'));
                }
                else{
                    HamWords.Add(word.TrimEnd('.'));
                }
            }
    
            //read next line in
            currentLine = spamDataFile.ReadLine();
        }
    }

    private void LoadTestData(){
        AllTestData = new List<String>();
        currentLine = spamDataFile.ReadLine();
        
        while (currentLine != null){
            AllTestData.Add(currentLine);
            currentLine = spamDataFile.ReadLine();
        }
    }
    
    public bool IsItSpam(string data){
        var dataWords = data.Split(' ');
        hamCounter = 0;
        spamCounter = 0;
        foreach (String token in dataWords){
            if (SpamWords.Contains(token)){ spamCounter++;}
            if (HamWords.Contains(token)){ hamCounter++;}
        }
        if (spamCounter >= hamCounter){
            return true;
        }
        return false;
    }
}

Code Explanation

The LanguageLearner class is very easy to use.

All you have to do is construct a LanguageLearner by passing the path to the SpamDetectionData.txt file to it.

LanguageLearner ll = new LanguageLearner(@"c:\users\<username>\SpamDetectionData.txt");

When you do that and the constructor is called, it will automatically parse through the file for you.

Here are the steps of what it does:

Sets up two HashSets: one for Spam words and another for Ham (safe) words
Calls the Learn() method to train itself on Spam words using the first N lines (marked Spam) in SpamDetectionData.txt
Calls the same Learn() method again to train itself using the next N lines (marked Ham) to learn which words are considered safe
Calls LoadTestData() to load the remaining lines of SpamDetectionData.txt, which are variably marked as Spam or Ham. I load this data into a List<String> so I can later iterate through it and let my program determine whether a line is Spam or Ham.
Once all of that is done, we can call the IsItSpam() method, which will attempt to determine whether a line of data is Spam or Ham.
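The steps above can be sketched in miniature. The following is not the LanguageLearner class itself: it is a restructured single-pass version that runs on a made-up, in-memory sample instead of the real file. The marker lines match the ones the constructor looks for, while the sample sentences and the names `sampleLines`, `section`, `spamWords`, `hamWords` and `testLines` are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical miniature of the data file. The marker lines mirror the ones
// the constructor looks for; the sentences are invented for illustration.
string[] sampleLines = {
    "# Spam training data",
    "Spam,win cash now",
    "Spam,cheap pills now",
    "# Ham training data",
    "Ham,lunch at noon",
    "Ham,see you at lunch",
    "# Test data",
    "Spam,win cheap cash",
    "Ham,lunch now"
};

var spamWords = new HashSet<string>();
var hamWords = new HashSet<string>();
var testLines = new List<string>();

// 0 = before the first marker, 1 = spam training, 2 = ham training, 3 = test data
int section = 0;

foreach (string line in sampleLines) {
    if (line == "# Spam training data") { section = 1; continue; }
    if (line == "# Ham training data") { section = 2; continue; }
    if (line == "# Test data") { section = 3; continue; }

    if (section == 3) { testLines.Add(line); continue; }
    if (section == 0) { continue; }

    // Drop everything up to and including the first comma (the label).
    string text = line.Substring(line.IndexOf(',') + 1);
    foreach (string word in text.Split(' ')) {
        if (section == 1) { spamWords.Add(word.TrimEnd('.')); }
        else { hamWords.Add(word.TrimEnd('.')); }
    }
}

Console.WriteLine("Spam words: {0}", spamWords.Count);   // prints "Spam words: 5"
Console.WriteLine("Ham words: {0}", hamWords.Count);     // prints "Ham words: 5"
Console.WriteLine("Test lines: {0}", testLines.Count);   // prints "Test lines: 2"
```

Because both collections are sets, a word that appears in several training lines (such as "now" in the sample) is stored only once.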

Using the Code to Test the Data

Here's how you use the code to test the data. It's very simple.

First of all, I print some of the statistics out from what the code has done.

Console.WriteLine("Found {0} words.",ll.SpamWords.Count);
Console.WriteLine("Found {0} words.",ll.HamWords.Count);
Console.WriteLine(ll.AllTestData.Count);

That yields the following (as seen here running in LINQPad):

The LanguageLearner found 3374 (spam) words and it found 3419 (ham) words.

It also read in 100 lines of test data.

IsItSpam() Method: Brute-Force

Now, we will try our IsItSpam() method on the test data.

I set it up so that each line of test data can be sent straight into our IsItSpam() method.

foreach (String testData in ll.AllTestData){
    Console.Write("{0} : ",testData.Substring(0,4));
    Console.Write(ll.IsItSpam(testData.Trim("Spam,".ToCharArray()).Trim("Ham,".ToCharArray())));
    Console.WriteLine("\tSTATS : Ham weight = {0} Spam weight = {1}", ll.hamCounter, ll.spamCounter);
}

Each item in AllTestData is simply a string holding one line of the test data from the file SpamDetectionData.txt.
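One detail of the loop above is worth seeing in isolation: Trim(char[]) treats its argument as a set of characters to strip from both ends of the string, not as a literal prefix. The sketch below compares that call with a hypothetical helper, `StripLabel`, which is not part of LanguageLearner and only removes a leading "Spam," or "Ham,". The sample line is made up.

```csharp
using System;

// Hypothetical helper (not part of LanguageLearner): remove a leading
// "Spam," or "Ham," label and leave the rest of the line untouched.
string StripLabel(string line) {
    if (line.StartsWith("Spam,", StringComparison.Ordinal)) { return line.Substring(5); }
    if (line.StartsWith("Ham,", StringComparison.Ordinal)) { return line.Substring(4); }
    return line;
}

string sample = "Spam,apples and a map";   // made-up line for illustration

// Same call chain as in the article: every leading or trailing character
// found in the set {S, p, a, m, ','} is removed, then the set {H, a, m, ','}.
string trimmed = sample.Trim("Spam,".ToCharArray()).Trim("Ham,".ToCharArray());
string stripped = StripLabel(sample);

Console.WriteLine("[{0}]", trimmed);    // prints "[les and a ]"
Console.WriteLine("[{0}]", stripped);   // prints "[apples and a map]"
```

With the character-set version, the first and last words of a line can lose letters, which is one source of the dirty data discussed later in the article.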

Spam or Ham

The creators of the test data have preceded each line with either:

Spam,
Ham,

This is so we can determine if our test has been successful or not.

My first call to Console.Write simply writes out the first four characters of that prefix ("Spam" or "Ham,") and a colon (:) so you can see it on screen.

Next we Trim the Spam, or Ham, off the string and pass the actual data into the IsItSpam() method.

That method will return TRUE when it is Spam and FALSE when it is Ham.

If everything goes properly, we should see lines of output which look like:

Ham, : False
Spam : True

If every line comes back that way, then we have identified all of the spam with no false positives, and we have brute-forced our way there.
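Rather than checking the output by eye, the comparison can be tallied in code. This is a minimal sketch under stated assumptions: `Classify` is a stand-in that flags any text containing "cash" (in the real program it would be ll.IsItSpam), and the labelled lines are invented so that each outcome occurs at least once.

```csharp
using System;
using System.Collections.Generic;

// Stand-in classifier for illustration only; in the article's program this
// would be ll.IsItSpam(...). Here a line counts as spam if it contains "cash".
bool Classify(string text) { return text.Contains("cash"); }

// Made-up labelled lines in the same "Label,text" shape as the test data.
var labelled = new List<string> {
    "Spam,win cash now",
    "Ham,lunch at noon",
    "Ham,bring cash for lunch",
    "Spam,cheap pills"
};

int correct = 0, falsePositives = 0, falseNegatives = 0;
foreach (string line in labelled) {
    bool expectedSpam = line.StartsWith("Spam,", StringComparison.Ordinal);
    string text = line.Substring(line.IndexOf(',') + 1);
    bool predictedSpam = Classify(text);

    if (predictedSpam == expectedSpam) { correct++; }
    else if (predictedSpam) { falsePositives++; }   // ham flagged as spam
    else { falseNegatives++; }                      // spam that slipped through
}

// prints "2 of 4 correct, 1 false positives, 1 false negatives"
Console.WriteLine("{0} of {1} correct, {2} false positives, {3} false negatives",
    correct, labelled.Count, falsePositives, falseNegatives);
```

A 100% result corresponds to `correct` equalling the number of test lines, with both error counters at zero.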

What Does IsItSpam() Do?

Here is how easy it was to brute-force the data using the IsItSpam method.

Here's the method again:

    public bool IsItSpam(string data){
        var dataWords = data.Split(' ');
        hamCounter = 0;
        spamCounter = 0;
        foreach (String token in dataWords){
            if (SpamWords.Contains(token)){ spamCounter++;}
            if (HamWords.Contains(token)){ hamCounter++;}
        }
        if (spamCounter >= hamCounter){
            return true;
        }
        return false;
    }

I simply set up one counter for ham words and one counter for spam words. Then, when the line of data is passed in, I check each word for existence in the associated word list (SpamWords and HamWords).

If the word is found in the list, then the associated counter is incremented.

Finally, if spamCounter is greater than or equal to hamCounter (since that would be a lot of spam words), then I consider it spam and return true.

Otherwise, we return false.
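To make the counting rule concrete, here is a small worked example. The two word lists are tiny, made-up stand-ins for the learned SpamWords and HamWords, and `Score` is a hypothetical local function that applies the same rule as IsItSpam() while handing back both counters.

```csharp
using System;
using System.Collections.Generic;

// Tiny made-up word lists standing in for the learned SpamWords and HamWords.
var spamWords = new HashSet<string> { "win", "cash", "now", "lunch" };
var hamWords = new HashSet<string> { "lunch", "at", "noon", "now" };

// Same counting rule as IsItSpam(): a word can add to both counters,
// and a tie is treated as spam.
bool Score(string data, out int spam, out int ham) {
    spam = 0;
    ham = 0;
    foreach (string token in data.Split(' ')) {
        if (spamWords.Contains(token)) { spam++; }
        if (hamWords.Contains(token)) { ham++; }
    }
    return spam >= ham;
}

int s, h;
bool first = Score("win cash now", out s, out h);
Console.WriteLine("{0} (spam {1}, ham {2})", first, s, h);    // prints "True (spam 3, ham 1)"

bool second = Score("lunch at noon", out s, out h);
Console.WriteLine("{0} (spam {1}, ham {2})", second, s, h);   // prints "False (spam 1, ham 3)"

bool third = Score("hello there", out s, out h);
Console.WriteLine("{0} (spam {1}, ham {2})", third, s, h);    // prints "True (spam 0, ham 0)"
```

The last call shows a consequence of using greater-than-or-equal: text made up entirely of unknown words scores 0 to 0 and is therefore reported as spam.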

Conclusion: 100% Success

I am glad to say that this brute-force method written by someone who is entirely untrained in AI or Machine-Learning is able to determine the correct Spam or Ham on 100% of the data.

This was a lot of fun and I hope it shows how it sometimes pays to "just write the darn thing". :)

Here's a snippet of my first few lines of output after the data was run in LINQPad (http://linqpad.net).

Dirty Data (Update 1)

If you examine the code more closely, you will see that the learned data (words in my two HashSets) is really quite dirty. I did a very quick algorithm which simply splits the words on spaces and that isn't entirely correct. That's part of the reason I was so shocked (and fascinated) that the algorithm was good enough to get 100% of the test data correct.

This has really got me thinking about how you can create these two big lists of words and then get a fairly good idea of whether or not a piece of text is spam. This has really captured my imagination, and I appreciate the CP editors for posing this challenge. Great stuff.
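For comparison, here is one way the word splitting could be tidied up. This is a sketch of an alternative, not what the article's code does: `Tokenize` is a hypothetical function that splits on any whitespace, strips an assumed set of surrounding punctuation marks, lower-cases each word, and drops empty entries.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tokenizer, not the one used in the article: split on any
// whitespace, strip surrounding punctuation, lower-case, and drop empties.
List<string> Tokenize(string text) {
    var tokens = new List<string>();
    char[] separators = { ' ', '\t', '\r', '\n' };
    foreach (string raw in text.Split(separators, StringSplitOptions.RemoveEmptyEntries)) {
        string word = raw.Trim('.', ',', '!', '?', ';', ':', '"', '(', ')').ToLowerInvariant();
        if (word.Length > 0) { tokens.Add(word); }
    }
    return tokens;
}

string sample = "Win  CASH now, friend!";   // made-up text; note the double space

// Splitting on a single space keeps the empty entry and the punctuation.
Console.WriteLine(string.Join("|", sample.Split(' ')));   // prints "Win||CASH|now,|friend!"
Console.WriteLine(string.Join("|", Tokenize(sample)));    // prints "win|cash|now|friend"
```

Cleaner tokens would change the contents of both word lists, so the counts reported earlier in the article would not necessarily stay the same.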

Ham & Spam Weight (Update 2)

I was curious about how close the ham and spam counts might be; however, the code didn't previously provide an easy way to get those numbers. I wondered whether the values would be extremely close or quite far apart, so I made a very small change to the LanguageLearner class to expose the hamCounter and spamCounter values so we can read them each time the IsItSpam() method runs.

After that, I simply added the code in the Program's Main() method to display that information when the program runs.

The output looks like the following, and now you can examine how close the two weights are for each test line. Again, I'm startled that the brute-force method and dirty data produce values that are so far apart. I thought there might be some closer values due to my sloppy code. :)

Additional Conclusion

Now that I've worked through this first puzzle and I better understand the challenges, I feel like I could go out and read some theories about machine-learning and AI and understand them much better.

Also, before I worked through this one, I felt like there was no way I'd be able to figure out the 2nd language-based challenge, but now I feel like I could at least take a shot at it.

Note on Using the Code

Please make sure you change the path so that it is pointing to the location of your SpamDetectionData.txt file.

If you don't, the app will crash, because the StreamReader constructor throws an exception when it cannot find the file.
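A simple guard avoids that crash. The sketch below is not part of the downloadable project; the path is a hypothetical one that is assumed not to exist, so the example takes the "missing" branch.

```csharp
using System;
using System.IO;

// Hypothetical path for illustration; substitute the real location of the file.
string path = Path.Combine(Path.GetTempPath(), "SpamDetectionData-missing-example.txt");

if (!File.Exists(path)) {
    // Report the problem instead of letting the StreamReader constructor throw.
    Console.WriteLine("Cannot find data file: {0}", path);
} else {
    // Safe to construct the learner here, e.g. new LanguageLearner(path).
    Console.WriteLine("Data file found: {0}", path);
}
```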

History

2018-02-06 evening: Added ability to display Ham and Spam weights to the program and added section to article explaining ham / spam weights. Also updated all relevant code snippets.
2018-02-06 later: Edited to add information about the use of dirty data
2018-02-06: First publication for entry in contest