NLTK and Machine Learning for Sentiment Analysis

29 May 2020
This article is the fifth in the Sentiment Analysis series that uses Python and the open-source Natural Language Toolkit. In this article, we’re building an optimized machine learning model.
We’ll introduce some of the Natural Language Toolkit (NLTK) machine learning classification schemes. Specifically, we’ll use the Naive Bayes Classifier to explore applying feature analysis to movie reviews and learn how to evaluate accuracy.

The goal of this series on Sentiment Analysis is to use Python and the open-source Natural Language Toolkit (NLTK) to build a library that scans replies to Reddit posts and detects if posters are using negative, hostile or otherwise unfriendly language.

At the intersection of statistical reasoning, artificial intelligence, and computer science, machine learning allows us to look at datasets and derive insights. Supervised learning is the process by which we start with a dataset that maps inputs to an expected output that has been labeled.

In contrast, unsupervised learning does not use labeling. With the supervised approach, we take an active role in evaluating algorithms, iterating through the evaluation process to identify the features or defining patterns that are meaningful in producing the labeled result.

What does that iterative supervised learning process look like? A few questions we’ll look at include:

  • What question are we trying to answer? For this project, we want to know if users are responding positively or negatively about our brand or something we shared on Reddit.
  • Do we have enough data? In previous examples, we looked only at a single post and its comments, so we’ll need to assemble a testing harness to examine more data both for our evaluation and for data to train with.
  • What features of the dataset can we identify? A feature is a value that we label to describe a characteristic of the data. The use of an exclamation point might be a feature to observe, as it could denote a meaningful, intense reaction (see the sketch after this list).
  • How will we measure success?
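
As a tiny, hypothetical sketch (not code from this project), a feature is just a named value describing one observation about a piece of text:

Python
def exclamation_feature(text):
    # One hypothetical feature: does the text contain an exclamation point?
    return {'has_exclamation': '!' in text}

print(exclamation_feature("This is cool!"))          # {'has_exclamation': True}
print(exclamation_feature("I don't recommend it."))  # {'has_exclamation': False}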

These seem like huge questions. But as with many programming problems, we can break the solution into achievable steps, and tools are available to help us each step along the way.

We’ll start by splitting our corpus up into multiple parts, or sets:

  • training set
  • test set
  • evaluation set

Creating a custom corpus can be a big job. It requires a person to first assemble all of the text together, and then check item by item to annotate or label it with a value. For example, a human reviewer would label "This is cool" as positive and "I don’t recommend it" as negative.

Before committing to that effort, it is worth a look around to see if there is an existing dataset that can be used. Once we find one, we’ll split that dataset into a training set and a test set. We'll use the training set to tune the algorithm and build a classification model. Then we'll use the test set to judge the accuracy of the classification model.

We can look at different ratios for how to split the dataset, but a 75/25, 80/20, or 90/10 split is common. That is, for 1000 records, we can use 800 of them to train and 200 to test.
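
As a quick sketch of that split (assuming the records have already been shuffled), an 80/20 split is just list slicing:

Python
records = list(range(1000))              # stand-in for 1,000 labeled records
split = int(len(records) * 0.8)          # 80% of 1,000 = 800
train_records, test_records = records[:split], records[split:]
print(len(train_records), len(test_records))  # 800 200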

The evaluation set will be any given Reddit post we want to compare to the VADER analysis we performed in Using Pre-trained VADER Models for NLTK Sentiment Analysis. What we'll discuss here doesn't require understanding that analysis, but it may be useful.

Preparing Data for Analysis with the Naive Bayes Classifier

In the article Finding Data for Natural Language Processing, we downloaded and took a look at the movie review corpus that is available from NLTK. We learned it was a collection of simple text files that had been categorized into positive and negative values.

We’re going to use that corpus again as a training set for building a Naive Bayes Classifier. We use it because it’s easily accessible in NLTK, and the idea of a film critic giving a movie a positive or negative review is easy to grasp, which lets us focus our attention on understanding the supervised learning process.

We start by structuring the dataset into tuple pairs containing the raw review text as the first element and a simple string label as the second element to indicate positive or negative sentiment.

Python
import random
from nltk.corpus import movie_reviews

# return a list of tuple pairs
#   a string of the raw review before tokenization
#   a string label of a pre-classified sentiment ('pos' or 'neg')
def get_labeled_dataset():
    dataset = []
    for label in movie_reviews.categories():
        for review in movie_reviews.fileids(label):
            dataset.append((movie_reviews.raw(review), label))

    random.shuffle(dataset)
    return dataset

Note that, because this corpus groups all of the reviews for one sentiment together, followed by all of the reviews for the other, we’re going to have to randomize it. If we didn’t shuffle, the split would put one sentiment almost entirely in the training set and leave the test set containing only the other. That distribution would be skewed instead of evenly spread out and would give us bad results.

We also need to create a dictionary of features that can help describe the dataset. We’ll spend most of our time tuning these features, but we’ll start with something simple: the number of characters used.

Python
def get_features(text):
    features = {}

    # Feature #1 - verbosity
    features['verbosity'] = len(text)

    return features

The features are expected to be string label keys with a simple type value, which is a natural fit for a Python dictionary.
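
For example, calling get_features on a short string returns a dictionary containing the single verbosity feature defined above:

Python
print(get_features("This is cool!"))
# {'verbosity': 13}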

Analyzing Sentiment with the Naive Bayes Classifier

With a dataset and some feature observations, we can now run an analysis. We’ll start with the Naive Bayes Classifier in NLTK, which is one of the easier algorithms to understand: it uses how frequently each feature value occurs with each label in the training set to estimate which label is the most probable match.
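
For reference, NLTK’s NaiveBayesClassifier.train() expects a list of (feature dictionary, label) tuples, so the training data we build below looks something like this (illustrative values, not actual corpus output):

Python
# Illustrative values only; the real training data is built from the corpus below.
train_set = [
    ({'verbosity': 4043}, 'pos'),
    ({'verbosity': 1288}, 'neg'),
]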

Let’s evaluate the results:

Python
import nltk.classify
from nltk import NaiveBayesClassifier

def evaluate_model(dataset, train_percentage=0.9):
    feature_set = [(get_features(i), label) for (i, label) in dataset]
    count = int(len(feature_set) * train_percentage)
    train_set, test_set = feature_set[:count], feature_set[count:]
    classifier = NaiveBayesClassifier.train(train_set)
    return nltk.classify.accuracy(classifier, test_set)

dataset = get_labeled_dataset()
print(evaluate_model(dataset))
# Output: 0.53

We take all of the reviews in the dataset and label them as positive or negative using the get_labeled_dataset function shown earlier. We then substitute the review with a dictionary that identifies features we believe may be important, such as verbosity, using the get_features function we defined above.

Next, we split up the dataset, using 90% to train the classifier. We save the remaining 10% to a test set.

Python
count = int(len(feature_set) * train_percentage)
train_set, test_set = feature_set[:count], feature_set[count:]

We run the Naive Bayes Classifier on the training dataset. The model can look at the features present in each of the items and make a guess. The more often the guess matches the actual label, the more accurate our model is.

Python
classifier = NaiveBayesClassifier.train(train_set)
return nltk.classify.accuracy(classifier, test_set)

In this case, the accuracy is 53%, which isn’t very impressive, but then the number of characters in a review isn’t an obviously useful predictor of sentiment either. Verbosity may be a useful measure of engagement, but not of sentiment.

We used a VADER analysis to identify a sentiment in Using Pre-trained VADER Models for NLTK Sentiment Analysis, but now with this approach, we can judge how accurate those polarity scores were in predicting a sentiment.

If the VADER score for a review has a positive intensity, we’d expect that to match the value a human identified for positive reviews. We can add this score as a feature in our model and run the training again.

Python
from nltk.sentiment.vader import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

def get_features(text):
    features = {}

    # Feature #1 - verbosity
    features['verbosity'] = len(text)

    # Feature #2 and #3 - lexical word choice
    scores = analyzer.polarity_scores(text)
    features['vader(pos)'] = scores['pos']
    features['vader(neg)'] = scores['neg']

    return features

If we run evaluate_model(dataset) this time with three features, we see an accuracy of 62%. That’s an improvement over using verbosity alone, but not good enough to say we have a successful model for predicting whether a movie review is positive or negative.

What can we do next to improve accuracy?

A useful approach is to look at prediction errors. Any movie reviews which were misidentified as positive or negative can be informative in identifying other features we should consider adding to the model.

Classification Model Error Analysis

The accuracy of our model at this point is 62%. What’s happening with the 38% we got wrong? Should we add a new feature? Should we try a different classification algorithm? Is there a problem with the data?

Determining the next steps toward higher accuracy is a process of trial and error. It can be informative to look at instances where our classification model fails to correctly identify a sentiment. These "prediction errors" can be analyzed for new insights into how to train our models, if they help us identify new features that differentiate the misclassified test data.

Let’s create a variation of our evaluate_model() method to get some additional insight. The changes are the call to show_most_informative_features() and the loop that collects prediction errors.

Python
def analyze_model(dataset, train_percentage=0.9):
    feature_set = [(get_features(i), label) for (i, label) in dataset]
    count = int(len(feature_set) * train_percentage)
    train_set, test_set = feature_set[:count], feature_set[count:]
    classifier = NaiveBayesClassifier.train(train_set)
    classifier.show_most_informative_features(5)

    accuracy = nltk.classify.accuracy(classifier, test_set)
    errors = []
    for (text, label) in dataset[count:]:
        guess = classifier.classify(get_features(text))
        if guess != label:
            tokens = nltk.word_tokenize(text)
            errors.append((label, guess, tokens[:10]))
    return (accuracy, errors)

When we use this variation, the first thing to dig into is the call to show_most_informative_features(). The output looks like this:

Most Informative Features
    vader(pos) = 0.099             neg : pos    =      8.9 : 1.0
    vader(neg) = 0.055             pos : neg    =      7.1 : 1.0
    vader(pos) = 0.131             neg : pos    =      6.9 : 1.0
    vader(pos) = 0.176             pos : neg    =      6.4 : 1.0
    vader(pos) = 0.097             neg : pos    =      6.1 : 1.0

This gives us an ordered list of the features most successful in predicting a result.

The first line tells us that when the feature vader(pos) has a value of 0.099, the review is 8.9 times more likely to be negative than positive.

There are a couple of conclusions we can reach from this output.

First, we used a continuous value from 0 to 1 in our features. This type of algorithm works better with discrete values, ideally binary ones. We could group the sentiment scores into buckets as a way of expressing them as discrete features, as sketched below.
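
Here’s a minimal sketch of that bucketing idea, reusing the analyzer defined earlier; the 0.1 and 0.3 thresholds are arbitrary choices for illustration, not tuned values:

Python
def vader_bucket(score):
    # Map a continuous 0-1 VADER intensity into a few discrete labels.
    # The 0.1 and 0.3 thresholds are illustrative, not tuned values.
    if score < 0.1:
        return 'low'
    elif score < 0.3:
        return 'medium'
    return 'high'

def get_features(text):
    features = {}

    # Discrete versions of the lexical word choice features
    # (verbosity is omitted here, anticipating the next point below)
    scores = analyzer.polarity_scores(text)
    features['vader_pos_bucket'] = vader_bucket(scores['pos'])
    features['vader_neg_bucket'] = vader_bucket(scores['neg'])

    return features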

Second, it is noteworthy that verbosity does not show up as a most informative feature. As we test a hypothesis regarding what features we should include in a model, sometimes we should drop features that do not help.

We made a second modification to the analyze_model() function to collect errors into a list. Each entry records an item where the prediction and the actual label do not match, along with the first few tokens of the movie review itself. Here are a few examples:

Python
[
('pos', 'neg', ['susan', 'granger', "'s", 'review', 'of', '``', 'hearts', 'in', 'atlantis', '``']), 
('neg', 'pos', ['phil', '(', 'radmar', 'jao', ')', 'has', 'a', 'hairy', 'problem', '.']), 
('pos', 'neg', ['an', 'astonishingly', 'difficult', 'movie', 'to', 'watch', ',', 'the', 'last', 'temptation']), 
('neg', 'pos', ['while', 'watching', 'loser', ',', 'it', 'occurred', 'to', 'me', 'that', 'amy']),
...]

In some cases, we predicted a positive review that was labeled negative, and in other cases the reverse. The third example was labeled as positive, but we predicted negative. The phrase "an astonishingly difficult movie to watch, the last temptation" sounds like a negative review even though it was labeled positive, so it warrants closer inspection.

We can find the complete review in nltk_data/corpora/movie_reviews/pos/cv440_15243.txt. The review includes phrases such as:

  • "An astonishingly difficult movie to watch"
  • "It just drags in the middle, with nothing truly happening"
  • "The film, though, has many trouble spots"

This looks like a mixed review at best to me, and a good example of how the data that goes into training our machine learning model is only as good as the labels originally assigned to it. It’s important to identify a dataset with labels that match your problem.

Next Steps

Building an optimized machine learning model is a trial and error process, but now we have a clear goal and steps toward that goal.

We’ll look at building our own dataset in Improving NLTK Sentiment Analysis with Data Annotation.

To review our previous steps toward performing NLP analysis with VADER and NLTK, see Using Pre-trained VADER Models for NLTK Sentiment Analysis.

This article is part of the series 'Sentiment Analysis'.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)


Written By
Technical Lead
United States

Jayson manages Developer Relations for Dolby Laboratories, helping developers deliver spectacular experiences with media.

Jayson likes learning and teaching about new technologies with a wide range of applications and industries. He's built solutions with companies including DreamWorks Animation (Kung Fu Panda, How to Train Your Dragon, etc.), General Electric (Predix Industrial IoT), The MathWorks (MATLAB), Rackspace (Cloud), and HERE Technologies (Maps, Automotive).
