
The goal of this series on Sentiment Analysis is to use Python and the open-source Natural Language Toolkit (NLTK) to build a library that scans replies to Reddit posts and detects if posters are using negative, hostile or otherwise unfriendly language.

Part 1 - Introducing NLTK for Natural Language Processing with Python
Part 2 - Finding Data for Natural Language Processing
Part 3 - Using Pre-trained VADER Models for NLTK Sentiment Analysis
Part 4 - Pros and Cons of NLTK Sentiment Analysis with VADER
Part 5 - NLTK and Machine Learning for Sentiment Analysis
Part 6 - Improving NLTK Sentiment Analysis with Data Annotation
Part 7 - Using Cloud AI for Sentiment Analysis

Listening to feedback is critical to the success of projects, products, and communities. However, as the size of your audience increases, it becomes increasingly difficult to understand what your users are saying. For this, sentiment analysis can help.

In Using Pre-trained VADER Models for NLTK Sentiment Analysis, we examined the role sentiment analysis plays in identifying the positive and negative feelings others may have for your brand or activities. Analyzing unstructured text is a common enough activity in natural language processing (NLP) that there are mainstream tools that can make it easier to get started.

Python’s Natural Language Toolkit (NLTK) is one of these tools. In the previous articles, we learned how to retrieve data from Reddit, which hosts many popular online communities. We then used VADER analysis to derive a sentiment score from that Reddit data. The sentiment score helps us understand whether the comments represent positive or negative views.

In this and the following articles, we’re going to try to improve our approach to analyzing the sentiment of our communities. We’ll start by reviewing the pros and cons of the VADER model we've used so far.

The Lexical Approach to Sentiment Analysis

The VADER Sentiment Analyzer uses a lexical approach. That means it relies on a vocabulary of words that have each been assigned a predetermined positive or negative score. The scores come from a pre-built model whose entries were rated by human reviewers.

For example, here’s a comment from the Reddit data:

import praw

# Connect to Reddit to query a specific post
reddit = praw.Reddit(client_id='your-id',
                     client_secret='your-secret',
                     user_agent='your-agent')
post = "https://www.reddit.com/r/learnpython/comments/fwhcas/whats_the_difference_between_and_is_not"
submission = reddit.submission(url=post)

# Expand every 'more comments' placeholder, then flatten the comment tree
submission.comments.replace_more(limit=None)
comments = submission.comments.list()
print(comments[116].body)

The output is:

'This is cool!'

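Each term in the VADER lexicon, such as "cool", carries an emotional intensity rating on a scale from -4 (most negative) to +4 (most positive). Words that do not appear in the lexicon, which likely includes "This" and "is", add nothing to the score. Here’s the lexicon entry for the token "cool":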

cool    1.3 0.64031 [1, 1, 2, 1, 1, 1, 2, 2, 2, 0]
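
To make the entry easier to read, here is a minimal sketch that splits it into its fields: the token, the mean rating, the standard deviation, and the individual human ratings. It assumes the fields are tab-separated, as they are in the lexicon file, and the helper name parse_lexicon_entry is hypothetical rather than part of NLTK.

```python
import statistics

# One entry, copied from the lexicon line shown above (tab-separated by assumption).
entry = "cool\t1.3\t0.64031\t[1, 1, 2, 1, 1, 1, 2, 2, 2, 0]"

def parse_lexicon_entry(line):
    """Split a lexicon line into token, mean score, std dev, and raw ratings."""
    token, mean, std_dev, raw = line.split("\t")
    ratings = [int(r) for r in raw.strip("[]").split(",")]
    return token, float(mean), float(std_dev), ratings

token, mean, std_dev, ratings = parse_lexicon_entry(entry)

# The mean and standard deviation can be recomputed from the raw ratings.
print(token, mean, statistics.mean(ratings))
print(std_dev, round(statistics.pstdev(ratings), 5))
```

The recomputed values match the stored ones, which shows that the score of 1.3 is simply the average of ten reviewers' ratings.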

Additional rules cover syntax elements like punctuation. The exclamation point, for example, is used to modify the overall intensity of a phrase or sentence. Other terms, such as "but" or "not", would modify the intensity in the opposite direction.
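
The interplay of lexicon scores and rules can be sketched with a toy scorer. This is not VADER's actual algorithm: the boost and negation constants are assumptions chosen for illustration, only the "cool" score comes from the lexicon entry, and the sketch handles "not" but leaves out "but".

```python
# Hypothetical mini-lexicon: only the "cool" score is taken from the article;
# every other value below is made up for illustration.
LEXICON = {"cool": 1.3, "awful": -2.0}
NEGATIONS = {"not", "never"}
EXCLAIM_BOOST = 0.3      # assumed bump per exclamation point
NEGATION_FACTOR = -0.5   # assumed flip-and-dampen factor

def toy_score(text):
    """Sum lexicon scores, flip negated words, and amplify on '!'."""
    words = text.lower().replace("!", " ").replace(".", " ").split()
    total = 0.0
    for i, word in enumerate(words):
        score = LEXICON.get(word, 0.0)  # unknown words are neutral
        if i > 0 and words[i - 1] in NEGATIONS:
            score *= NEGATION_FACTOR
        total += score
    # Exclamation points push the total further from zero.
    boost = EXCLAIM_BOOST * text.count("!")
    if total > 0:
        total += boost
    elif total < 0:
        total -= boost
    return round(total, 2)

print(toy_score("This is cool"))      # 1.3
print(toy_score("This is cool!"))     # 1.6
print(toy_score("This is not cool"))  # -0.65
```

The exclamation point raises the score, and the negation flips it, which mirrors the direction of the rules described above without claiming their real magnitudes.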

There are some distinct advantages to this approach:

For many applications, such as evaluating public opinion, performing a competitive analysis, or enhancing customer experience, this approach is easy to understand.
The lexical approach is quick to implement, requiring just readily available libraries and a few lines of code.
It's easy to capture a dataset for analysis.
It's efficient at analyzing large datasets.

There are also some disadvantages to this approach:

Misspellings and grammatical mistakes may cause the analysis to overlook important words or usage.
Sarcasm and irony may be misinterpreted.
Analysis is language-specific.
Domain-specific jargon, nomenclature, memes, or turns of phrase may not be recognized.
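
The first and last of these limitations follow from how a lexicon lookup works. Here is a minimal sketch, assuming an exact-match lookup against a hypothetical two-word lexicon; a real analyzer may normalize text in ways this sketch does not.

```python
# Hypothetical mini-lexicon; only the "cool" score is taken from the article.
LEXICON = {"cool": 1.3, "great": 3.0}

def matched_terms(text):
    """Return the words in text that have a lexicon score."""
    words = text.lower().split()
    return {w: LEXICON[w] for w in words if w in LEXICON}

print(matched_terms("this library is cool"))   # {'cool': 1.3}
print(matched_terms("this library is coool"))  # {}
print(matched_terms("gr8 library"))            # {}
```

A misspelling or a piece of slang that is missing from the lexicon never matches, so it is silently treated as neutral.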

For certain use cases that seek a higher level of accuracy, it may be worth evaluating alternatives.

More importantly, certain domain-specific contexts may need a different approach. For example, a target corpus that includes specialized terms, language, or knowledge (like that of a programming community) differs substantially from the social media posts the pre-trained VADER model was originally built on. Source code, for example, carries little sentiment apart from the occasional aggressive variable name, yet it can be misinterpreted in sentiment analysis.

There are some machine learning classification approaches that may help with this.

Next Steps

In this article, we briefly looked at some pros and cons of using a lexical approach to sentiment analysis.

As a next step, NLTK and Machine Learning for Sentiment Analysis covers creating the training, test, and evaluation datasets for the NLTK Naive Bayes classifier.

If you need to catch up with previous steps of the VADER analysis, see Using Pre-trained VADER Models for NLTK Sentiment Analysis.