
This is the sixth module in our series on learning Python and its use in machine learning and AI. In the previous module, we discussed image recognition with OpenCV. Now let’s take a look at what you can do with the Natural Language Toolkit (NLTK).

Installation

NLTK can be installed using Anaconda:

conda install nltk

Or with pip, by running this in a Jupyter Notebook cell:

!pip install --upgrade nltk

If the following Python code runs without errors, the installation succeeded:

import nltk

NLTK comes with a lot of data (corpora, grammars, models, and more) that you can download. Simply run this Python command to display an interactive download window:

nltk.download()

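For this module, you'll need to install the "stopwords" corpus. The word_tokenize function used later also relies on the Punkt tokenizer models (listed as "punkt", or "punkt_tab" in recent NLTK releases), so download those as well if they are not already present. After downloading, create an environment variable named NLTK_DATA containing the path of the download directory (this isn’t needed if you do a central installation; refer to the documentation for a complete guide on installing the data).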
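One way to set NLTK_DATA for the current session is from Python itself. This is a minimal sketch, not the only approach: the directory below is a hypothetical placeholder, and the variable is set before nltk is first imported because NLTK builds its data search path at import time.

```python
import os
from pathlib import Path

# Hypothetical download directory; replace it with the one you actually used.
data_dir = Path.home() / "datasets" / "nltk_data"

# Set the variable before the first `import nltk` in this process,
# so that NLTK picks it up when it builds its data search path.
os.environ["NLTK_DATA"] = str(data_dir)

print(os.environ["NLTK_DATA"])
```

Setting the variable this way only affects the running process; for a permanent setting, define it in your operating system or shell configuration instead.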

Text Classification

To classify text means to assign a label to it. Text can be classified in various ways, such as by sentiment (positive, negative, and sometimes neutral), as spam or not spam, by document topic, and so forth.

In this module, we'll walk through a text classification example using the Large Movie Review Dataset, which offers 25,000 movie reviews (both positive and negative) for training and an equal number for testing.

NLTK offers a Naive Bayes Classifier to handle the machine learning work. Our job is mainly to write a function that extracts "features" from the text. The classifier uses these features to perform its classification.

Our function, called a feature extractor, takes a string (the text) as an argument and returns a dictionary that maps feature names to their values, called a feature set.

For the movie reviews, our features will be the top N words (excluding stop words). So, the feature extractor will return a feature set with those N words as keys, and a Boolean indicating their presence or absence as a value.
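To make the shape of a feature set concrete, here is a toy sketch. The four-word vocabulary and the function name toy_extract_features are hypothetical stand-ins for the top N words and the real extractor developed below, and the whitespace split is a simplification of proper tokenization.

```python
# Hypothetical vocabulary standing in for the top N words.
vocabulary = ["great", "boring", "plot", "funny"]

def toy_extract_features(text):
    """Map each vocabulary word to True/False depending on its presence."""
    words = set(text.lower().split())
    return {word: word in words for word in vocabulary}

features = toy_extract_features("A great plot but a boring ending")
print(features)
# {'great': True, 'boring': True, 'plot': True, 'funny': False}
```

Every feature set has the same keys, whatever the input text; only the Boolean values change.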

The first step is to go through the reviews, store all words (except stop words), and find the most common words.

This helper function takes a text and returns its words with the stop words removed:

import nltk
import nltk.sentiment.util
from nltk.corpus import stopwords

# Use a set for fast membership tests
stop = set(stopwords.words("english"))
def extract_words_from_text(text):
    tokens = nltk.word_tokenize(text)
    tokens_neg_marked = nltk.sentiment.util.mark_negation(tokens)
    return [t for t in tokens_neg_marked
             if t.replace("_NEG", "").isalnum() and
             t.replace("_NEG", "") not in stop]

word_tokenize splits the text into a list of tokens (still keeping punctuation).

mark_negation marks tokens that come after a negation with _NEG. So, for example, "I did not enjoy this." becomes this after tokenization and marking negations:

["I", "did", "not", "enjoy_NEG", "this_NEG", "."].

The return statement keeps only alphanumeric tokens that are not stop words, which removes all stop words (including the negated ones) and punctuation. There are still many useless words in the text, such as "I" or "This", but this filtering suffices for our demonstration.
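The filtering step can be tried in isolation on the example sentence. In this sketch the stop list is a hypothetical three-word set (the real one comes from stopwords.words("english")), and keep_token is an illustrative helper name that does not appear in the original code.

```python
# Tokens as they look after tokenization and negation marking.
marked = ["I", "did", "not", "enjoy_NEG", "this_NEG", "."]

# Hypothetical, tiny stop list used only for this illustration.
toy_stop = {"did", "not", "this"}

def keep_token(token, stop_words):
    """Return True for alphanumeric tokens that are not stop words."""
    base = token.replace("_NEG", "")  # compare without the negation suffix
    return base.isalnum() and base not in stop_words

kept = [t for t in marked if keep_token(t, toy_stop)]
print(kept)  # ['I', 'enjoy_NEG']
```

The negation suffix is stripped only for the comparison, so surviving tokens such as "enjoy_NEG" keep their marker and count as different words from "enjoy".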

Next, we construct a list of all words that were read from the review files. We keep separate lists of positive and negative words to ensure balance when we take the top words. (When I tested it without keeping the word lists separate, the majority of positive reviews were classified as negative.) At the same time, we can also create lists of all positive reviews and all negative reviews.

import os

positive_files = os.listdir("aclImdb/train/pos")
negative_files = os.listdir("aclImdb/train/neg")

positive_words = []
negative_words = []

positive_reviews = []
negative_reviews = []

# Read every review, replace HTML line breaks, and collect its words.
# The encoding is given explicitly so that reading does not depend on the platform default.
for pos_file in positive_files:
    with open("aclImdb/train/pos/" + pos_file, "r", encoding="utf-8") as f:
        txt = f.read().replace("<br />", " ")
        positive_reviews.append(txt)
        positive_words.extend(extract_words_from_text(txt))
for neg_file in negative_files:
    with open("aclImdb/train/neg/" + neg_file, "r", encoding="utf-8") as f:
        txt = f.read().replace("<br />", " ")
        negative_reviews.append(txt)
        negative_words.extend(extract_words_from_text(txt))

Running this code can take a while because the training set contains 25,000 files.

Then, we keep only the top N words (in this example, 2000 words) from both the positive and negative word lists and combine them.

N = 2000

freq_pos = nltk.FreqDist(positive_words)
top_word_counts_pos = sorted(freq_pos.items(), key=lambda kv: kv[1], reverse=True)[:N]
top_words_pos = [twc[0] for twc in top_word_counts_pos]

freq_neg = nltk.FreqDist(negative_words)
top_word_counts_neg = sorted(freq_neg.items(), key=lambda kv: kv[1], reverse=True)[:N]
top_words_neg = [twc[0] for twc in top_word_counts_neg]

top_words = list(set(top_words_pos + top_words_neg))
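The same selection can be tried on a small scale with collections.Counter, which nltk.FreqDist subclasses, so most_common is available on both. This is a sketch with hypothetical toy word lists; top_n is an illustrative helper, not part of the original code.

```python
from collections import Counter

def top_n(words, n):
    """Return the n most frequent words, most frequent first."""
    return [word for word, _count in Counter(words).most_common(n)]

# Hypothetical toy word lists.
toy_pos = ["great", "great", "great", "film", "film", "fun"]
toy_neg = ["boring", "boring", "film", "film", "film", "dull"]

# Taking the top words per class and then merging keeps both classes represented.
toy_top = sorted(set(top_n(toy_pos, 2) + top_n(toy_neg, 2)))
print(toy_top)  # ['boring', 'film', 'great']
```

Because the two top lists overlap ("film" appears in both), the merged list can hold fewer than 2N words.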

Now we can write a feature extractor. As mentioned earlier, it should return a dictionary with each top word as a key and either True or False as value, depending on whether the word is present in the text.

def extract_features(text):
    # A set makes each membership test constant-time
    text_words = set(extract_words_from_text(text))
    return { w: w in text_words for w in top_words }

We then create a training set, which we’ll feed to the Naive Bayes Classifier. The training set should be a list of tuples where each tuple's first element is the feature set and the second element is the label.

training = (
    [(extract_features(review), "pos") for review in positive_reviews]
    + [(extract_features(review), "neg") for review in negative_reviews]
)

The above statement consumes a lot of RAM and is slow, so you may want to use a subset of the reviews instead by taking a slice of the review lists.
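Slicing can be folded into a small helper that builds the labeled tuples. This is a sketch under assumptions: build_labeled_set, the toy reviews, and toy_extractor are hypothetical names for illustration; in the real code you would pass positive_reviews, negative_reviews, and extract_features.

```python
def build_labeled_set(reviews, label, extractor, limit=None):
    """Pair each review's feature set with its label, using at most `limit` reviews."""
    return [(extractor(review), label) for review in reviews[:limit]]

# Hypothetical stand-ins for the real review lists and feature extractor.
toy_reviews_pos = ["loved it", "great fun", "a joy"]
toy_reviews_neg = ["hated it", "dull"]

def toy_extractor(text):
    return {"it": "it" in text.split()}

toy_training = (build_labeled_set(toy_reviews_pos, "pos", toy_extractor, limit=2)
                + build_labeled_set(toy_reviews_neg, "neg", toy_extractor, limit=2))
print(len(toy_training))  # 4
print(toy_training[0])    # ({'it': True}, 'pos')
```

Using the same limit for both classes keeps the training set balanced; with limit=None the slice covers the whole list.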

Training a classifier is simple:

classifier = nltk.NaiveBayesClassifier.train(training)

To classify a new review, call the classify method on its feature set:

print(classifier.classify(extract_features("Your review goes here.")))

If you want to see the probabilities per label, use prob_classify instead:

def get_prob_dist(text):
    prob_dist = classifier.prob_classify(extract_features(text))
    return { "pos": prob_dist.prob("pos"), "neg": prob_dist.prob("neg") }

print(get_prob_dist("Your review goes here."))

The classifier has a built-in way to determine the accuracy of the model, based on a test set. This test set is shaped in the same way as the training set. The movie review dataset has a separate directory with reviews that can be used for this purpose.

test_positive = os.listdir("aclImdb/test/pos")[:2500]
test_negative = os.listdir("aclImdb/test/neg")[:2500]

test = []

# Build (feature set, label) tuples for the test reviews
for pos_file in test_positive:
    with open("aclImdb/test/pos/" + pos_file, "r", encoding="utf-8") as f:
        txt = f.read().replace("<br />", " ")
        test.append((extract_features(txt), "pos"))
for neg_file in test_negative:
    with open("aclImdb/test/neg/" + neg_file, "r", encoding="utf-8") as f:
        txt = f.read().replace("<br />", " ")
        test.append((extract_features(txt), "neg"))

print(nltk.classify.accuracy(classifier, test))

Using N = 2000, with 5000 positive reviews and 5000 negative reviews in the training set, I reached an accuracy of about 85% with this code.
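Accuracy here is simply the fraction of test examples whose predicted label matches the expected one. The following sketch mirrors that idea in plain Python; accuracy_sketch and toy_classify are hypothetical names, and the toy rule stands in for classifier.classify rather than reproducing NLTK's implementation.

```python
def accuracy_sketch(classify, labeled_feature_sets):
    """Fraction of examples whose predicted label equals the expected label."""
    if not labeled_feature_sets:
        return 0.0
    correct = sum(1 for features, label in labeled_feature_sets
                  if classify(features) == label)
    return correct / len(labeled_feature_sets)

# Hypothetical stand-in for classifier.classify: "pos" if "great" is present.
def toy_classify(features):
    return "pos" if features.get("great") else "neg"

toy_test = [
    ({"great": True}, "pos"),
    ({"great": False}, "neg"),
    ({"great": False}, "pos"),  # misclassified by the toy rule
    ({"great": True}, "pos"),
]
print(accuracy_sketch(toy_classify, toy_test))  # 0.75
```

Since the test set above contains equal numbers of positive and negative reviews, a classifier that always answers with one label would score about 50%, which is the baseline to compare against.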

Conclusion

In this module, we looked at how to use NLTK for text classification, demonstrated with sentiment analysis. You can apply the same approach to other classification tasks, including those with more than two labels.

In the next module, we'll look at Keras.