Lina ChatBot - Generating Response Using Document Retrieval TF-IDF

The Zakies

5.00/5 (6 votes)

Sep 10, 2017

CPOL

11 min read

21271

481

Build a retrieval based chatbot using TF-IDF

Download source - 80.1 KB

Series Introduction

This would be a series of multiple tuts about natural language and machine learning, we would manually create a chatbot table to

chat with users
answer their questions
extract data about users
perform some actions (recommending movie, telling time ...)

This series' goal is not just to create a chatbot, but it is actually about viewing one practical project that can be achieved using NLP & ML, as my primary goal would be to present both ML & NLP in an interesting way which is to create a chatbot.

In this series, we would cover some techniques, both simple and advanced to reach our final goal. They are:

Document retrieval using TF-IDF
Document retrieval using Word2Vect
cky Parsing
web scrapping

I hope that through this series, you would come to love NLP & ML through this fun project.

We would like to call our Chatbot Lina :D

Tut Introduction

To build a chatbot, there are two primary methods, generative method, and a retrieval method. For this series, we would work on a retrieval method, as it would be both simple and rewarding as it would output really good results

In this TUT, we would discuss how to use document retrieval to make Lina chat with users, in coming tuts, we would discuss how to make it more interactive.

To build using the retrieval method, there are some general steps:

Define some sentences and their responses (data gathering step), I have attached a simple sentence response from a repo available online called rDanny.
Teach your chatbot this dataset using TF-IDF (term freq -inv document freq).
Compare between the input sentence and the learnt using cosine similarity.

The attached zip contains both the code and the dataset.

Agenda

In this tut, we would:

(Optional) begin discussing briefly the concept of TF-IDF (term freq -inv document freq)
(Optional) then discuss what is cosine similarity and why do we use it
(You can skip the technical background and get ahead with creating your ChatBot) then we come to the interesting part which is really practically building Lina (our chatbot :D)

Most of the technicalities came from Dan Jurafsky & Chris youtube playlist.

And the used dataset came from rDanny.

So let's begin!

1. TF-IDF

tf-idf stands for term freq - inv document freq, it's a simple method to try to give scores to documents that look the same as the user's query.

It is built over two main concepts:

How many times this word (in the query ) appeared in the each document ==> term freq
How unique is this word ==> inv document freq

1.a Term Freq

It is a simple count of how many of a specific word appeared in a given document, we count all words in all given documents ==> to give scores to documents.

Example:

Here, in this example, we have multiple documents, here they are Shakespeare's plays, we test for specific word occurrences through these documents, as here when only working on Term freq, we are only interested in word counts:

Now we would need to measure score just according to term freq (word count), in coming sections, we would leverage the way we calculate the score BUT

Raw term frequency is not what we want as:

A document with 10 occurrences of the term is more relevant than a document with 1 occurrence of the term.
But not 10 times more relevant.

As Relevance does not increase proportionally with term frequency. SO we would use Log freq not just simple word count:

Remember that tf is just word count, so whenever tf > 0, we would calculate score as 1+Log, else if word count is zero (tf=0), score would be zero. So now to calculate score.

Term Freq is not enough

But we must account for how rare a certain word is if compared to other words through all documents. So to complete TF-IDF we must take into consideration document freq, or to be more precise its inverse to take into consideration the rarity of words

1.b Inverse Document Freq

Here, we tend to measure rarity of words, this is done to calculate how many documents a word appeared in. We count a word appeared once in a document once seen in this document (even if it appeared multiple times in this document), as our main goal here is counting documents not words.

Example:

Assume a million document and you look for a list of words and see how many documents these words appeared in.

df here stands for number of documents. So here a word like “calpurnia“ appeared in only one document (very rare) so it would have high score, so a small df (document count) gets a big score, so we get the inverse of document freq:

So by calculating this for all the above terms (words), results would be:

Note:

as you have seen here, IDF (inverse document freq) is independent to the query as it only depends on documents, so we don’t need to calculate idf for each new query, as there is one idf value for each term. This can be used for optimization, calculating idf prior to query

So if we want to measure score just according to IDF (inverse document freq):

1.c Combing both TF & IDF in a Simple Score Calculation

This score calculation needs to:

Inc with number of occurrences (TermFreq)
Inc with rarity of words (inverse Document Freq)

We first calculate weight for a specific term in a specific document.

Applying this to the previous example of Shakespeare's Play:

Now each document is represented by a real-valued vector of tf-idf weights.

Now to calculate score of a specific query versus a specific document.

t==> term (word) , q==> query , d==> document

So you calculate the score for all terms (t) in the query (q) (as tf =0 if this term wasn't found in the given document). And you sum scores for all terms in the query to get a score of this document.

Note:

Now after knowing how to calculate score for a specific document, so to do document retrieval we would need to do this method for all documents, but we tend to use concept of vector space not just simple score formula to get similarities, so we represent both query and document in a vector space so we use similarity calculations

1.d Why Vector Space Model ??

In the previous example, we can say that each document now becomes a vector of real values.

Taking Julius Caesar, for example, its vector would be ==> [ 3.18 6.1 2.54 1.54 0 0 0 ]

We can assume words (terms) to be the dimensions of our space, so now the document would be like a point (or a vector in this space, and we also can think of the queries the same way, as a point (vector) in the space.

BUT this space's can become of Very High Dimensions (terms of tens of millions), as remember our dimensions depend on the number of vocab you have, so we need a more efficient way than running the previous:

Ten million time for our 10 thousand document (assuming our vocab is 10 million and that we have 10 thousand document) and here comes the need to use methods that use the concept of vector space, one of them can simply be regarded as a way to calculate the smallest distance between 2 vectors (a specific document and a query ) in a big dimension space, this is called Euclidean distance.

Here, as we see, this is a tiny space of just two words (Gossip & Jealous), we have 3 documents and 1 query, the 3 documents:

d1 ==> has to do more with Gossiping
d3==> has to do more with Jealous
d2 ==> contains both terms

And it is shown here that q (our query) wants a document that contains both Gossip & Jealous (d2), but Euclidean distance won't return d2 as distance from q (from tip of the vector) to d2 (tip of d2) is bigger than that of d1 and d3.

So we use the concept of measuring angle rather than measuring distance.

So here comes the use of Cosine Similarity

2. Cosine Similarity

2.a Why Cosine Similarity?

There are many ways to use the concept of vector space, but we specifically use Cosine similarity as when using vector space, we would face a problem, which is different length documents would result in wrong scores (as discussed before), to solve this problem, we must consider using the concept of Length normalization, so this is the reason why we use cosine similarity

2.b Length Normalization

A vector can be length normalized by dividing each of its components by its length – for this, we use the L2 norm:

Now Long and short documents have comparable weights.

2.c Why Use cosine Function

We use cosine similarity because Cosine is a monotonically decreasing function for the interval [0o, 180o] and ranges from 1 → -1

The following two notions are equivalent.

Rank documents in decreasing order of the angle between query and document
Rank documents in increasing order of cosine (query, document) from the very nature of cosine function

2.d Cosine Similarity Computation

Now let's really compute the similarity using cosine similarity:

Remember that we calculate TF-IDF for both query and a specific document, then we use the cosine method to get similarity, so we primarily need two variables q and d:

qi is the TF-IDF weight of term i in the query
di is the TF-IDF weight of term i in the document

BUT

We are working on normalized length, so th eq can be simplified to be only dot product (or scalar product)

Example:

You can visualize it to only be:

Example:

3. Let's Build Our Chatbot!

All the above explanation was just to get to know the concept behind TF-IDF, as there are already many free tools that implement TF-IDF, the art is not just to create TF-IDF again, but to how to put it in good use.

We need for our chatbot a dataset in the form of sentences and their responses.

Our general steps would be:

Where each sentence would be assumed to be a document on its own
Then when a user inputs his chat, we would consider it to be our query
Then we use TF-IDF and cosine similarity to compare them both, and get the most similar sentence from the dataset that is most similar to the query and output the response

So Let's Start Coding!

First, we would need to import:

We would primarily use sklearn & numpy library (TF-IDF libraries)
We would also use some other libraries, their function would appear later (helper Libraries)

# _____TF-IDF libraries_____
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np

# _____helper Libraries_____
import pickle  # would be used for saving temp files
import csv     # used for accessing the dataset
import timeit  # to measure time of training
import random  # used to get a random number

Now let's create a function that takes a string for an input and outputs the response, our whole work would be inside this function.

Let's define the path of csv file, we put it inside folder named DataSet.

def talk_to_lina_primary(test_set_sentence):
   
    csv_file_path = "DataSet/randychat.csv"

To work with sklearn, we must load the string chat to a dictionary so, and we would create an array to load the csv file inside, so we would create an array called sentences.

def talk_to_lina_primary(test_set_sentence):
   
    csv_file_path = "DataSet/randychat.csv"

    i = 0
    sentences = []

    # enter your test sentence
    test_set = (test_set_sentence, "")

We would have two sections for our code, one for training and the other would be for running the code. We would use to save our objects in temp files, we would save these temp files in DataSet folder. So we would use a try except clauses for this, and put running phase in the try section so that when no temp files are present, we would go to the except section.

def talk_to_lina_primary(test_set_sentence):
   
    csv_file_path = "DataSet/randychat.csv"

    # we would use 2 temp files
    tfidf_vectorizer_pikle_path = "DataSet/tfidf_vectorizer.pickle"
    tfidf_matrix_train_pikle_path ="DataSet/tfidf_matrix_train.pickle"

    i = 0
    sentences = []

    test_set = (test_set_sentence, "")

    # for indexes
    sentences.append(" No you.")
    sentences.append(" No you.")

    try:
        ##--------------to use------------------#
        # ----------------------------------------#
    except:
        # ---------------to train------------------#
        # -----------------------------------------#

For training, we would notice, that:

Identing the dimension of the space, is independant on the query, also
Learning TF-IDF of the dataset would also be independant on the query

So we can learn them once and then save them to two temp files:

def talk_to_lina_primary(test_set_sentence):
   
    csv_file_path = "DataSet/randychat.csv"
    tfidf_vectorizer_pikle_path = "DataSet/tfidf_vectorizer.pickle"
    tfidf_matrix_train_pikle_path ="DataSet/tfidf_matrix_train.pickle"

    i = 0
    sentences = []

    # enter your test sentence
    test_set = (test_set_sentence, "")

    # for indexes
    sentences.append(" No you.")
    sentences.append(" No you.")

    try:
        ##--------------to use------------------#
        # ----------------------------------------#
    except:
        # ---------------to train------------------#
        # variable to see how much time it took to train 
        start = timeit.default_timer()

        # to load rows of csv into sentences array
        with open(csv_file_path, "r") as sentences_file:
            reader = csv.reader(sentences_file, delimiter=',')
            for row in reader:
                sentences.append(row[0])
                i += 1
    
        # now we would start on with training
        # begin with identing a TfidfVectorizer obj
        tfidf_vectorizer = TfidfVectorizer()

        # this line does both
        # 1- identify dimension of the space
        # 2- learn tf-idf of the dataset
        tfidf_matrix_train = tfidf_vectorizer.fit_transform(sentences) 


        # now the training is finished
      
        # now simply record time that it finished training
        stop = timeit.default_timer()
        print ("training time took was : ")
        print stop - start

        # then we would save these 2 objs (dimension space and tf-idf to temp files)
        # we use pickle lib

        # first save dimension space
        f = open(tfidf_vectorizer_pikle_path, 'wb')
        pickle.dump(tfidf_vectorizer, f)
        f.close()

        # then save tf-idf of dataset
        f = open(tfidf_matrix_train_pikle_path, 'wb')
        pickle.dump(tfidf_matrix_train, f)
        f.close()
        # -----------------------------------------#

Then we would work on the section of using (after learning):

def talk_to_lina_primary(test_set_sentence):
   
    csv_file_path = "DataSet/randychat.csv"
    tfidf_vectorizer_pikle_path = "DataSet/tfidf_vectorizer.pickle"
    tfidf_matrix_train_pikle_path ="DataSet/tfidf_matrix_train.pickle"

    i = 0
    sentences = []
    test_set = (test_set_sentence, "")

    # for indexes
    sentences.append(" No you.")
    sentences.append(" No you.")

    try:
        ##--------------to use------------------#
        # first load dimension space
        f = open(tfidf_vectorizer_pikle_path, 'rb')
        tfidf_vectorizer = pickle.load(f)
        f.close()

        # then load tf-idf of dataset
        f = open(tfidf_matrix_train_pikle_path, 'rb')
        tfidf_matrix_train = pickle.load(f)
        f.close()
        # ----------------------------------------#
    except:
        # ---------------to train------------------#
        start = timeit.default_timer()

        # enter jabberwakky sentence
        with open(csv_file_path, "r") as sentences_file:
            reader = csv.reader(sentences_file, delimiter=',')
            for row in reader:
                sentences.append(row[0])
                i += 1

        tfidf_vectorizer = TfidfVectorizer()
        tfidf_matrix_train = tfidf_vectorizer.fit_transform(sentences) 
        stop = timeit.default_timer()
        print ("training time took was : ")
        print stop - start

        f = open(tfidf_vectorizer_pikle_path, 'wb')
        pickle.dump(tfidf_vectorizer, f)
        f.close()

        f = open(tfidf_matrix_train_pikle_path, 'wb')
        pickle.dump(tfidf_matrix_train, f)
        f.close()
        # -----------------------------------------#

Both sections of using and training output two objects, we then use those two objects to actually achieve similarity calculations.

We need:

Run TF-IDF on query using the learnt dimension space
Run cosine similarity between both TF-IDF of query & TF-IDF of dataset

def talk_to_lina_primary(test_set_sentence):
   
    csv_file_path = "DataSet/randychat.csv"
    tfidf_vectorizer_pikle_path = "DataSet/tfidf_vectorizer.pickle"
    tfidf_matrix_train_pikle_path ="DataSet/tfidf_matrix_train.pickle"

    i = 0
    sentences = []

    # enter your test sentence
    test_set = (test_set_sentence, "")

    # 3ashan yzabt el indexes
    sentences.append(" No you.")
    sentences.append(" No you.")

    try:
        ##--------------to use------------------#
        f = open(tfidf_vectorizer_pikle_path, 'rb')
        tfidf_vectorizer = pickle.load(f)
        f.close()

        f = open(tfidf_matrix_train_pikle_path, 'rb')
        tfidf_matrix_train = pickle.load(f)
        f.close()
        # ----------------------------------------#
    except:
        # ---------------to train------------------#
        start = timeit.default_timer()

        # enter jabberwakky sentence
        with open(csv_file_path, "r") as sentences_file:
            reader = csv.reader(sentences_file, delimiter=',')
            # reader.next()
            # reader.next()
            for row in reader:
                # if i==stop_at_sentence:
                #    break
                sentences.append(row[0])
                i += 1

        tfidf_vectorizer = TfidfVectorizer()
        tfidf_matrix_train = tfidf_vectorizer.fit_transform(sentences)  # finds the tfidf score with normalization
        # tfidf_matrix_test =tfidf_vectorizer.transform(test_set)
        stop = timeit.default_timer()
        print ("training time took was : ")
        print stop - start

        f = open(tfidf_vectorizer_pikle_path, 'wb')
        pickle.dump(tfidf_vectorizer, f)
        f.close()

        f = open(tfidf_matrix_train_pikle_path, 'wb')
        pickle.dump(tfidf_matrix_train, f)
        f.close()
        # -----------------------------------------#

    # use the learnt dimension space
    # to run TF-IDF on the query
    tfidf_matrix_test = tfidf_vectorizer.transform(test_set)

    # then run cosine similarity between the 2 tf-idfs
    cosine = cosine_similarity(tfidf_matrix_test, tfidf_matrix_train)
    cosine = np.delete(cosine, 0)
    
    # then get the max score
    max = cosine.max()
    response_index = 0

    # if score is more than 0.7
    if (max > 0.7): 
        # we can afford to get multiple high score documents to choose from
        new_max = max - 0.01
        
        # load them to a list
        list = np.where(cosine > new_max)
        
        # choose a random one to return to the user 
        # this happens to make Lina answers differently to same sentence
        response_index = random.choice(list[0])

    else:
        # else we would simply return the highest score
        response_index = np.where(cosine == max)[0][0] + 2 
       

    j = 0

    # loop to return the next cell on the row , ( the response cell )
    with open(csv_file_path, "r") as sentences_file:
        reader = csv.reader(sentences_file, delimiter=',')
        for row in reader:
            j += 1  # we begin with 1 not 0 &    j is initialized by 0
            if j == response_index:
                return row[1], response_index,
                break

Let's just call this function in a simple interface. We have created the function to return both the line of the response and the response itself , we would only view the response string to the user:

while 1:
    sentence = raw_input("talk to Lina : ")

    response_primary, line_id_primary = talk_to_lina_primary(sentence)
    print response_primary
    print

Hope this was fun, wait for the next tut. It would be even more exciting, we would make Lina able to answer user's questions, we would use:

CKY parsing
simple regex
web scrapping from using beautiful soap

duck duck go
yahoo answers

History

10^th September, 2017: Initial version
11^th March, 2021: Updated