[tut 2] Lina ChatBot - Question Answering using Parsing CKY & Web Scrapping

The Zakies

4.80/5 (5 votes)

Sep 12, 2017

CPOL

11 min read

15602

483

Question answer chatbot using natural language parsing and web scrapping

Series Introduction

tut1 retrieval based chatbot

This is Part 2 of a series of multiple tuts about natural language and machine learning, we would manually create a chatbot that is able to:

chat with users
answer their questions
extract data about users
perform some actions (recommending movie, telling time ....)

The goal of this series is not just to create a chatbot, but it is actually about viewing one practical project that can be achieved using NLP & ML, as my primary goal would be to present both ML & NLP in an interesting way which is to create a chatbot.

In this series, we would cover some techniques, both simple and advanced to reach our final goal. They are:

Document retrieval using TF-IDF
Document retrieval using Word2Vect
cky Parsing
web scraping

I hope that through this series, you would come to love NLP & ML through this fun project.

We would like to call our Chatbot Lina :D

Introduction

This is Part 2 of the series describing how to build a chatbot. Today, we would discus how to build a system for:

identifying questions from normal sentences
looking for answer online

To see last tut describing how to create a retrieval based chatbot, check this link.

We would use primarily two techniques for this:

natural language parsing using an algorithm named CKY
web scrapping from multiple websites using beautiful soap library

In this tut, we would give a quick brief on parsing as general, and a quick brief of how the cky natural language parsing algorithm works, and then we would start the fun part of working on our ChatBot to make use out of parsing and web scrapping.

Attached are two files:

tut2 only
tut1 + tut2

Technical Background Agenda (a)

This section is optional if you need to know more about the subject. If you need to get started with the code, check Implementation.

Why parsing & not classification & not simple regex
Why parsing
What is PCFG
What is CKY parsing algorithm

Implementation Agenda (b)

If you would like to skip the technical background and start with the implementation, take a look at the below links:

Cky algorithm from stat_parser
Parsing getting tree output then traverse to get string
Regex through already known parses
Calling Duck Duck Go API
Web scraping from Yahoo answers
Call the functions in a while 1 loop
Connect to prev tut

References

CKY implementation by emilmont
Dan Jurafsky & Chris NLP Playlist

Libraries

Technical Background

1a - Why Parsing & Not Classification & Not Simple Regex?

We would need to identify questions, so for doing so, if we used classification techniques (any classification algorithm like Naive bayes, for example), we would require to train our classifier on question structure (like questions of how many, what is color of, who is .....)

But it doesn't stop at this, as your classifier must understand that:

how much in the same class as how many and
how old in the same class as how big
what is in the same class as who is

This would end up having an enormous number of train sentences, or if you tried regex, you would end up having an enormous number of rules that actually are the same.

So both methods of classification & simple regular expressions won't achieve being highly efficient.

2a - Why Parsing?

As we have seen in the previous examples:

how much & how many are actually the same, as much & many are both adjectives
how old & how big are actually the same, as old & big are both adjectives, actually that is exactly the same as previous point
what is & who is are actually the same, as much & many are both adjectives

So we can like replace the actual word of (big , old ....) and replace them by another word (called terminal) that describes that this word is an adjective.

This is precisely what the parsing does, it tries to know what this belongs to, is it a noun, verb, adjective, adverb.

This task is called Natural Language Parsing, as there are many uses for parsing, as any high level language has its own parsing system to convert it to assembly code, but for our task here, we would use parsing but on Natural Language.

There is a well known algorithm for this called CKY, but it works with PCFG , but what are PCFG?

3a - What is PCFG ?

PCFG stands for probability context free grammar:

grammar stands for they are the grammar of words inside a language, how the words are arranged inside a language, like the rules of a certain language
context free stands for that grammar is built in a non intersecting way
probability stands for how probable a certain grammar rule is over the other

All rules here form what we call grammar.

It simply describes that there are two ways to create a subject.

When NP (noun phrase), then a VP (verb Phrase) comes after it. This would create a subject, and this rule counts for a probability of 0.9.
Just by having a verb phrase, this would create a subject with prob of 0.1.

The same goes for all other non terminals.

4a - What is CKY Parsing Algorithm?

CKY parsing algorithm stands for the method of parsing natural language sentences , it is developed using dynamic programming , it uses PCFG that we have previously discussed.

We would go briefly on the very concpet of how it is built , as building the cky algorithm itself is not our goal , our goal is to put it in a good use in our chatbot , but it is extremely important to know how it is built , so lets begin !!

There are two main points for CKY to work properly:

CKY only uses PCFG that are Binary (that only contains 2 terms on right hand side) so if you have a rule with more than 2 terms on the right hand side, you must break them to multiple binary rules.
Assume we have a dictionary which contains each word and the prob that it is a noun, verb, adjective .....

Assume we need to parse a sentence "people fish in the sea". To make it simpler, we would only parse as if the sentence was "people fish".

Assume give grammar (PCFG grammar) is:

CKY works with something called a parse triangle (chart). It is just a way of visualizing how cky parses the sentence:

Here, we assume that:

People is (from the predefined dictionary of words with their probability):
- Noun Phrase with prob of 0.35
- verb with prob of 0.1
- Noun with prob of 0.5 ==> this makes sense as people is after all a noun
Fish is (from the predefined dictionary of words with their probability)
- Verb Phrase with prob of 0.06
- Noun Phrase with prob of 0.14
- verb with prob of 0.6 ==> this makes sense as people is after all a verb
- Noun with prob of 0.2

But, when choosing our pair of terminals (VP, NP, V, N), we don't just choose the max out of all prob, but we test on all Possible grammar rules pairs to get max.

So in the previous example, we chose to begin testing with NP → NP NP , since both people and fish can be NP (that is with different prob, of course).

So we would go to the next higher level to get the prob that such a rule could be generated, so to do so, we multiply:

prob of people being NP
prob of fish being NP
prob of the rule NP → NP NP itself

This would result in 0.35 x 0.14 x 0.1 = 0.0049.

But we won't just stop here, we must continue till we loop through all possible rules that "people fish" can make.

So let's choose VP → V NP. This would output 0.007 (we simply write NP as our previous guess):

Then continue on with choosing NP & VP to create S.

S → NP VP, this would create a prob of 0.0189 (we also write the previous guesses).

BUT in

CKY we also take unary PCFG into consideration, so in each step (after the first non-terminal
step, you see if the output of this rule can be generated by a unary rule or not).

S → VP of prob 0.0007

So we have two methods for creating S:

either using both NP and VP with prob 0.0189
or by only using the already generated VP with prob 0.0007

So we only use the one with the highest prob and discard the other.

So now, after looping through all possible combinations, we have 3 terminals as our output, 3 possible guesses for how to parse "people fish".

noun phrase with prob 0.0049
verb phrase with prob 0.007
complete sentence with prob 0.0189

Here, CKY would choose to parse "people fish" as complete sentence is composed of noun phrase "people" and a verb phrase "fish".

We then use this concept through the whole parse triangle until we reach its top. At each step, we go to the next upper level till we reach the top of the triangle. In this way, we would have achieved a way of parsing sentences in square of time not in an exponential one.

Implementation

1b - cky Algorithm from stat_parser

There is a truly smart implementation developed by emilmont found on github for cky parsing. It is truly unique as it doesn't contain pre defined pcfg, but it learns it from QuestionBank and Penn treebanks.

This makes the learnt PCFG truly realistic, not simple arbitrary probabilities for each rule, but true learnt values. This reflects perfectly on the results that this code outputs.

2b - Parsing to Get Tree Output Then Traverse to Get String

First, we would import the code for cky & nltk library as NLTK would help in storing of the tree:

# _____for parsing_____
from stat_parser.parser import Parser, display_tree
import nltk

Let's define a function that takes a sentence and displays the output of parse as a tree:

def parse(sentence):

    # first we create a parser obj
    parser = Parser()
    
    # put parsing in a try and catch 
    try:
        # actually parse the sentence
        tree = parser.parse(sentence)
    except:
        return False, ""

    # display the tree as string
    print tree

    # display the tree in a gui form
    display_tree(tree)

This would be the tree gui when parsing "can you tell me who is Obama".

But:

to actually work on the output, it would be much easier if we could have the output just:

- MD - PRP - VB - PRP - WP - VBZ - NNP -

So we would need to traverse through the tree, so we use a recursive function for traversing:

def parse(sentence):

   parser = Parser()

   try:
     tree = parser.parse(sentence) 
   except: 
     return False, ""
 
   print tree
   display_tree(tree)

   #just call the function through passing tree[i]

    for i in range(len(tree)):
        traverse(tree[i], 0)

Now let's see the function itself:

# this function uses a global variable array to store the leaves
tree_output = []

# and a dictionary to save nouns
imp_list_array = {'Noun': []}


def traverse(parent, x):
    # all our code is written in a try catch clauses
    try:
        # we loop through all children within the passed node
        for node in parent:

            # if this node is a tree , then loop till you reach the leaves
            if type(node) is nltk.Tree:

                # if the node is the root  ==> do nothing, 
                # as we will have reached our final goal
                if node.label() == 'ROOT':
                    # "======== Sentence ========="
                    # print "Sentence:", " ".join(node.leaves()) , " +  
                    # type " , node.label()
                    a = 6


                # else , call the function again on this node ==> till reaching leaves
                else:
                    element_type = node.label()
                    element_value = node.leaves()[0]
                    element_sentence = node.leaves()

                    # just a condition i have written t extract nouns
                    if str(element_type) == 'NN' or str(element_type) == \
                           'NNS' or str(element_type) == 'NNP' or str(
                            element_type) == 'NNPS':
                        imp_list_array['Noun'].append(str(element_value))

                   
                    # call the function again , recursively
                    traverse(node, x)
            else:
                # actually add the leaves to the array 
                tree_output.append(parent.label())
 


    # if the above code failed , simply do nothing
    except:

        tree_output.append('NN')

Since we are using global variables, it would be better if we clear these variables at each parse.

def parse(sentence):
    while len(tree_output) > 0:
        tree_output.pop()

    parser = Parser()
    try:
        tree = parser.parse(sentence)
        #print tree
    except:
        return False, ""

    # display_tree(tree)
    #print("parse succeeded")

    for i in range(len(tree)):
        traverse(tree[i], 0)

3b - Regex through Already Known Parses

def parse(sentence):
    while len(tree_output) > 0:
        tree_output.pop()

    parser = Parser()
    try:
        tree = parser.parse(sentence)
         
    except:
        return False, ""  

    for i in range(len(tree)):
        traverse(tree[i], 0)
 
    tree_output_str = ""

    # first we would convert the array to a string 
    for a in tree_output:
        tree_output_str += " - " + a

    print  tree_output_str

    # here we would save the parses that we are interested in
    special_parses = [
        "WRB - JJ - NNS",  # how many Leopards
        "WRB - JJ - JJ",  # how many leopards
        "WRB - JJ - VBP - DT - NN",  # how big are the pyramids
        "WRB - JJ - VBZ - JJ",  # how old is obama
        "WRB - JJ - VBZ - NNP",  # how old is Obama
        "WRB - JJ - NN - VBP - NNP - VBP",  # how much money do Bill have
        "WRB - VBP - DT - NN",  # where are the pyramids
        
        "WP - VBD - DT - NN",  # who won the champions last week  #when was the 
                                                                  #tv first invented
     
        "WP - VBP - DT - NN",  # what are the pyramids
        "WP - VBZ - DT - NN - IN - NN",  # what is the capital of egypt

        "WDT - NNS",  # which companies are the biggest ,
        "WRB - VBZ - NN",  # where is egypt
        "WRB - VBZ - DT - NN" ,    # where is the usa

        "WP - VBZ - NNP",  # what is Egypt
        "WP - VBZ - JJ",  # what is egypt
        "WRB - VBD - NNP",  # when did Bayern
        "WP - VBZ - NN" ,  # what is indonesian      

         "WP - VBZ - DT - JJ - NN - IN - NN"  #who is the current president of usa
    ]

    try:
        # add all elements in special_parses as a regex expression
        # putting "|" between them
        regex = reduce(lambda x, y: x + "|" + y, special_parses)
         
        # do the actual regex
        # and get the position where the question is within the sentence
        # as a user can ask "could you tell me who is Obama"
        # we would only need to get "who is Obama"
        pos_tree_output = tree_output_str.index\
                          (re.search(regex, tree_output_str).group(0))
        pos_var = len(tree_output_str.replace('-', '').split()) - len(
            tree_output_str[pos_tree_output:].replace('-', '').split())
         
        # do the extraction itself from the index known from previous lines
        fact_question = ' '.join(sentence.split()[pos_var:])

        print("it is a fact question")
        # ----scrap solution from Internet-------#
        # ----we would scrap from 2 sources------#
        
        # duck duck go
        answr_text = look_in_duck_go(fact_question)
                                                                       
        #if not found in duck duck go look inside yahoo answers
        if answr_text=="":
            # yahoo answers
            answr_text = look_in_yahoo_answers(fact_question)
            
        return True, answr_text

    except:
        print("not fact question")
        return False, "not fact question"

4b - Calling Duck Duck Go API

Duck Duck Go provides a simple API (without even having API key) to get answers to questions, but we first must import some libraries:

# _____for scrapping_____
import re
from bs4 import BeautifulSoup
import requests

Duck Duck Go returns answer either in XML or JSON, we would choose XML, and we would use BeautifulSoup to extract data out of it.

The XML response would look like:

We would extract the abstract tag, and image tag.

But if abstract tag is empty, we would try to look for text tag.

Now let's write the function itself:

def look_in_duck_go(fact_question):
      
      try:
          # we would call the api , attach the question , ask for xml format
          base_address = "https://api.duckduckgo.com/?q=" + fact_question + "&format=xml"
 
          # get the webpage
          super_page = requests.get(base_address)

          # pass it to Beautiful soap
          soup_super_page = BeautifulSoup(super_page.content, "xml")


          # extract all Abstract tag , and use the first one
          answer = soup_super_page.findAll('Abstract')[0].text

          # extract the image
          Image = soup_super_page.findAll('Image')[0].text

          # if the abstract tag is empty , look for the text tag
          if (answer == ""):
              answer = soup_super_page.findAll('Text')[0].text
          print("found in duck duck go")
      
     except:
          answer=""
      return  answer

5b - Web Scraping from Yahoo Answers

Duck Duck Go only answers what & who questions, so there would be many other questions left unanswered, so we would use Yahoo answers for that, given its huge community, most of the questions can be found.

This is how you can use Yahoo answers:

https://answers.search.yahoo.com/search?fr=uh3_answers_vert_gs&type=2button&p=your_question

This would return:

We would use the answer found inside the first link.

This is the link in HTML:

 <a class="<code> lh-17</code> fz-m"
 href="https://answers.yahoo.com/question/index;_ylt=AwrC1C7U5rdZzxMA1QFPmolQ;
_ylu=X3oDMTEybTFvb2wxBGNvbG8DYmYxBHBvcwMxBHZ0aWQDQjI1NTlfMQRzZWMDc3I-?
     qid=20091221094648AAFlRbE" 
referrerpolicy="unsafe-url" target="_blank" id="yui_3_10_0_1_1505224373621_408">

<b>how</b> 
<b id="yui_3_10_0_1_1505224373621_431">many</b> 
<b>pyramids</b>
<b>in</b> 
<b id="yui_3_10_0_1_1505224373621_407">Egypt</b>
?

</a>

What is really important is the class name which is lh-17

We extract the link (href).

So our function would be:

def look_in_yahoo_answers(fact_question):

      # to replace empty spaces with %20
      fact_question = fact_question.replace(' ' , '%20')

      # write the website path
      base_address ="https://answers.search.yahoo.com/search?
                     fr=uh3_answers_vert_gs&type=2button&p="+fact_question
     
      #access it
      super_page = requests.get(base_address) 

      # load it into BeautifulSoup
      soup_super_page = BeautifulSoup(super_page.content, "lxml")

      # get the first link ==> <code>lh-17
     </code># and extract the href attribute
      answr_link = soup_super_page.findAll('a', {'class' : 'lh-17'})[0].attrs["href"]

      #--------------------------------------------------------#
      #--------------- we still need to get real answer--------#
      #--------------------------------------------------------# 

      return answr_text

Then after getting the link of the answer, this is how it would look like:

We would need to get the best answer which looks in HTML like below:

<span itemprop="text" class="<code>ya-q-full-text</code>" 
id="yui_3_17_2_5_1505225015420_1884">
 
There are more than one hundred pyramids. <br>
The reason the exact number is not fixed is that
 there are unfinished or not in good shape pyramids. 
So it is debatable for some of them if they would be counted as pyramids or not. <br>
The perfect ones are the three pyramids of Giza. 

</span>

We are interested in the class name which is ya-q-full-text.

So we extract its text:

def look_in_yahoo_answers(fact_question):
      fact_question = fact_question.replace(' ' , '%20')

      base_address ="https://answers.search.yahoo.com/search?
                     fr=uh3_answers_vert_gs&type=2button&p="+fact_question
     
      super_page = requests.get(base_address)#, headers=headers)

      soup_super_page = BeautifulSoup(super_page.content, "lxml")

      answr_link = soup_super_page.findAll('a', {'class' : 'lh-17'})[0].attrs["href"]

      #-------------------------------------------------#

      # get the page corresponding to the link we have just extracted
      answer_page = requests.get(answr_link)

      # load page into BeautifulSoup
      soup_answer_page = BeautifulSoup(answer_page.content, "lxml")

      # extract the answer
      answr_text = soup_answer_page.findAll('span', 
                   {'class' : '<code>ya-q-full-text</code>'})[0].text

      print("found in yahoo answers")

      # return the answer
      return answr_text

6b - Call the Functions in a While 1 Loop

Let's put all the above code in a user friendly interface:

while 1:

    # empty the global variables
    tree_output=[]
    tree_output_str =""

    fact_question = raw_input("talk to Lina : ")

    # just call the parse function
    answr_text = parse(fact_question)


    print answr_text
    print

7b - Connect to Prev tut

Download _tut_1___2_Lina__tf-idf___parsing_scrapping__.zip

Here, we would create a simple Python script that would use tut1 & tut2, to create a more complete chatbot. This chatbot would now be able to:

chat with users using retrieval modal using TF-IDF (see prev tut for more info)
answer user's questions using parsing & web scraping

So let's start with a Python script called tut_1_2.

import tf_idf # tut1
import parsing_scrapping  # tut2

We would comment all while 1 in both tf_idf & parsing_scrapping (disable while 1 in inner code) and create a new while 1 in tut_1_2.

We will start with parsing module before using TF-IDF, as if the question is a fact question this module would easily identify it as fact question while TF-IDF would try its best to respond as if it is a normal chat, we don't need this, so we would start with the parsing module:

while 1:
    sentence = raw_input("talk to Lina : ")
    
    # start with parsing & scrapping
    # empty the global variables
    parsing_scrapping.tree_output=[]
    parsing_scrapping.tree_output_str =""
   
    # call the parsing module
    # it would return 2objs
    # answr_text[0] bool either fact question or not
    # answr_text[1] the answer 
    answr_text = parsing_scrapping.parse(sentence) 

    # if it is a fact question
    if answr_text[0] :
         print answr_text[1]

    # if normal chat
    else :
        response_primary, line_id_primary = tf_idf.talk_to_lina_primary(sentence)
        print response_primary

    print

Note:

Please tell me what do you think about this tut, tell me your review and vote this tut. I also would suggest seeing last tut, and wait for coming tuts isA