Building a chatbot

Now, having seen what's possible in terms of chatbots, you most likely want to build the best, most state-of-the-art, Google-level bot out there, right? Well, just put that out of your mind for now, because we're going to start by doing the exact opposite. We're going to build the most amazingly awful bot ever!

This may sound disappointing, but if your goal is just to build something very cool and engaging (that doesn't take hours and hours to construct), this is a great place to start.

We're going to use training data derived from a set of real conversations with Cleverbot. The data was collected from http://notsocleverbot.jimrule.com. This site is perfect because people submit the most absurd conversations they have had with Cleverbot.

Let's take a look at a sample conversation between Cleverbot and a user from the site:

While you are free to use the techniques for web scraping that we used in earlier chapters to collect the data, you can find a .csv of the data in the GitHub repo for this chapter.

We'll start again in our Jupyter Notebook. We'll load, parse, and examine the data. We'll first import pandas and the Python regular expressions library, re. We're also going to set the option in pandas to widen our column width so we can see the data better:

import pandas as pd 
import re 
pd.set_option('display.max_colwidth',200)

Now we'll load in our data:

df = pd.read_csv('nscb.csv') 
df.head()

The preceding code results in the following output:

Since we're only interested in the first column, the conversation data, we'll parse that out:

convo = df.iloc[:,0] 
convo

The preceding code results in the following output:

You should be able to make out that we have interactions between User and Cleverbot, and that either can initiate the conversation. To get the data in the format we need, we'll have to parse it into question-and-response pairs. We aren't necessarily concerned with who says what, but with matching up each response to each question. You'll see why in a bit. Let's now do a bit of regular expression magic on the text:

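clist = []

def qa_pairs(x):
    # Grab the text after each "Speaker: " prefix, up to the end of the line
    cpairs = re.findall(r": (.*?)(?:$|\n)", x)
    # Pair every utterance with the one that follows it
    clist.extend(list(zip(cpairs, cpairs[1:])))

convo.map(qa_pairs);
# dict() keeps one answer per distinct question (the last one seen)
convo_frame = pd.Series(dict(clist)).to_frame().reset_index()
convo_frame.columns = ['q', 'a']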

The preceding code results in the following output:

OK, lots of code there. What just happened? We first created a list to hold our question-and-response tuples. We then passed our conversations through a function to split them into those pairs using regular expressions.

Finally, we set it all into a pandas DataFrame with columns labelled q and a.
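To make the pairing step concrete, here is a minimal sketch that runs the same regular expression on a single conversation. The sample text is invented for illustration and does not come from nscb.csv; only the standard library is needed.

```python
import re

# Hypothetical conversation, made up for this example (not from nscb.csv)
sample = "User: Hello.\nCleverbot: Are you a robot?\nUser: No, are you?"

# Capture the text after each "Speaker: " prefix, up to the end of the line
turns = re.findall(r": (.*?)(?:$|\n)", sample)

# Pair each utterance with the one that follows it, so every line except
# the last acts as a "question" and every line except the first as an "answer"
pairs = list(zip(turns, turns[1:]))

# Converting to a dict keeps a single answer per distinct question
lookup = dict(pairs)

for q, a in pairs:
    print(repr(q), "->", repr(a))
```

Notice that the middle line shows up twice: once as the answer to the first line and once as the question for the third. This is why we don't need to track who is speaking.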

We're now going to apply a bit of algorithm magic to match up the closest question to the one a user inputs:

from sklearn.feature_extraction.text import TfidfVectorizer 
from sklearn.metrics.pairwise import cosine_similarity 
 
vectorizer = TfidfVectorizer(ngram_range=(1,3)) 
vec = vectorizer.fit_transform(convo_frame['q'])

In the preceding code, we imported scikit-learn's TfidfVectorizer class and its cosine_similarity function. We then used our training data to create a tf-idf matrix built from unigrams, bigrams, and trigrams (ngram_range=(1,3)). We can now use the fitted vectorizer to transform our own new questions and measure their similarity to the existing questions in our training set. Let's do that now:

my_q = vectorizer.transform(['Hi. My name is Alex.']) 
 
cs = cosine_similarity(my_q, vec) 
 
rs = pd.Series(cs[0]).sort_values(ascending=False) 
top5 = rs.iloc[0:5] 
top5

The preceding code results in the following output:

What are we looking at here? This is the cosine similarity between the question I asked and the top-five closest questions. On the left is the index, and on the right is the cosine similarity. Let's take a look at those:

convo_frame.iloc[top5.index]['q']

This results in the following output:

As you can see, nothing is exactly the same, but there are definitely some similarities.
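If you're curious what those similarity scores measure, the following is a rough, standard-library-only sketch of the idea. It is a simplification and not a reimplementation of TfidfVectorizer: it uses single words only, a naive tokenizer, and three made-up questions. The helper names (tokenize, tfidf_vectors, cosine) are hypothetical and exist only for this illustration.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Naive tokenizer (an assumption for this sketch): lowercase word characters
    return re.findall(r"\w+", text.lower())

def tfidf_vectors(docs):
    # Weight each term count by a smoothed inverse document frequency
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    vectors = [{t: cnt * idf[t] for t, cnt in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf

def cosine(a, b):
    # Cosine of the angle between two sparse vectors stored as dicts
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Made-up questions standing in for the 'q' column
questions = ["What is your name?", "Do you like tacos?", "Are you a robot?"]
vectors, idf = tfidf_vectors(questions)

# Words the training questions never contained are simply dropped
counts = Counter(t for t in tokenize("Hi. My name is Alex.") if t in idf)
query = {t: c * idf[t] for t, c in counts.items()}

scores = [cosine(query, v) for v in vectors]
best = max(range(len(scores)), key=scores.__getitem__)
print(questions[best], round(scores[best], 3))
```

The query shares the words "name" and "is" with the first question and nothing with the other two, so the first question scores highest and the others score zero. A score of 1.0 would mean the two weighted word vectors point in exactly the same direction.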

Let's now take a look at the response:

rsi = rs.index[0] 
rsi 
 
convo_frame.iloc[rsi]['a']

The preceding code results in the following output:

OK, so our bot seems to have an attitude already. Let's push further.

We'll create a handy function so that we can test a number of statements easily:

def get_response(q): 
    my_q = vectorizer.transform([q]) 
    cs = cosine_similarity(my_q, vec) 
    rs = pd.Series(cs[0]).sort_values(ascending=False) 
    rsi = rs.index[0] 
    return convo_frame.iloc[rsi]['a'] 
 
get_response('Yes, I am clearly more clever than you will ever be!')

This results in the following output:

We have clearly created a monster, so we'll continue:

get_response('You are a stupid machine. Why must I prove anything to you?')

This results in the following output:

I'm enjoying this. Let's keep rolling with it:

get_response('Did you eat tacos?')

get_response('With beans on top?')

get_response('What else do you like to do?')

get_response('What do you like about it?')

get_response('Me, random?')

get_response("I think you mean you're")

Remarkably, this may be one of the best conversations I've had in a while, bot or not.

While this was a fun little project, let's now move on to a more advanced technique: sequence-to-sequence modeling.
