Clustering using MiniBatch K-means clustering 

In this section, we are going to use an unsupervised learning technique: clustering. Specifically, we are going to cluster texts using the MiniBatch K-means algorithm. Let's first get some context.

Whenever researchers start working in a particular domain, they perform a literature review to understand the state of the art in that domain. The resulting study is referred to as a review paper. When writing such a review paper, you define a set of search keywords and run the search in several research paper indexing databases, such as scholar.google.com (https://scholar.google.com/). After searching several databases, you will have a list of relevant articles that you want to study. In this case, we have already performed the search, and the list of relevant articles is provided in the form of an Excel sheet. Note that each row in the Excel file contains some metadata about one paper.

You can find out more about the MiniBatch K-means clustering algorithm by looking at the official documentation of the sklearn library: https://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
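To make the idea concrete before turning to the real dataset, here is a minimal sketch of clustering short texts with MiniBatchKMeans. The titles below are hypothetical and used only for illustration, and the TF-IDF vectorization step, the number of clusters, and the batch size are assumptions chosen for this toy example rather than settings taken from this section.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical paper titles, used only for illustration.
titles = [
    "Deep learning for image classification",
    "Convolutional neural networks for image recognition",
    "Mobile health apps for depression treatment",
    "Smartphone interventions for mental health",
]

# Convert each title into a sparse TF-IDF vector (an assumed preprocessing step).
vectorizer = TfidfVectorizer(stop_words="english")
X = vectorizer.fit_transform(titles)

# Fit MiniBatch K-means; n_clusters=2 and batch_size=2 are arbitrary choices
# that suit this tiny example, not recommended defaults.
model = MiniBatchKMeans(n_clusters=2, batch_size=2, n_init=3, random_state=42)
labels = model.fit_predict(X)

# One cluster index per title.
print(labels)
```

Unlike standard K-means, MiniBatchKMeans updates the centroids from small random batches of the data, which trades a little cluster quality for much faster fitting on large collections.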

Having understood the context, let's load the dataset into our notebook. This should be no mystery to us by now:

  1. Let's load the Excel file: 
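# Data handling and numerical libraries
import pandas as pd
import numpy as np
# Plotting libraries
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import seaborn as sns

# Apply seaborn's default theme and enlarge the default figure size
sns.set()
plt.rcParams['figure.figsize'] = (14, 7)

# Read the preprocessed review-paper metadata directly from GitHub
df = pd.read_excel("https://github.com/sureshHARDIYA/phd-resources/blob/master/Data/Review%20Paper/acm/preprocessed.xlsx?raw=true")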
  2. Next, let's check the first 10 entries to understand what the data looks like:
df.head(10)

The output of the preceding code is as follows:

As we can see, there are several columns. We are only interested in the title of each research paper, so we'll focus only on the Title column.
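Selecting that column could look like the following sketch. It uses a small hypothetical DataFrame named sample in place of the loaded sheet; only the Title column name comes from the text, while the Year column and the step of dropping missing titles are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the loaded sheet; only the "Title" column name
# is taken from the text, the rest is made up for illustration.
sample = pd.DataFrame({
    "Title": ["A survey of text clustering", None, "Mini-batch methods for K-means"],
    "Year": [2015, 2016, 2017],
})

# Keep only the titles, drop missing values, and make sure each entry is a string.
titles = sample["Title"].dropna().astype(str).tolist()

print(titles)
```

With the real data, the same selection would be applied to df instead of sample.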
