“In mathematics you don’t understand things. You just get used to them.”
John von Neumann
When the path from data to results was first introduced in Figure 1-1, it may not have been clear how there would ever be a way forward. Throughout this book, we have focused on introducing basic principles of feature engineering using toy models and clean, simple datasets. These examples were intended to be illustrative and enlightening.
Machine learning examples generally show the best-case scenario and results. This masks the path we have described thus far in the book. Now that the foundation is set, we are leaving the world of simple, toy data and diving into the process of feature engineering with a real-world, structured dataset. As we move through each step, we will be examining the raw data forming each feature, what the transformed feature becomes, and what trade-offs we make along the way.
To be clear, our goal for this example is not to build the best model for this dataset. Rather, it is to demonstrate the practical application of a handful of our techniques, as well as how to more deeply examine and understand whether each technique is providing value to the model one is building.
Our task will be to build a recommender for academic papers using a subsample of the Microsoft Academic Graph dataset. This should come in extremely handy for all of you who are searching for citations but have not yet discovered Google Scholar. Here are some relevant statistics about the dataset:
The dataset is designed to be easy to store and access in a database. It is not tidy for machine learning models out of the box, but requires some initial wrangling. Some teachers like to spare you this step, boosting your ego by getting directly to the models and results. None of that here. We are starting together from the very beginning.
Our initial approach will be to wrangle a few variables into the right shape to push through an item-based collaborative filter. We will see if reasonably similar papers can be found in a timely and efficient manner.
This approach was first developed at Amazon as an improvement to user-based algorithms for recommending products. Sarwar et al. (2001) walk through the challenges and benefits of switching the perspective in recommenders from the user to the item.
Item-based collaborative filtering provides recommendations based on the similarity between items. This works in two stages: first finding the similarity scores between items, then ranking all scores to find the top-N similar item recommendations.
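To make the two stages concrete, here is a minimal sketch on a tiny, made-up item-feature matrix (the items and feature values are invented purely for illustration): it scores every item pair with cosine similarity, then ranks the scores to pull out the top-N neighbors of a query item.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Stage 1: score item-to-item similarity.
# Rows are items, columns are features (toy values for illustration only).
toy_items = np.array([[1, 0, 1, 0],
                      [1, 0, 1, 1],
                      [0, 1, 0, 1]])
sim = cosine_similarity(toy_items)            # 3 x 3 similarity matrix

# Stage 2: rank the scores for one query item and keep the top-N neighbors.
query, top_n = 0, 2
ranked = np.argsort(sim[query])[::-1]         # best scores first
top_matches = [i for i in ranked if i != query][:top_n]
print(top_matches)                            # [1, 2] for this toy matrix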
Like all good science experiments, we will start off with a hypothesis. In this case, we assume that papers published at about the same time and in similar fields of study will be the most useful to users. We will take a naive approach of parsing out these fields from a subsample of the overall dataset. After generating simple sparse arrays, we’ll run the entire item array through an item-based collaborative filter to see if we get good results.
The item-based collaborative filter depends on a similarity score to compare items. In this case, the cosine similarity provides a reasonable comparison between two non-zero vectors. The following example actually uses the cosine distance, which is the complement of the cosine similarity in the positive space, or:
D_C(A, B) = 1 − S_C(A, B)

where D_C is the cosine distance and S_C is the cosine similarity.
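As a quick sanity check of that relationship, the short sketch below (using two made-up vectors) confirms that SciPy's cosine function returns the distance, that is, one minus the similarity.

import numpy as np
from scipy.spatial.distance import cosine

a = np.array([1.0, 2.0, 3.0])    # made-up vectors for illustration
b = np.array([2.0, 4.0, 0.0])

similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
distance = cosine(a, b)          # SciPy's cosine() is the *distance*

assert np.isclose(distance, 1 - similarity)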
The first step in our journey is to import and examine the dataset. In Example 9-1, we scope our experiment by limiting the fields available after the initial import. These fields are still rich in possibility, as shown in Figure 9-1.
>>> import pandas as pd
>>> model_df = pd.read_json('data/mag_papers_0/mag_subset20K.txt', lines=True)
>>> model_df.shape
(20000, 19)
>>> model_df.columns
Index(['abstract', 'authors', 'doc_type', 'doi', 'fos', 'id', 'issue',
       'keywords', 'lang', 'n_citation', 'page_end', 'page_start', 'publisher',
       'references', 'title', 'url', 'venue', 'volume', 'year'],
      dtype='object')
# filter out non-English articles and focus on a few variables
>>> model_df = model_df[model_df.lang == 'en'] \
...     .drop_duplicates(subset='title', keep='first') \
...     .drop(['doc_type', 'doi', 'id', 'issue', 'lang', 'n_citation',
...            'page_end', 'page_start', 'publisher', 'references',
...            'url', 'venue', 'volume'], axis=1)
>>> model_df.shape
(10399, 6)
Table 9-1 summarizes how further wrangling is needed to get the raw data into better shape for a model. Lists and dictionaries are good for data storage, but they are not tidy or well suited for machine learning without some unpacking (Wickham, 2014); a small sketch of that unpacking follows the table.
| Field name | Description | Field type | # NaN |
|---|---|---|---|
| abstract | paper abstract | string | 4393 |
| authors | author names and affiliations | list of dict, keys = name, org | 1 |
| fos | fields of study | list of strings | 1733 |
| keywords | keywords | list of strings | 4294 |
| title | paper title | string | 0 |
| year | published year | int | 0 |
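As a sense of what that unpacking looks like, here is a small sketch with two invented rows that mimic the authors field; it flattens a list-of-dicts column into one tidy row per author (it assumes a recent pandas, where explode and json_normalize are available).

import pandas as pd

# Two invented rows that mimic the structure of the 'authors' field
toy = pd.DataFrame({
    'title': ['Paper A', 'Paper B'],
    'authors': [[{'name': 'Ada', 'org': 'Univ X'}, {'name': 'Bo'}],
                [{'name': 'Cy', 'org': 'Lab Y'}]],
})

# One row per author, then spread each author dict into columns
tidy = toy.explode('authors').reset_index(drop=True)
tidy = pd.concat([tidy.drop(columns='authors'),
                  pd.json_normalize(tidy['authors'].tolist())], axis=1)
print(tidy)   # columns: title, name, org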
We focus first on two fields in Example 9-2, transforming them from lists and integers into a feature array, as shown in Figure 9-2.
>>> unique_fos = sorted(list({feature
...                           for paper_row in model_df.fos.fillna('0')
...                           for feature in paper_row}))
>>> unique_year = sorted(model_df['year'].astype('str').unique())

>>> def feature_array(x, var, unique_array):
...     row_dict = {}
...     for i in x.index:
...         var_dict = {}
...         for j in range(len(unique_array)):
...             if type(x[i]) is list:
...                 if unique_array[j] in x[i]:
...                     var_dict.update({var + '_' + unique_array[j]: 1})
...                 else:
...                     var_dict.update({var + '_' + unique_array[j]: 0})
...             else:
...                 if unique_array[j] == str(x[i]):
...                     var_dict.update({var + '_' + unique_array[j]: 1})
...                 else:
...                     var_dict.update({var + '_' + unique_array[j]: 0})
...         row_dict.update({i: var_dict})
...     feature_df = pd.DataFrame.from_dict(row_dict, dtype='str').T
...     return feature_df

>>> year_features = feature_array(model_df['year'], 'year', unique_year)
>>> fos_features = feature_array(model_df['fos'], 'fos', unique_fos)
>>> first_features = fos_features.join(year_features).T

>>> from sys import getsizeof
>>> print('Size of first feature array: ', getsizeof(first_features))
Size of first feature array: 2583077234
We have now successfully turned a relatively small dataset, ~10K rows of raw data, into 2.5 GB of features. But this path is too sluggish for quick, iterative exploration. We need methods that are faster and that produce features which consume fewer computational resources and less experimentation time.
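Before reworking anything, it is worth confirming where the memory actually goes. Here is a quick, hypothetical audit, assuming model_df from Example 9-1 and first_features from Example 9-2 are still in memory; memory_usage(deep=True) counts the Python objects stored inside each column, which a bare getsizeof can understate.

# Rough memory audit: raw DataFrame vs. the dense one-hot feature array
raw_mb = model_df.memory_usage(deep=True).sum() / 1e6
feat_mb = first_features.memory_usage(deep=True).sum() / 1e6
print('raw data: %.1f MB, one-hot features: %.1f MB' % (raw_mb, feat_mb))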
For now, though, let’s see how our current features perform at giving us a good recommendation in the next stage (Example 9-3). We’ll define a “good” recommendation as a paper that looks similar to the input.
>>> from scipy.spatial.distance import cosine

>>> def item_collab_filter(features_df):
...     item_similarities = pd.DataFrame(index=features_df.columns,
...                                      columns=features_df.columns)
...     for i in features_df.columns:
...         for j in features_df.columns:
...             item_similarities.loc[i][j] = 1 - cosine(features_df[i],
...                                                      features_df[j])
...     return item_similarities

>>> first_items = item_collab_filter(first_features.loc[:, 0:1000])
Why does it take so long to calculate the item similarities using only two features? We are taking the dot product of a 10,399 × 1,000 matrix using a nested for loop. The time per loop increases as we add more observations to the model. Remember, this is a subset of the total available dataset, filtered for English-only papers. As we move closer to a “good” result, we’ll need to go back and test on the larger set for our best results.
How can we make this faster? Since we only need one result at a time, we can change our function so that we only calculate one item at a time, specifying the number of top results we want. We’ll do this later, as we continue to move through our experiment. For now, it is useful to see the full feature space to get an understanding of the impact of iterative work on brute-forcing our way through a real-world dataset.
We need to get a better idea of how these features will translate to us getting a good recommendation. Do we have enough observations to move forward? Let’s plot a heatmap (Example 9-4) to see if we have any papers that are similar to each other. Figure 9-3 shows the result.
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> import numpy as np
>>> %matplotlib inline
>>> sns.set()

>>> ax = sns.heatmap(first_items.fillna(0),
...                  vmin=0, vmax=1,
...                  cmap="YlGnBu",
...                  xticklabels=250, yticklabels=250)
>>> ax.tick_params(labelsize=12)
Darker pixels signal items that are similar to one another. The dark diagonal line shows that the cosine similarity is correctly indicating that each paper is most similar to itself. However, because there are a lot of NaNs for one of our features, the line is broken along the diagonal. We can see that while most of the items are not similar to one another—i.e., our dataset is fairly diverse—there are some other high-scoring candidates. These may or may not be good recommendations qualitatively, but at least we can see that our methods are not so mad.
Example 9-5 shows how to translate these item similarities into a recommendation. The good news is that we have a wide variety of features still available, with lots of room for improvement.
>>> def paper_recommender(paper_ix, items_df):
...     print('Based on the paper:\nindex = ', paper_ix)
...     print(model_df.iloc[paper_ix])
...     top_results = items_df.loc[paper_ix].sort_values(ascending=False).head(4)
...     print('\nTop three results: ')
...     order = 1
...     for i in top_results.index.tolist()[-3:]:
...         print(order, '. Paper index = ', i)
...         print('Similarity score: ', top_results[i])
...         print(model_df.iloc[i], '\n')
...         if order < 5: order += 1

>>> paper_recommender(2, first_items)
Based on the paper:
index = 2
abstract NaN
authors [{'name': 'Jovana P. Lekovich', 'org': 'Weill ...
fos NaN
keywords NaN
title Should endometriosis be an indication for intr...
year 2015
Name: 2, dtype: object
Top three results:
1 . Paper index = 2
Similarity score: 1.0
abstract NaN
authors [{'name': 'Jovana P. Lekovich', 'org': 'Weill ...
fos NaN
keywords NaN
title Should endometriosis be an indication for intr...
year 2015
Name: 2, dtype: object
2 . Paper index = 292
Similarity score: 1.0
abstract NaN
authors [{'name': 'John C. Newton'}, {'name': 'Beers M...
fos [Wide area multilateration, Maneuvering speed,...
keywords NaN
title Automatic speed control for aircraft
year 1955
Name: 561, dtype: object
3 . Paper index = 593
Similarity score: 1.0
abstract This paper demonstrates that on‐site greywater...
authors [{'name': 'Eran Friedler', 'org': 'Division of...
fos [Public opinion, Environmental Engineering, Wa...
keywords [economic analysis, tratamiento desperdicios, ...
title The water saving potential and the socio-econo...
year 2008
Name: 1152, dtype: object
Yikes. The good news is that the most similar paper returned is the one we are looking for. The bad news is that the next two papers don’t seem to be very close to our initial search, even for the features we have chosen.
“Yes, yes,” you may say, “but this is the era of Big Data! That will solve our problems! Can’t we just push more data through for better results?” Potentially. But even Big Data cannot compensate for poor data and engineering choices.
Our current brute-force methods are too slow for smart, iterative engineering. Let’s try some of our new feature engineering tricks to see if we can speed up the computation time and find better features and a better way to search for results.
The initial approach of creating a large, sparse array and shoving it through a filter can be improved in many ways. The next steps will focus specifically on applying better techniques to the two initial features and altering the item-based collaborative filter method for faster iteration.
First, it is time to try out some of those great feature engineering tricks for the two variables in our hypothesis. Looking deeper into the features already developed, we can choose techniques that will address each type of variable and convert it to a “better” feature for our recommendation system.
Let’s focus on the year first. In “Quantization or Binning”, we reviewed how using raw counts for features can be problematic for methods that use similarity metrics. Example 9-6 (and Figure 9-5) examines how we can transform 'year' to better fit the model we have selected.
>>> print("Year spread: ", model_df['year'].min(), " - ", model_df['year'].max())
>>> print("Quantile spread:\n", model_df['year'].quantile([0.25, 0.5, 0.75]))
Year spread:  1831  -  2017
Quantile spread:
 0.25    1990.0
 0.50    2005.0
 0.75    2012.0
Name: year, dtype: float64
# plot years to see the distribution
>>> fig, ax = plt.subplots()
>>> model_df['year'].hist(ax=ax,
...                       bins=model_df['year'].max() - model_df['year'].min())
>>> ax.tick_params(labelsize=12)
>>> ax.set_xlabel('Year Count', fontsize=12)
>>> ax.set_ylabel('Occurrence', fontsize=12)
We can see from the skewed distribution (Figure 9-5) that this is an excellent candidate for binning.
The bins will be based on ranges within the variable, rather than the unique number of features. To further reduce the feature space, we will dummy-code the resultant bins (see Example 9-7). Pandas can do both using built-in functions. These methods will make our results easy to interpret, so we can do a quick check of the transformed features before moving on (see Figure 9-6).
# binning here (by 10 years) reduces the year feature space from 156 to 19
>>> bins = int(round((model_df['year'].max() - model_df['year'].min()) / 10))
>>> temp_df = pd.DataFrame(index=model_df.index)
>>> temp_df['yearBinned'] = pd.cut(model_df['year'].tolist(), bins, precision=0)
>>> X_yrs = pd.get_dummies(temp_df['yearBinned'])
>>> X_yrs.columns.categories
IntervalIndex([(1831.0, 1841.0], (1841.0, 1851.0], (1851.0, 1860.0],
               (1860.0, 1870.0], (1870.0, 1880.0] ... (1968.0, 1978.0],
               (1978.0, 1988.0], (1988.0, 1997.0], (1997.0, 2007.0],
               (2007.0, 2017.0]],
              closed='right',
              dtype='interval[float64]')
# plot the new distribution
>>> fig, ax = plt.subplots()
>>> X_yrs.sum().plot.bar(ax=ax)
>>> ax.tick_params(labelsize=8)
>>> ax.set_xlabel('Binned Years', fontsize=12)
>>> ax.set_ylabel('Counts', fontsize=12)
We have preserved the underlying distribution of the original variable through binning by decades. If we desired to use a method that would benefit from a different distribution, we could alter our binning choices to change how this variable presents itself to the model. Since we are using cosine similarity, this is fine. Let’s move on to the next feature we originally included in our model.
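For example, if a later method preferred bins with roughly equal numbers of papers rather than equal-width decades, a quantile-based cut would be a one-line change. This is only a sketch of the alternative, reusing temp_df and model_df from Example 9-7; the rest of the chapter keeps the decade bins.

# Quantile binning: each bin holds roughly the same number of papers,
# so the sparse early decades merge and the dense recent years split.
temp_df['yearQuantiled'] = pd.qcut(model_df['year'], q=10,
                                   duplicates='drop', precision=0)
X_yrs_quantiled = pd.get_dummies(temp_df['yearQuantiled'])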
The fields-of-study feature space contributed significantly to the original model’s size and processing time.
Let’s examine the work we have already done. By parsing out the list of strings, we created a “bag-of-phrases” in the first pass. Since we already have a useful sparse array, we can focus on using a more efficient data type. Example 9-8 illustrates how converting from a Pandas DataFrame to a NumPy array affects computation time.
>>> X_fos = fos_features.values

# We can see how this will make a difference in the future by looking
# at the size of each
>>> print('Our pandas Series, in bytes: ', getsizeof(fos_features))
>>> print('Our hashed numpy array, in bytes: ', getsizeof(X_fos))
Our pandas Series, in bytes: 2530632380
Our hashed numpy array, in bytes: 112
Much better! Putting it back together, we’ll pipe our features together (Example 9-9) and rerun our recommender (Example 9-10) to see if we have improved results, taking advantage of scikit-learn’s cosine similarity function. We will also reduce the computational time by only focusing on one item at a time.
>>> second_features = np.append(X_fos, X_yrs, axis=1)
>>> print("The power of feature engineering saves us, in bytes: ",
...       getsizeof(first_features) - getsizeof(second_features))
The power of feature engineering saves us, in bytes: 168066769
>>> from sklearn.metrics.pairwise import cosine_similarity

>>> def piped_collab_filter(features_matrix, index, top_n):
...     item_similarities = \
...         1 - cosine_similarity(features_matrix[index:index+1],
...                               features_matrix).flatten()
...     related_indices = \
...         [i for i in item_similarities.argsort()[::-1] if i != index]
...     return [(index, item_similarities[index])
...             for index in related_indices
...             ][0:top_n]
>>> def paper_recommender(items_df, paper_ix, top_n):
...     if paper_ix in model_df.index:
...         print('Based on the paper:')
...         print('Paper index = ', model_df.loc[paper_ix].name)
...         print('Title :', model_df.loc[paper_ix]['title'])
...         print('FOS :', model_df.loc[paper_ix]['fos'])
...         print('Year :', model_df.loc[paper_ix]['year'])
...         print('Abstract :', model_df.loc[paper_ix]['abstract'])
...         print('Authors :', model_df.loc[paper_ix]['authors'], '\n')
...         # define the location index for the DataFrame index requested
...         array_ix = model_df.index.get_loc(paper_ix)
...         top_results = piped_collab_filter(items_df, array_ix, top_n)
...         print('\nTop', top_n, 'results: ')
...         order = 1
...         for i in range(len(top_results)):
...             print(order, '. Paper index = ',
...                   model_df.iloc[top_results[i][0]].name)
...             print('Similarity score: ', top_results[i][1])
...             print('Title :', model_df.iloc[top_results[i][0]]['title'])
...             print('FOS :', model_df.iloc[top_results[i][0]]['fos'])
...             print('Year :', model_df.iloc[top_results[i][0]]['year'])
...             print('Abstract :', model_df.iloc[top_results[i][0]]['abstract'])
...             print('Authors :', model_df.iloc[top_results[i][0]]['authors'],
...                   '\n')
...             if order < top_n: order += 1
...     else:
...         print('Whoops! Choose another paper. Try something from here: \n',
...               model_df.index[100:200])

>>> paper_recommender(second_features, 2, 3)
Based on the paper:
Paper index = 2
Title : Should endometriosis be an indication for intracytoplasmic sperm inject ...
FOS : nan
Year : 2015
Abstract : nan
Authors : [{'name': 'Jovana P. Lekovich', 'org': 'Weill Cornell Medical College, ...
Top 3 results:
1 . Paper index = 10055
Similarity score: 1.0
Title : [Diagnosis of cerebral tumors; comparative studies on arteriography, ...
FOS : ['Radiology', 'Pathology', 'Surgery']
Year : 1953
Abstract : nan
Authors : [{'name': 'Antoine'}, {'name': 'Lepoire'}, {'name': 'Schoumacker'}]
2 . Paper index = 11771
Similarity score: 1.0
Title : A Study of Special Functions in the Theory of Eclipsing Binary Systems
FOS : ['Contact binary']
Year : 1981
Abstract : nan
Authors : [{'name': 'Filaretti Zafiropoulos', 'org': 'University of Manchester'}]
3 . Paper index = 11773
Similarity score: 1.0
Title : Studies of powder flow using a recording powder flowmeter and measure ...
FOS : nan
Year : 1985
Abstract : This paper describes the utility of the dynamic measurement of the ...
Authors : [{'name': 'Ramachandra P. Hegde', 'org': 'Department of Pharmacy, ...
To be honest, I don’t think our feature selection is working out too well. There is a lot of missing data in these fields. Let’s keep going to see if we can choose richer features with more information.
Our experiment thus far is not supporting the original hypothesis that year and fields of study would be sufficient to recommend a similar paper. At this point, we have a few options:

1. Push more data through the same pipeline, sampling more (or all) of the dataset.
2. Spend more time exploring the raw data.
3. Add more features to the current model.
The first option makes the assumption that the problem is in our sampling of the data. This might be the case, but is similar to Figure 9-4’s analogy of stirring the data pile for better results.
The second option would give a better idea of the underlying raw data. This should be continually revisited based on how your decisions for features and model selection change during the exploration process. The initial subsample chosen here reflects this step. Since we have more variables available in the dataset, we will not go back here yet.
This leaves the third option, moving forward on our current model by adding more features. Providing more information about each item can improve the similarity scores and result in better recommendations.
Based on our initial exploration, the next steps will focus on the fields with the most information, abstract and authors.
Looking back at Chapter 4, we can see that abstract is a good candidate for tf-idf to filter through the noise and find the salient associative words. We do this in Example 9-12.
# need to fill in NaN for sklearn use in future
>>> filled_df = model_df.fillna('None')

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
...                              stop_words='english')
>>> X_abstract = vectorizer.fit_transform(filled_df['abstract'])
>>> third_features = np.append(second_features, X_abstract.toarray(), axis=1)
We can reduce the computational load of the messy and uneven authors field by wrangling it into a list of dictionaries and then running those through a one-hot encoder, as shown in Example 9-13.
>>> authors_list = []

>>> for row in filled_df.authors.itertuples():
...     # create a dictionary from each Series index
...     if type(row.authors) is str:
...         y = {'None': row.Index}
...     if type(row.authors) is list:
...         # add these keys + values to our running dictionary
...         y = dict.fromkeys(row.authors[0].values(), row.Index)
...     authors_list.append(y)

>>> authors_list[0:5]
[{'None': 0},
 {'Ahmed M. Alluwaimi': 1},
 {'Jovana P. Lekovich': 2, 'Weill Cornell Medical College, New York, NY': 2},
 {'George C. Sponsler': 5},
 {'M. T. Richards': 7}]

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = authors_list
>>> X_authors = v.fit_transform(D)
>>> fourth_features = np.append(third_features, X_authors, axis=1)
Time to check in with the recommender to see how these new features are working out. Example 9-14 shows the results.
>>> paper_recommender(fourth_features, 2, 3)
Based on the paper:
Paper index = 2
Title : Should endometriosis be an indication for intracytoplasmic sperm inject ...
FOS : nan
Year : 2015
Abstract : nan
Authors : [{'name': 'Jovana P. Lekovich', 'org': 'Weill Cornell Medical College, ...
Top 3 results:
1 . Paper index = 10055
Similarity score: 1.0
Title : [Diagnosis of cerebral tumors; comparative studies on arteriography, ...
FOS : ['Radiology', 'Pathology', 'Surgery']
Year : 1953
Abstract : nan
Authors : [{'name': 'Antoine'}, {'name': 'Lepoire'}, {'name': 'Schoumacker'}]
2 . Paper index = 5601
Similarity score: 1.0
Title : 633 Survival after coronary revascularization, with and without mitral ...
FOS : ['Cardiology']
Year : 2005
Abstract : nan
Authors : [{'name': 'J.B. Le Polain De Waroux'}, {'name': 'Anne-Catherine ...
3 . Paper index = 12256
Similarity score: 1.0
Title : Nucleotide Sequence and Analysis of an Insertion Sequence from Bacillus ...
FOS : ['Biology', 'Molecular biology', 'Insertion sequence', 'Nucleic acid ...
Year : 1994
Abstract : A 5.8-kb DNA fragment encoding the cryIC gene from Bacillus thur...
Authors : [{'name': 'Geoffrey P. Smith'}, {'name': 'David J. Ellar'}, {'name': ...
Even accounting for missing data in certain fields, the top three results from the last round of feature engineering are directing us to other papers in the medical field.
The range of papers represented in this dataset is broad; for example, a random sample of papers exposed fields of study such as “Coupling constant,” “Evapotranspiration,” “Hash function,” “IVMS,” “Meditation,” “Pareto analysis,” “Second-generation wavelet transform,” “Slip,” and “Spiral galaxy.” Given that there are 7,604 unique fields of study listed for 10K+ papers, these last results seem to be moving in the right direction. We can be confident that our work is progressing toward a useful model.
Continued iteration on more text variables, such as finding the noun phrases in the paper titles or stemming the keywords, could bring us even closer to a “best” recommendation; the sketch below gives a taste of the stemming step.
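As a taste of that next iteration, here is a minimal sketch of stemming keyword phrases with NLTK's PorterStemmer; the keyword strings are invented stand-ins for entries from model_df['keywords'].

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()

# Invented keyword phrases standing in for one paper's 'keywords' entry
keywords = ['water saving', 'water saved', 'economic analysis']

# Stem each word inside each phrase so inflected variants collapse together
stemmed = [' '.join(stemmer.stem(word) for word in phrase.split())
           for phrase in keywords]
print(stemmed)   # the first two phrases now share the same stemmed form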
It should be noted here that this definition of “best” is the Holy Grail of all recommenders and search engines alike. We are searching for what a user will find most helpful, which may or may not be directly represented by the data. Feature engineering allows us to abstract salient features into representations such that algorithms can expose both the explicit and implicit information contained therein.
As you can see, building models for machine learning is easy. Building good models for useful results takes time and work. We hiked through the messy process of examining a collection of possible variables and experimenting with different feature engineering methods to achieve better results. We define “better” here not just in terms of good outcomes from training and testing, but also in terms of reducing the size of the model and the time it takes to iterate over different experiments.
We started this book by talking about how mastery of a subject comes from deeply learning the principles at work, in order to gain the intuition to put your knowledge to work effectively. We hope that this walkthrough has given you those tools, and has deepened your mathematical and computational understanding of why feature engineering is an essential skill for building useful machine learning models.
Sarwar, Badrul, George Karypis, Joseph Konstan, and John Riedl. “Item-Based Collaborative Filtering Recommendation Algorithms.” Proceedings of the 10th International Conference on the World Wide Web (2001): 285–295.
Sinha, Arnab, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. “An Overview of Microsoft Academic Service (MAS) and Applications.” Proceedings of the 24th International Conference on the World Wide Web (2015): 243–246.
Tang, Jie, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. “ArnetMiner: Extraction and Mining of Academic Social Networks.” Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008): 990–998.
Wickham, Hadley. “Tidy Data.” The Journal of Statistical Software 59 (2014).