“In mathematics you don’t understand things. You just get used to them.”
John von Neumann
When the path from data to results was first introduced in Figure 1-1, it may not have been clear how there would ever be a way forward. Throughout this book, we have focused on introducing basic principles of feature engineering using toy models and clean, simple datasets. These examples were intended to be illustrative and enlightening.
Machine learning examples generally show the best-case scenario and results. This masks the path we have described thus far in the book. Now that the foundation is set, we are leaving the world of simple, toy data and diving into the process of feature engineering with a real-world, structured dataset. As we move through each step, we will be examining the raw data forming each feature, what the transformed feature becomes, and what trade-offs we make along the way.
To be clear, our goal for this example is not to build the best model for this dataset. Rather, it is to demonstrate the practical application of a handful of our techniques, as well as how to more deeply examine and understand whether each technique is providing value to the model one is building.
Our task will be to build a recommender for academic papers using a subsample of the Microsoft Academic Graph dataset (see Sinha et al., 2015, and Tang et al., 2008, for background on the dataset). This should come in extremely handy for all of you who are searching for citations but have not yet discovered Google Scholar.
The dataset is designed to be easy to store and access in a database. It is not tidy for machine learning models out of the box, but requires some initial wrangling. Some teachers like to spare you this step, boosting your ego by getting directly to the models and results. None of that here. We are starting together from the very beginning.
Our initial approach will be to wrangle a few variables into the right shape to push through an item-based collaborative filter. We will see if reasonably similar papers can be found in a timely and efficient manner.
This approach was first developed at Amazon as an improvement to user-based algorithms for recommending products. Sarwar et al. (2001) walk through the challenges and benefits of switching the perspective in recommenders from the user to the item.
Item-based collaborative filtering provides recommendations based on the similarity between items. This works in two stages: first finding the similarity scores between items, then ranking all scores to find the top-N similar item recommendations.
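As a rough sketch of those two stages (using tiny made-up item vectors rather than the paper dataset, and scikit-learn's cosine_similarity function, which we will meet again later in this chapter), one can compute a pairwise similarity matrix over the item vectors and then rank each item's scores to pull out its top-N neighbors:

>>> import numpy as np
>>> from sklearn.metrics.pairwise import cosine_similarity
>>> # three toy items described by four binary features (illustration only)
>>> items = np.array([[1, 0, 1, 0],
...                   [1, 0, 1, 1],
...                   [0, 1, 0, 1]])
>>> # stage 1: similarity scores between every pair of items
>>> sims = cosine_similarity(items)
>>> # stage 2: rank item 0's scores and keep the top-N neighbors (excluding itself)
>>> [int(i) for i in sims[0].argsort()[::-1] if i != 0][:1]
[1]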
Like all good science experiments, we will start off with a hypothesis. In this case, we assume that papers published at about the same time and in similar fields of study will be the most useful to users. We will take a naive approach of parsing out these fields from a subsample of the overall dataset. After generating simple sparse arrays, we’ll run the entire item array through an item-based collaborative filter to see if we get good results.
The item-based collaborative filter depends on a similarity score to compare items. In this case, the cosine similarity provides a reasonable comparison between two non-zero vectors. The following example actually uses the cosine distance, which is the complement of the cosine similarity in the positive space, or:
D_C(A, B) = 1 − S_C(A, B)
where D_C is the cosine distance and S_C is the cosine similarity.
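As a quick sanity check of that relationship (a minimal sketch with made-up vectors, not drawn from the paper dataset), SciPy's cosine function returns the cosine distance, so subtracting it from 1 recovers the similarity:

>>> import numpy as np
>>> from scipy.spatial.distance import cosine
>>> a = np.array([1, 0, 1, 0])   # toy vectors for illustration only
>>> b = np.array([1, 1, 0, 0])
>>> d_c = cosine(a, b)           # cosine distance, D_C(A, B)
>>> s_c = 1 - d_c                # cosine similarity, S_C(A, B) = 1 - D_C(A, B)
>>> print(round(s_c, 3))
0.5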
The first step in our journey is to import and examine the dataset. In Example 9-1, we scope our experiment by limiting the fields available after the initial import. These fields are still rich in possibility, as shown in Figure 9-1.
>>> import pandas as pd
>>> model_df = pd.read_json('data/mag_papers_0/mag_subset20K.txt', lines=True)
>>> model_df.shape
(20000, 19)
>>> model_df.columns
Index(['abstract', 'authors', 'doc_type', 'doi', 'fos', 'id', 'issue',
       'keywords', 'lang', 'n_citation', 'page_end', 'page_start', 'publisher',
       'references', 'title', 'url', 'venue', 'volume', 'year'],
      dtype='object')
# filter out non-English articles and focus on a few variables
>>> model_df = (model_df[model_df.lang == 'en']
...             .drop_duplicates(subset='title', keep='first')
...             .drop(['doc_type', 'doi', 'id', 'issue', 'lang', 'n_citation',
...                    'page_end', 'page_start', 'publisher', 'references',
...                    'url', 'venue', 'volume'], axis=1))
>>> model_df.shape
(10399, 6)
Table 9-1 summarizes the remaining fields and shows how further wrangling is needed to get the raw data into better shape for a model. Lists and dictionaries are good for data storage, but are not tidy or well suited for machine learning without some unpacking (Wickham, 2014).
| Field name | Description | Field type | # NaN |
|---|---|---|---|
| abstract | paper abstract | string | 4393 |
| authors | author names and affiliations | list of dict, keys = name, org | 1 |
| fos | fields of study | list of strings | 1733 |
| keywords | keywords | list of strings | 4294 |
| title | paper title | string | 0 |
| year | published year | int | 0 |
We focus first on two fields in Example 9-2, transforming them from lists and integers into a feature array, as shown in Figure 9-2.
>>> unique_fos = sorted(list({feature
...                           for paper_row in model_df.fos.fillna('0')
...                           for feature in paper_row}))
>>> unique_year = sorted(model_df['year'].astype('str').unique())

>>> def feature_array(x, var, unique_array):
...     row_dict = {}
...     for i in x.index:
...         var_dict = {}
...         for j in range(len(unique_array)):
...             if type(x[i]) is list:
...                 if unique_array[j] in x[i]:
...                     var_dict.update({var + '_' + unique_array[j]: 1})
...                 else:
...                     var_dict.update({var + '_' + unique_array[j]: 0})
...             else:
...                 if unique_array[j] == str(x[i]):
...                     var_dict.update({var + '_' + unique_array[j]: 1})
...                 else:
...                     var_dict.update({var + '_' + unique_array[j]: 0})
...         row_dict.update({i: var_dict})
...     feature_df = pd.DataFrame.from_dict(row_dict, dtype='str').T
...     return feature_df

>>> year_features = feature_array(model_df['year'], 'year', unique_year)
>>> fos_features = feature_array(model_df['fos'], 'fos', unique_fos)
>>> first_features = fos_features.join(year_features).T

>>> from sys import getsizeof
>>> print('Size of first feature array: ', getsizeof(first_features))
Size of first feature array: 2583077234
We have now successfully turned a relatively small dataset, ~10K rows of raw data, into 2.5 GB of features. But this path is too sluggish for quick, iterative exploration. We need methods that will be faster and result in features that will consume less computational resources and experimentation time.
For now, though, let’s see how our current features perform at giving us a good recommendation in the next stage (Example 9-3). We’ll define a “good” recommendation as a paper that looks similar to the input.
>>> from scipy.spatial.distance import cosine

>>> def item_collab_filter(features_df):
...     item_similarities = pd.DataFrame(index=features_df.columns,
...                                      columns=features_df.columns)
...     for i in features_df.columns:
...         for j in features_df.columns:
...             item_similarities.loc[i][j] = 1 - cosine(features_df[i],
...                                                      features_df[j])
...     return item_similarities

>>> first_items = item_collab_filter(first_features.loc[:, 0:1000])
Why does it take so long for us to calculate the item similarities using only two features? We are computing the cosine similarity between every pair of columns of a 10,399 × 1,000 matrix using a nested for loop, and the time per loop increases as we add more observations to the model. Remember, this is a subset of the total available dataset, filtered for English-only papers. As we move closer to a “good” result, we’ll need to go back and test on the larger set for our best results.
How can we make this faster? Since we only need one result at a time, we can change our function so that we only calculate one item at a time, specifying the number of top results we want. We’ll do this later, as we continue to move through our experiment. For now, it is useful to see the full feature space to get an understanding of the impact of iterative work on brute-forcing our way through a real-world dataset.
We need to get a better idea of how these features will translate to us getting a good recommendation. Do we have enough observations to move forward? Let’s plot a heatmap (Example 9-4) to see if we have any papers that are similar to each other. Figure 9-3 shows the result.
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns
>>> import numpy as np
>>> %matplotlib inline
>>> sns.set()

>>> ax = sns.heatmap(first_items.fillna(0),
...                  vmin=0, vmax=1,
...                  cmap="YlGnBu",
...                  xticklabels=250, yticklabels=250)
>>> ax.tick_params(labelsize=12)
Darker pixels signal items that are similar to one another. The dark diagonal line shows that the cosine similarity is correctly indicating that each paper is most similar to itself. However, because there are a lot of NaNs for one of our features, the line is broken along the diagonal. We can see that while most of the items are not similar to one another—i.e., our dataset is fairly diverse—there are some other high-scoring candidates. These may or may not be good recommendations qualitatively, but at least we can see that our methods are not so mad.
Example 9-5 shows how to translate these item similarities into a recommendation. The good news is that we have a wide variety of features still available, with lots of room for improvement.
>>> def paper_recommender(paper_ix, items_df):
...     print('Based on the paper:\nindex = ', paper_ix)
...     print(model_df.iloc[paper_ix])
...     top_results = items_df.loc[paper_ix].sort_values(ascending=False).head(4)
...     print('\nTop three results: ')
...     order = 1
...     for i in top_results.index.tolist()[-3:]:
...         print(order, '. Paper index = ', i)
...         print('Similarity score: ', top_results[i])
...         print(model_df.iloc[i], '\n')
...         if order < 5:
...             order += 1

>>> paper_recommender(2, first_items)
Based on the paper:
index = 2
abstract NaN
authors [{'name': 'Jovana P. Lekovich', 'org': 'Weill ...
fos NaN
keywords NaN
title Should endometriosis be an indication for intr...
year 2015
Name: 2, dtype: object
Top three results:
1 . Paper index = 2
Similarity score: 1.0
abstract NaN
authors [{'name': 'Jovana P. Lekovich', 'org': 'Weill ...
fos NaN
keywords NaN
title Should endometriosis be an indication for intr...
year 2015
Name: 2, dtype: object
2 . Paper index = 292
Similarity score: 1.0
abstract NaN
authors [{'name': 'John C. Newton'}, {'name': 'Beers M...
fos [Wide area multilateration, Maneuvering speed,...
keywords NaN
title Automatic speed control for aircraft
year 1955
Name: 561, dtype: object
3 . Paper index = 593
Similarity score: 1.0
abstract This paper demonstrates that on‐site greywater...
authors [{'name': 'Eran Friedler', 'org': 'Division of...
fos [Public opinion, Environmental Engineering, Wa...
keywords [economic analysis, tratamiento desperdicios, ...
title The water saving potential and the socio-econo...
year 2008
Name: 1152, dtype: object
Yikes. The good news is that the most similar paper returned is the one we are looking for. The bad news is that the next two papers don’t seem to be very close to our initial search, even for the features we have chosen.
“Yes, yes,” you may say, “but this is the era of Big Data! That will solve our problems! Can’t we just push more data through for better results?” Potentially. But even Big Data cannot compensate for poor data and engineering choices.
Our current brute-force methods are too slow for smart, iterative engineering. Let’s try some of our new feature engineering tricks to see if we can speed up the computation time and find better features and a better way to search for results.
The initial approach of creating a large, sparse array and shoving it through a filter can be improved in many ways. The next steps will focus specifically on applying better techniques to the two initial features and altering the item-based collaborative filter method for faster iteration.
First, it is time to try out some of those great feature engineering tricks for the two variables in our hypothesis. Looking deeper into the features already developed, we can choose techniques that will address each type of variable and convert it to a “better” feature for our recommendation system.
Let’s focus on the year first. In “Quantization or Binning”, we reviewed how using raw counts for features can be problematic for methods using similarity metrics. Example 9-6 (and Figure 9-5) will examine how we can transform 'year' to better fit the model we have selected.
>>> print("Year spread: ", model_df['year'].min(), " - ", model_df['year'].max())
>>> print("Quantile spread:\n", model_df['year'].quantile([0.25, 0.5, 0.75]))
Year spread: 1831 - 2017
Quantile spread:
 0.25    1990.0
0.50    2005.0
0.75    2012.0
Name: year, dtype: float64

# plot years to see the distribution
>>> fig, ax = plt.subplots()
>>> model_df['year'].hist(ax=ax,
...                       bins=model_df['year'].max() - model_df['year'].min())
>>> ax.tick_params(labelsize=12)
>>> ax.set_xlabel('Year Count', fontsize=12)
>>> ax.set_ylabel('Occurrence', fontsize=12)
We can see from the skewed distribution (Figure 9-5) that this is an excellent candidate for binning.
The bins will be based on ranges within the variable, rather than the unique number of features. To further reduce the feature space, we will dummy-code the resultant bins (see Example 9-7). Pandas can do both using built-in functions. These methods will make our results easy to interpret, so we can do a quick check of the transformed features before moving on (see Figure 9-6).
# binning here (by 10 years) reduces the year feature space from 156 to 19
>>> bins = int(round((model_df['year'].max() - model_df['year'].min()) / 10))
>>> temp_df = pd.DataFrame(index=model_df.index)
>>> temp_df['yearBinned'] = pd.cut(model_df['year'].tolist(), bins, precision=0)
>>> X_yrs = pd.get_dummies(temp_df['yearBinned'])
>>> X_yrs.columns.categories
IntervalIndex([(1831.0, 1841.0], (1841.0, 1851.0], (1851.0, 1860.0],
               (1860.0, 1870.0], (1870.0, 1880.0] ... (1968.0, 1978.0],
               (1978.0, 1988.0], (1988.0, 1997.0], (1997.0, 2007.0],
               (2007.0, 2017.0]]
              closed='right',
              dtype='interval[float64]')

# plot the new distribution
>>> fig, ax = plt.subplots()
>>> X_yrs.sum().plot.bar(ax=ax)
>>> ax.tick_params(labelsize=8)
>>> ax.set_xlabel('Binned Years', fontsize=12)
>>> ax.set_ylabel('Counts', fontsize=12)
We have preserved the underlying distribution of the original variable through binning by decades. If we desired to use a method that would benefit from a different distribution, we could alter our binning choices to change how this variable presents itself to the model. Since we are using cosine similarity, this is fine. Let’s move on to the next feature we originally included in our model.
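For example, if we wanted each bin to hold roughly the same number of papers instead of spanning equal decade-wide ranges, quantile binning is one alternative (a minimal sketch; the choice of ten bins here is arbitrary and not used elsewhere in this chapter):

>>> # quantile binning: roughly equal-count bins rather than equal-width decades
>>> yr_quantiles = pd.qcut(model_df['year'], q=10, duplicates='drop')
>>> X_yrs_q = pd.get_dummies(yr_quantiles)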
The fields-of-study feature space contributed significantly to the original model’s size and processing time.
Let’s examine the work we have already done. By parsing out the lists of strings, we created a “bag-of-phrases” in the first pass. Since we already have a useful sparse array, we can focus on a more efficient data type. Example 9-8 illustrates how converting from a Pandas DataFrame to a NumPy array affects the size of the object we pass around, and with it the downstream computation time.
>>> X_fos = fos_features.values

# We can see how this will make a difference in the future by looking
# at the size of each
>>> print('Our pandas Series, in bytes: ', getsizeof(fos_features))
>>> print('Our hashed numpy array, in bytes: ', getsizeof(X_fos))
Our pandas Series, in bytes: 2530632380
Our hashed numpy array, in bytes: 112
Much better! Putting it back together, we’ll pipe our features together (Example 9-9) and rerun our recommender (Example 9-10) to see if we have improved results, taking advantage of scikit-learn’s cosine similarity function. We will also reduce the computational time by only focusing on one item at a time.
>>> second_features = np.append(X_fos, X_yrs, axis=1)
>>> print("The power of feature engineering saves us, in bytes: ",
...       getsizeof(first_features) - getsizeof(second_features))
The power of feature engineering saves us, in bytes: 168066769
>>> from sklearn.metrics.pairwise import cosine_similarity

>>> def piped_collab_filter(features_matrix, index, top_n):
...     item_similarities = \
...         1 - cosine_similarity(features_matrix[index:index + 1],
...                               features_matrix).flatten()
...     related_indices = \
...         [i for i in item_similarities.argsort()[::-1] if i != index]
...     return [(index, item_similarities[index])
...             for index in related_indices
...             ][0:top_n]

>>> def paper_recommender(items_df, paper_ix, top_n):
...     if paper_ix in model_df.index:
...         print('Based on the paper:')
...         print('Paper index = ', model_df.loc[paper_ix].name)
...         print('Title :', model_df.loc[paper_ix]['title'])
...         print('FOS :', model_df.loc[paper_ix]['fos'])
...         print('Year :', model_df.loc[paper_ix]['year'])
...         print('Abstract :', model_df.loc[paper_ix]['abstract'])
...         print('Authors :', model_df.loc[paper_ix]['authors'], '\n')
...         # define the location index for the DataFrame index requested
...         array_ix = model_df.index.get_loc(paper_ix)
...         top_results = piped_collab_filter(items_df, array_ix, top_n)
...         print('\nTop', top_n, 'results: ')
...         order = 1
...         for i in range(len(top_results)):
...             print(order, '. Paper index = ',
...                   model_df.iloc[top_results[i][0]].name)
...             print('Similarity score: ', top_results[i][1])
...             print('Title :', model_df.iloc[top_results[i][0]]['title'])
...             print('FOS :', model_df.iloc[top_results[i][0]]['fos'])
...             print('Year :', model_df.iloc[top_results[i][0]]['year'])
...             print('Abstract :', model_df.iloc[top_results[i][0]]['abstract'])
...             print('Authors :', model_df.iloc[top_results[i][0]]['authors'],
...                   '\n')
...             if order < top_n:
...                 order += 1
...     else:
...         print('Whoops! Choose another paper. Try something from here:\n',
...               model_df.index[100:200])

>>> paper_recommender(second_features, 2, 3)
Based on the paper:
Paper index = 2
Title : Should endometriosis be an indication for intracytoplasmic sperm inject ...
FOS : nan
Year : 2015
Abstract : nan
Authors : [{'name': 'Jovana P. Lekovich', 'org': 'Weill Cornell Medical College, ...
Top 3 results:
1 . Paper index = 10055
Similarity score: 1.0
Title : [Diagnosis of cerebral tumors; comparative studies on arteriography, ...
FOS : ['Radiology', 'Pathology', 'Surgery']
Year : 1953
Abstract : nan
Authors : [{'name': 'Antoine'}, {'name': 'Lepoire'}, {'name': 'Schoumacker'}]
2 . Paper index = 11771
Similarity score: 1.0
Title : A Study of Special Functions in the Theory of Eclipsing Binary Systems
FOS : ['Contact binary']
Year : 1981
Abstract : nan
Authors : [{'name': 'Filaretti Zafiropoulos', 'org': 'University of Manchester'}]
3 . Paper index = 11773
Similarity score: 1.0
Title : Studies of powder flow using a recording powder flowmeter and measure ...
FOS : nan
Year : 1985
Abstract : This paper describes the utility of the dynamic measurement of the ...
Authors : [{'name': 'Ramachandra P. Hegde', 'org': 'Department of Pharmacy, ...
To be honest, I don’t think our feature selection is working out too well. There is a lot of missing data in these fields. Let’s keep going to see if we can choose richer features with more information.
Our experiment thus far is not supporting the original hypothesis that year and fields-of-study would be sufficient to recommend a similar paper. At this point, we have a few options: push more of the available data through the model, spend more time exploring the raw data, or add more features to the current model.
The first option makes the assumption that the problem is in our sampling of the data. This might be the case, but is similar to Figure 9-4’s analogy of stirring the data pile for better results.
The second option would give a better idea of the underlying raw data. This should be continually revisited based on how your decisions for features and model selection change during the exploration process. The initial subsample chosen here reflects this step. Since we have more variables available in the dataset, we will not go back here yet.
This leaves the third option, moving forward on our current model by adding more features. Providing more information about each item can improve the similarity scores and result in better recommendations.
Based on our initial exploration, the next steps will focus on the fields with the most information, abstract and authors.
Looking back at Chapter 4, we can see that abstract is a good candidate for tf-idf to filter through the noise and find the salient associative words. We do this in Example 9-12.
# need to fill in NaN for sklearn use in future
>>> filled_df = model_df.fillna('None')

>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5,
...                              stop_words='english')
>>> X_abstract = vectorizer.fit_transform(filled_df['abstract'])
>>> third_features = np.append(second_features, X_abstract.toarray(), axis=1)
We can reduce the computational load of the messy, uneven authors field by wrangling it into a list of dictionaries and then running it through a one-hot encoder, as shown in Example 9-13.
>>> authors_list = []
>>> for row in filled_df.authors.itertuples():
...     # create a dictionary from each Series index
...     if type(row.authors) is str:
...         y = {'None': row.Index}
...     if type(row.authors) is list:
...         # add these keys + values to our running dictionary
...         y = dict.fromkeys(row.authors[0].values(), row.Index)
...     authors_list.append(y)

>>> authors_list[0:5]
[{'None': 0},
 {'Ahmed M. Alluwaimi': 1},
 {'Jovana P. Lekovich': 2, 'Weill Cornell Medical College, New York, NY': 2},
 {'George C. Sponsler': 5},
 {'M. T. Richards': 7}]

>>> from sklearn.feature_extraction import DictVectorizer
>>> v = DictVectorizer(sparse=False)
>>> D = authors_list
>>> X_authors = v.fit_transform(D)
>>> fourth_features = np.append(third_features, X_authors, axis=1)
Time to check in with the recommender to see how these new features are working out. Example 9-14 shows the results.
>>> paper_recommender(fourth_features, 2, 3)
Based on the paper:
Paper index = 2
Title : Should endometriosis be an indication for intracytoplasmic sperm inject ...
FOS : nan
Year : 2015
Abstract : nan
Authors : [{'name': 'Jovana P. Lekovich', 'org': 'Weill Cornell Medical College, ...
Top 3 results:
1 . Paper index = 10055
Similarity score: 1.0
Title : [Diagnosis of cerebral tumors; comparative studies on arteriography, ...
FOS : ['Radiology', 'Pathology', 'Surgery']
Year : 1953
Abstract : nan
Authors : [{'name': 'Antoine'}, {'name': 'Lepoire'}, {'name': 'Schoumacker'}]
2 . Paper index = 5601
Similarity score: 1.0
Title : 633 Survival after coronary revascularization, with and without mitral ...
FOS : ['Cardiology']
Year : 2005
Abstract : nan
Authors : [{'name': 'J.B. Le Polain De Waroux'}, {'name': 'Anne-Catherine ...
3 . Paper index = 12256
Similarity score: 1.0
Title : Nucleotide Sequence and Analysis of an Insertion Sequence from Bacillus ...
FOS : ['Biology', 'Molecular biology', 'Insertion sequence', 'Nucleic acid ...
Year : 1994
Abstract : A 5.8-kb DNA fragment encoding the cryIC gene from Bacillus thur...
Authors : [{'name': 'Geoffrey P. Smith'}, {'name': 'David J. Ellar'}, {'name': ...
Even accounting for missing data in certain fields, the top three results from the last round of feature engineering are directing us to other papers in the medical field.
The range of papers represented in this dataset is broad; for example, a random sample of papers exposed fields of study such as “Coupling constant,” “Evapotranspiration,” “Hash function,” “IVMS,” “Meditation,” “Pareto analysis,” “Second-generation wavelet transform,” “Slip,” and “Spiral galaxy.” Given that there are 7,604 unique fields of study listed for 10K+ papers, these last results seem to be moving in the right direction. We can be confident that our work is progressing toward a useful model.
Continued iteration on more text variables, such as finding the noun phrases of the paper titles or stemming the keywords, could bring us even closer to a “best” recommendation.
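As one hedged sketch of that direction (NLTK's Snowball stemmer is an outside assumption here, not something used elsewhere in this chapter), the keyword phrases could be stemmed before one-hot encoding so that variants like “measurement” and “measurements” collapse into a single feature:

>>> from nltk.stem.snowball import SnowballStemmer
>>> stemmer = SnowballStemmer('english')

>>> def stem_keywords(kw):
...     # keywords arrive as lists of phrases; anything else (e.g., filled NaN) is skipped
...     if type(kw) is not list:
...         return []
...     return [' '.join(stemmer.stem(word) for word in phrase.split())
...             for phrase in kw]

>>> stemmed_keywords = filled_df['keywords'].apply(stem_keywords)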
It should be noted here that this definition of “best” is the Holy Grail of all recommenders and search engines alike. We are searching for what a user will find most helpful, which may or may not be directly represented by the data. Feature engineering allows us to abstract salient features into representations such that algorithms can expose both the explicit and implicit information contained therein.
As you can see, building models for machine learning is easy. Building good models for useful results takes time and work. Here we hiked through the messy process of examining a collection of possible variables and experimenting with different feature engineering methods to achieve better results. We define “better” not just in terms of good outcomes from training and testing, but also in terms of reducing the size of the model and the time it takes to iterate over different experiments.
We started this book by talking about how mastery of a subject comes from deeply learning the principles at work, in order to gain intuition to effectively put your knowledge to work. We hope that our work has given you the necessary tools to become more efficient and effective, as well as enriched your mathematical and computational understanding of how feature engineering is an essential skill to develop useful machine learning models.
Sarwar, Badrul, George Karypis, Joseph Konstan, and John Riedl. “Item-Based Collaborative Filtering Recommendation Algorithms.” Proceedings of the 10th International Conference on the World Wide Web (2001): 285–295.
Sinha, Arnab, Zhihong Shen, Yang Song, Hao Ma, Darrin Eide, Bo-June (Paul) Hsu, and Kuansan Wang. “An Overview of Microsoft Academic Service (MAS) and Applications.” Proceedings of the 24th International Conference on the World Wide Web (2015): 243–246.
Tang, Jie, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. “ArnetMiner: Extraction and Mining of Academic Social Networks.” Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2008): 990–998.
Wickham, Hadley. “Tidy Data.” The Journal of Statistical Software 59 (2014).