Exploring feature engineering

Now that we have a system for comparing models fairly, with no fear of overfitting, let's think about how to improve our model. One option is to create new features that add more context. For example, we could calculate the proportion of armies on the two sides, or the absolute difference in the number of soldiers; we can't say in advance which will work better. Let's try it out with the help of the following code:

  1. First, we'll create a ratio of soldiers on either side:

     data['infantry_ratio'] = data['allies_infantry'] / data['axis_infantry']
     cols.append('infantry_ratio')

  2. We won't do the same for tanks, planes, and so on, as the numbers there are very small and we would have to deal with division by zero. Instead, we'll compute the difference in absolute numbers:

     for tp in 'infantry', 'planes', 'tanks', 'guns':
         data[f'{tp}_diff'] = data[f'allies_{tp}'] - data[f'axis_{tp}']
         cols.append(f'{tp}_diff')

  3. Now that we have created those five new features, let's run our model once again:

     scores = cross_val_score(model,
                              data[cols],
                              data['result_num'],
                              cv=4)

  4. Finally, let's print the resultant score (fingers crossed):

     >>> pd.np.mean(scores)
     0.5141774891774892

Accuracy is now at 51.4%, almost 8 percentage points better than the previous 43.7%!
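Incidentally, if we did want ratios for the sparse columns as well, zero denominators can be sidestepped by mapping them to NaN before dividing. A minimal sketch, using a hypothetical toy frame whose column names merely mirror our dataset's:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; column names only mirror the chapter's dataset
df = pd.DataFrame({'allies_tanks': [10, 0, 5],
                   'axis_tanks':   [2, 0, 0]})

# Replace zero denominators with NaN so division yields NaN instead of inf
ratio = df['allies_tanks'] / df['axis_tanks'].replace(0, np.nan)
print(ratio.tolist())  # [5.0, nan, nan]
```

The resulting NaNs can then be filled or dropped, depending on the model.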

Let's see what else we can add to the mix. One feature we haven't used yet is the leaders columns, each containing a few names. Let's count how frequently each name appears in the dataset and create a binary (one-hot) feature for the most frequently mentioned leaders. For that, we can use the Counter object we learned about in Section 1, Getting Started with Python, of this book!
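As a quick refresher, Counter tallies occurrences and its most_common(N) method returns the top N (name, count) pairs, which is exactly what we need here (the names below are illustrative):

```python
from collections import Counter

# Illustrative list of names; ties are broken by first appearance
leaders = ['Zhukov', 'Manstein', 'Zhukov', 'Rokossovsky', 'Zhukov', 'Manstein']
c = Counter(leaders)
print(c.most_common(2))  # [('Zhukov', 3), ('Manstein', 2)]
```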

Consider the following code:

from collections import Counter

def _generate_binary_most_common(col, N=2):
    # Ignore rows with no names at all
    mask = col.notnull()

    # Flatten all names into one list, stripping whitespace
    lead_list = [el.strip() for _, cell in col[mask].iteritems()
                 for el in cell if el != '']
    c = Counter(lead_list)

    # One column per top-N name, sharing the original index
    mc = c.most_common(N)
    df = pd.DataFrame(index=col.index, columns=[name[0] for name in mc])

    for name in df.columns:
        df.loc[mask, name] = col[mask].apply(lambda x: name in x).astype(int)
    return df.fillna(0)

The _generate_binary_most_common function, shown in the preceding code, generates a dataframe with the N most frequent names as columns, sharing the original data's index. All values are binary, indicating whether each name is present in the original column.

 With that, we can add our new features to the dataset. Consider the following code: 

axis_pop = _generate_binary_most_common(data['axis_leaders'].str.split(','), N=2)
allies_pop = _generate_binary_most_common(data['allies_leaders'].str.split(','), N=2)

Here, we run the function we just created on the leaders column of each side, with N=2. This results in two dataframes of two columns each, filled with binary (0 and 1) values. These values represent whether a specific leader (one of the two most common on that side) took part in each particular battle.
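To make the output concrete, here is the same one-hot logic run end to end on a tiny, made-up leaders column (the names and the compact inline version of the function are illustrative only):

```python
import pandas as pd
from collections import Counter

# Toy leaders column; names are illustrative, not from the real dataset
col = pd.Series(['Zhukov,Konev', 'Zhukov', None, 'Konev,Vatutin']).str.split(',')

mask = col.notnull()
names = [el.strip() for cell in col[mask] for el in cell if el != '']
top = [name for name, _ in Counter(names).most_common(2)]

onehot = pd.DataFrame(index=col.index, columns=top)
for name in top:
    onehot.loc[mask, name] = col[mask].apply(lambda x: name in x).astype(int)
onehot = onehot.fillna(0)
print(onehot)
```

Each row gets a 1 under a leader's column if that leader appears in the row's list, and rows with missing values become all zeros.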

Now, all we need is to add those dataframes to our features and run cross-validation one more time:

data2 = pd.concat([data, axis_pop, allies_pop], axis=1)
cols2 = cols + axis_pop.columns.tolist() + allies_pop.columns.tolist()

scores = cross_val_score(model,
                         data2[cols2],
                         data2['result_num'],
                         cv=4)

>>> pd.np.mean(scores)
0.5369047619047619

This added about 2 percentage points to our performance on average. The value N=2 was found by manual iteration; either increasing or decreasing it leads to a drop in performance.
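That kind of manual search is easy to turn into a loop: try each candidate value, score it with cross-validation, and keep the best. Since the battles dataset isn't reproduced here, this sketch uses synthetic data and varies the number of feature columns instead of N, but the pattern is the same:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for our features and target
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Sweep a candidate parameter (here: how many columns to feed the model)
for n_features in (4, 6, 8, 10):
    scores = cross_val_score(RandomForestClassifier(random_state=0),
                             X[:, :n_features], y, cv=4)
    print(n_features, scores.mean())
```

Averaging over the same folds for every candidate keeps the comparison fair, just as in our model-comparison setup.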
