Building a predictive content scoring model

Let's use what we have learned to create a model that can estimate the share counts for a given piece of content. We'll use the features we have already created, along with a number of additional ones.

Ideally, we would have a much larger sample of content—especially content that had more typical share counts—but we'll have to make do with what we have here.

We're going to be using an algorithm called random forest regression. In previous chapters, we looked at a more typical implementation of random forests that is based on classification, but here we're going to attempt to predict the share counts. We could consolidate our share classes into ranges, but it is preferable to use regression when dealing with continuous variables, which is what we're working with here.
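Before we build the real model, a minimal, self-contained sketch (on toy data, not our content dataset) illustrates the point: a random forest regressor predicts a continuous value directly, so there's no need to bucket share counts into classes first:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 2))                       # two toy features
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0, 0.5, 200)   # continuous target

reg = RandomForestRegressor(n_estimators=100, random_state=0)
reg.fit(X, y)
pred = reg.predict([[5.0, 2.0]])   # returns a float estimate, not a class label
```

The classifier variant we used in earlier chapters would force us to discretize the target first; the regressor handles it natively.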

To begin, we'll create a bare-bones model. We'll use the number of images, the site, and the word count as features, and we'll train the model against the number of Facebook likes. We'll also split our data into two sets: a training set and a test set.

First, we'll take care of our imports, and then we'll prepare our data by removing the rows with nulls, resetting our index, and finally splitting the frame into our training and test sets:

from sklearn.ensemble import RandomForestRegressor
import numpy as np

# Drop rows missing either feature, then rebuild a clean 0..n index
all_data = dfc.dropna(subset=['img_count', 'word_count'])
all_data.reset_index(inplace=True, drop=True)

# Randomly assign each row to the training or test set (~65/35 split)
train_index = []
test_index = []
for i in all_data.index:
    result = np.random.choice(2, p=[.65, .35])
    if result == 1:
        test_index.append(i)
    else:
        train_index.append(i)

We used a random number generator with probabilities of roughly two-thirds and one-third to determine which set each row (based on its index) would be placed in. Setting the probabilities like this ensures that we get approximately twice as many rows in our training set as in our test set. We can confirm this with the following code:

print('test length:', len(test_index), '\ntrain length:', len(train_index))
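As an aside, the manual loop above makes the mechanics explicit, but scikit-learn's `train_test_split` does the same job in one call. A sketch with a toy frame standing in for our data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for all_data; the real frame comes from dfc.dropna(...) above
all_data = pd.DataFrame({'img_count': range(100), 'word_count': range(100)})

# test_size=.35 mirrors the p=[.65, .35] probabilities used in the loop
train_df, test_df = train_test_split(all_data, test_size=.35, random_state=0)
```

Unlike the probabilistic loop, this produces exact set sizes (here, 65 and 35 rows).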

The preceding code generates the following output:

Now, we'll continue preparing our data. Next, we need to encode our sites categorically. Currently, our DataFrame represents each site with a string, so we'll use dummy encoding. This creates one column per site: if a row belongs to that particular site, that column is filled with a 1, while all the other site columns are coded with a 0. Let's do that now:

sites = pd.get_dummies(all_data['site']) 
 
sites 

The preceding code generates the following output:

You can see from the preceding output how the dummy encoding appears.
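Since the table itself isn't reproduced here, a tiny toy frame (with made-up site names) shows what the encoding looks like:

```python
import pandas as pd

# Hypothetical site names, just to show the shape of the encoding
toy = pd.DataFrame({'site': ['fastcompany.com', 'bbc.com', 'fastcompany.com']})
d = pd.get_dummies(toy['site'])

# d now has one column per site, with a 1 (True in recent pandas versions)
# in the column matching each row's site and 0 (False) everywhere else
```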

We'll now continue:

# Training target and features
y_train = all_data.iloc[train_index]['fb'].astype(int)
X_train_nosite = all_data.iloc[train_index][['img_count', 'word_count']]

# Attach the site dummy columns by index
X_train = pd.merge(X_train_nosite, sites.iloc[train_index],
                   left_index=True, right_index=True)

# Test target and features, built the same way
y_test = all_data.iloc[test_index]['fb'].astype(int)
X_test_nosite = all_data.iloc[test_index][['img_count', 'word_count']]

X_test = pd.merge(X_test_nosite, sites.iloc[test_index],
                  left_index=True, right_index=True)
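A note on the merge-on-index pattern used here: `pd.merge` with `left_index=True, right_index=True` behaves like `DataFrame.join` on matching indexes. A small sketch, with toy frames standing in for the feature and dummy frames:

```python
import pandas as pd

# Toy frames sharing the same default index
feats = pd.DataFrame({'img_count': [2, 5], 'word_count': [300, 800]})
dummies = pd.DataFrame({'siteA': [1, 0], 'siteB': [0, 1]})

# Merge on the index, as in the chapter...
merged = pd.merge(feats, dummies, left_index=True, right_index=True)
# ...which is equivalent to a join on matching indexes
joined = feats.join(dummies)
```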

With that, we've set up our X_test, X_train, y_test, and y_train variables. Now, we're going to use our training data to build our model:

# 1,000 trees; training may take a little while
clf = RandomForestRegressor(n_estimators=1000)
clf.fit(X_train, y_train)

With those two lines of code, we have trained our model. Let's use it to predict the Facebook likes for our test set:

y_pred = clf.predict(X_test)
y_actual = y_test

# Predicted vs. actual values, plus the error as a fraction of the actual
deltas = pd.DataFrame(list(zip(y_pred, y_actual,
                               (y_pred - y_actual) / y_actual)),
                      columns=['predicted', 'actual', 'delta'])

deltas

This code generates the following output:

Here, we can see the predicted values, the actual values, and the difference as a percentage, side by side. Let's take a look at the descriptive stats for this:

deltas['delta'].describe() 

The preceding code generates the following output:

This looks amazing. Our median error is 0! Well, unfortunately, this isn't a particularly useful bit of information: errors fall on both sides, positive and negative, and tend to average out, which is what we're seeing here. Let's look at a more informative metric to evaluate our model: root mean square error as a percentage of the actual mean.
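As a preview, that metric can be sketched in a few lines; the toy arrays here stand in for the real `y_pred` and `y_actual`:

```python
import numpy as np

# Toy values standing in for the test-set predictions and actuals
y_actual = np.array([120., 80., 200., 50.])
y_pred = np.array([100., 90., 180., 60.])

# Root mean square error, then expressed as a percentage of the actual mean
rmse = np.sqrt(((y_pred - y_actual) ** 2).mean())
rmse_pct_of_mean = rmse / y_actual.mean() * 100
```

Because the errors are squared before averaging, positive and negative misses can no longer cancel out, which is exactly the weakness of the median delta above.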
