Dummy coding

Dummy coding is a method for turning a categorical predictor into a set of binary indicator columns. For example, if we had one column holding each student's favorite class, we would turn each class into its own column and then place a 1 in the column matching that student's favorite class and a 0 in all the others, as seen in the following diagram:

Source: http://www.statisticssolutions.com/dummy-coding-the-how-and-why/

Once that is done, the next step is to drop one of those columns. The dropped column becomes the base case, and all the other cases are compared to it. In our IPO example using months as predictors, we will drop January, for example, and then all the other months will be judged against January's performance. The same goes for the days of the week or any other categorical predictor. We drop a column to prevent multicollinearity: if every category keeps its own column, the dummy columns in each row always sum to 1, so any one of them is fully determined by the others. That redundancy means the coefficients cannot be estimated uniquely, which would have a negative impact on the explanatory power of the model.
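The two steps above, creating one indicator column per category and then dropping the base case, can be sketched with pandas. This is only a minimal illustration on a made-up `month` column; the toy data and the names `df`, `dummies`, and `X_month` are assumptions for the example, and this is not necessarily how the `X` used in this chapter was built.

```python
import pandas as pd

# Hypothetical toy data: the month in which each IPO took place.
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Jan", "Mar"]})

# Step 1: one indicator column per month, holding 1 where the row matches.
dummies = pd.get_dummies(df["month"], dtype=int)

# Step 2: drop January explicitly so that it becomes the base case.
# (drop_first=True would instead drop the alphabetically first category.)
X_month = dummies.drop(columns="Jan")

print(X_month)
```

A row of all zeros in `X_month` now stands for January, and the coefficient on each remaining column is read relative to that base case.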

Let's take a look at what this coding looks like by running the following in a Jupyter cell:

X 

The preceding code generates the following output:

Now that we have both our X and y, we are ready to fit our model. We are going to use a very basic train/test split and simply train our model on all but the last 200 IPOs:

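from sklearn.linear_model import LogisticRegression

# Train on everything except the last 200 IPOs
X_train = X[:-200]
y_train = y[:-200]

# Hold out the last 200 IPOs for testing
X_test = X[-200:]
y_test = y[-200:]

# Fit a logistic regression with scikit-learn's default settings
clf = LogisticRegression()
clf.fit(X_train, y_train)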

And, with that, we have our model. Let's examine the performance of this very simple model.
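As a rough, self-contained sketch of the same positional split followed by a basic accuracy check, the following uses synthetic data in place of the IPO features. Everything with a `_demo` suffix, the array sizes, and the 20% label noise are assumptions made for illustration; the numbers it prints say nothing about the IPO model itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a dummy-coded design matrix (assumption, not IPO data).
rng = np.random.default_rng(0)
n_rows = 1000
X_demo = rng.integers(0, 2, size=(n_rows, 5))

# The label follows the first column, with roughly 20% of labels flipped as noise.
flip = rng.random(n_rows) < 0.2
y_demo = np.where(flip, 1 - X_demo[:, 0], X_demo[:, 0])

# Same positional split as above: train on all but the last 200 rows.
X_train_demo, X_test_demo = X_demo[:-200], X_demo[-200:]
y_train_demo, y_test_demo = y_demo[:-200], y_demo[-200:]

clf_demo = LogisticRegression()
clf_demo.fit(X_train_demo, y_train_demo)

# score() returns mean accuracy on the held-out rows.
accuracy = clf_demo.score(X_test_demo, y_test_demo)
print(f"Held-out accuracy: {accuracy:.3f}")
```

Splitting by position rather than at random keeps the most recent rows as the test set, which matters when the rows are ordered in time.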
