Dummy encoding of categorical variables

We need to convert the categorical stock variable into a numeric format so that the linear regression can process it. For this purpose, we use dummy encoding, which creates a separate column for each category level and flags the presence of that level in the original categorical column with an entry of 1, and 0 otherwise. The pandas function get_dummies() automates dummy encoding. By default, it detects and converts columns of type object; to create dummy variables for other columns, for instance those containing integers, name them using the columns keyword. The following example illustrates the default behavior:

import pandas as pd

df = pd.DataFrame({'categories': ['A', 'B', 'C']})

  categories
0          A
1          B
2          C

pd.get_dummies(df)

   categories_A  categories_B  categories_C
0             1             0             0
1             0             1             0
2             0             0             1
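
The columns keyword mentioned earlier can be sketched as follows. This is a minimal illustration, not part of the original example: the frame df_int and its column names sector_id and price are hypothetical, and dtype=int is passed only so that the result shows 0/1 entries regardless of the pandas version.

```python
import pandas as pd

# Hypothetical frame: 'sector_id' holds integer category codes,
# while 'price' is a genuine numeric feature.
df_int = pd.DataFrame({'sector_id': [10, 20, 10, 30],
                       'price': [1.5, 2.0, 2.5, 3.0]})

# Without `columns`, integer columns count as numeric and are left unchanged.
unchanged = pd.get_dummies(df_int)

# Naming the column explicitly forces dummy encoding of the integer codes.
encoded = pd.get_dummies(df_int, columns=['sector_id'], dtype=int)
print(encoded)
```

Columns that are not encoded, such as price here, are passed through as they are, and the new indicator columns are named using the original column name as a prefix.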

When you convert all categories to dummy variables and estimate the model with an intercept (as you typically would), you inadvertently create perfect multicollinearity: the dummy columns sum to the intercept column of ones, so the design matrix contains redundant information and no longer has full column rank, which makes X'X singular. It is simple to avoid this by removing one of the new indicator columns. The effect of the omitted category level is then captured by the intercept, which is the only term that contributes when every remaining category dummy is 0, and the remaining dummy coefficients measure differences relative to that baseline. Use the drop_first keyword to correct the dummy variables accordingly:

pd.get_dummies(df, drop_first=True)

   categories_B  categories_C
0             0             0
1             1             0
2             0             1
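
To make the rank argument concrete, the following sketch compares the design matrix with and without the redundant dummy column and then fits a least-squares model on the reduced matrix. The helper design_matrix and the outcome values in y are made up for illustration; only get_dummies() and drop_first come from the text above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'categories': ['A', 'B', 'C', 'A', 'B', 'C']})

def design_matrix(dummies):
    """Prepend a column of ones (the intercept) to the dummy columns."""
    intercept = np.ones((len(dummies), 1))
    return np.hstack([intercept, dummies.to_numpy(dtype=float)])

full = design_matrix(pd.get_dummies(df))                      # 1 + 3 columns
reduced = design_matrix(pd.get_dummies(df, drop_first=True))  # 1 + 2 columns

# The three dummies sum to the intercept column, so one column is redundant.
rank_full = np.linalg.matrix_rank(full)
rank_reduced = np.linalg.matrix_rank(reduced)
print(full.shape[1], rank_full)        # more columns than rank
print(reduced.shape[1], rank_reduced)  # full column rank

# Illustrative outcome: category A averages 1, B averages 2, C averages 4.
y = np.array([1.0, 2.0, 4.0, 1.0, 2.0, 4.0])
coef, *_ = np.linalg.lstsq(reduced, y, rcond=None)
print(coef)  # intercept, then the B and C offsets relative to A
```

With category A dropped, the fitted intercept equals the mean outcome for A, and the coefficients on categories_B and categories_C are the differences between those categories and A.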

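Applying this to our combined features and returns yields 181 columns. The sample contains more than 100 distinct stocks because the universe definition automatically updates the stock selection over time, and each stock other than the dropped first level gets its own dummy column: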

X = pd.get_dummies(data.drop(return_cols, axis=1), drop_first=True)
X.info()

MultiIndex: 47377 entries, (2014-01-02 00:00:00+00:00, Equity(24 [AAPL])) to (2015-12-31 00:00:00+00:00, Equity(47208 [GPRO]))
Columns: 181 entries, DividendYield to stock_YELP INC
dtypes: float64(182)
memory usage: 66.1+ MB
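
The resulting column count can be checked on a toy stand-in for the feature frame. This is only a sketch: the frame toy, its values, and the ticker labels are invented for illustration and are not the data used above, and dtype=float is passed so that all columns end up as float64.

```python
import pandas as pd

# Toy stand-in: one numeric factor plus a categorical `stock` column.
toy = pd.DataFrame({'DividendYield': [0.02, 0.00, 0.01, 0.03],
                    'stock': ['AAPL', 'GPRO', 'YELP', 'AAPL']})

X_toy = pd.get_dummies(toy, drop_first=True, dtype=float)

n_numeric = toy.select_dtypes('number').shape[1]  # columns passed through
n_levels = toy['stock'].nunique()                 # distinct stocks

# One dummy per stock except the dropped first level.
print(X_toy.shape[1], n_numeric + n_levels - 1)
print(list(X_toy.columns))
```

The same arithmetic applies to the full dataset: the number of columns equals the number of numeric features plus the number of distinct stocks minus one.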