Data cleaning – missing data

Next, we drop columns and then rows that are missing more than 20 percent of their values, and fill the remaining gaps with the column median. This removes three columns and about six percent of the observations:

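rows_before, cols_before = data.shape
# Keep columns, then rows, with at least 80 percent non-missing values;
# both thresholds are computed from the shape of the original DataFrame
data = (data
        .dropna(axis=1, thresh=int(len(data) * .8))
        .dropna(thresh=int(len(data.columns) * .8)))
# Fill the remaining gaps with each numeric column's median
data = data.fillna(data.median(numeric_only=True))
rows_after, cols_after = data.shape
print('{:,d} rows and {:,d} columns dropped'.format(rows_before - rows_after, cols_before - cols_after))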
2,985 rows and 3 columns dropped
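
To make the thresholds concrete, the following is a minimal sketch on a small made-up DataFrame. The helper name drop_sparse_then_fill, the min_share parameter, and the toy values are our own illustrative assumptions and not part of the original pipeline. Unlike the chained call above, which derives both thresholds from the original shape, this sketch computes the row threshold from the columns that remain after the column filter:

```python
import numpy as np
import pandas as pd


def drop_sparse_then_fill(df, min_share=0.8):
    """Hypothetical helper: drop sparse columns and rows, then median-fill."""
    # thresh is the minimum number of non-missing values needed to keep a column
    out = df.dropna(axis=1, thresh=int(len(df) * min_share))
    # Assumption: the row threshold uses the columns that survived the first step
    out = out.dropna(thresh=int(len(out.columns) * min_share))
    # Fill whatever is still missing with the median of each numeric column
    return out.fillna(out.median(numeric_only=True))


# Made-up toy data: 5 rows and 5 columns with deliberately placed gaps
toy = pd.DataFrame({
    'a': [1.0, 2.0, 3.0, 4.0, 5.0],           # complete
    'b': [np.nan, np.nan, 3.0, 4.0, 5.0],     # 2 of 5 missing: column is dropped
    'c': [10.0, 20.0, np.nan, 40.0, 50.0],    # 1 of 5 missing: column is kept
    'd': [1.0, 1.0, 1.0, 1.0, np.nan],        # 1 of 5 missing: column is kept
    'e': [5.0, 4.0, 3.0, 2.0, np.nan],        # 1 of 5 missing: column is kept
})

cleaned = drop_sparse_then_fill(toy)
print('{:,d} rows and {:,d} columns dropped'.format(
    toy.shape[0] - cleaned.shape[0], toy.shape[1] - cleaned.shape[1]))
print(cleaned)
```

In this toy case, the last row is missing two of its four remaining values and is dropped, whereas the third row is missing only one and is kept, with its gap in column c filled by the median of the surviving rows.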

At this point, we have 51 features and the categorical identifier of the stock:

data.sort_index(axis=1).info()

MultiIndex: 47377 entries, (2014-01-02, Equity(24 [AAPL])) to (2015-12-31, Equity(47208 [GPRO]))
Data columns (total 52 columns):
AssetToEquityRatio        47377 non-null float64
AssetTurnover             47377 non-null float64
CFO To Assets             47377 non-null float64
...
WorkingCapitalToAssets    47377 non-null float64
WorkingCapitalToSales     47377 non-null float64
stock                     47377 non-null object
dtypes: float64(51), object(1)