Acquiring data and generating features

For easier reference, we will implement the code for generating features here rather than in later sections. We will start with obtaining the dataset we need for our project.

Throughout the project, we acquire stock index price and performance data from Yahoo Finance. For example, on the Historical Data page, https://finance.yahoo.com/quote/%5EDJI/history?p=%5EDJI, we can change Time Period to Dec 01, 2005 - Dec 10, 2005, select Historical Prices in Show and Daily in Frequency (or open this link directly: https://finance.yahoo.com/quote/%5EDJI/history?period1=1133413200&period2=1134190800&interval=1d&filter=history&frequency=1d), and then click the Apply button. Click the Download data button to download the data, and name the file 20051201_20051210.csv.

We can load the data we just downloaded as follows:

>>> import pandas as pd
>>> mydata = pd.read_csv('20051201_20051210.csv', index_col='Date')
>>> mydata
               Open         High         Low          Close 
Date
2005-12-01 10806.030273 10934.900391 10806.030273 10912.570312
2005-12-02 10912.009766 10921.370117 10861.660156 10877.509766
2005-12-05 10876.950195 10876.950195 10810.669922 10835.009766
2005-12-06 10835.410156 10936.200195 10835.410156 10856.860352
2005-12-07 10856.860352 10868.059570 10764.009766 10810.910156
2005-12-08 10808.429688 10847.250000 10729.669922 10755.120117
2005-12-09 10751.759766 10805.950195 10729.910156 10778.580078

              Volume    Adjusted Close
Date
2005-12-01 256980000.0   10912.570312
2005-12-02 214900000.0   10877.509766
2005-12-05 237340000.0   10835.009766
2005-12-06 264630000.0   10856.860352
2005-12-07 243490000.0   10810.910156
2005-12-08 253290000.0   10755.120117
2005-12-09 238930000.0   10778.580078

Note that the output is a pandas DataFrame object. The Date column is the index, and the remaining columns are the corresponding financial variables. If you have not installed pandas, a powerful package designed to simplify the analysis of relational (table-like) data, you can do so with the following command:

 pip install pandas
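
To try the loading step without downloading anything, the following minimal sketch reads a small CSV from an in-memory string. The two rows are copied from the preceding output and trimmed to two columns. Passing parse_dates=True is our own addition, not part of the original command; it converts the index to a DatetimeIndex, which can make date-based selection easier later on.

```python
import io
import pandas as pd

# Two rows copied from the output above, trimmed to two columns
csv_text = """Date,Open,Close
2005-12-01,10806.030273,10912.570312
2005-12-02,10912.009766,10877.509766
"""

# parse_dates=True is an assumption added here; the original call omits it
sample = pd.read_csv(io.StringIO(csv_text), index_col='Date', parse_dates=True)

print(type(sample.index).__name__)
print(sample.loc[pd.Timestamp('2005-12-01'), 'Close'])
```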

Next, we implement feature generation by starting with a sub-function that directly creates features from the original six features, as follows:

>>> def add_original_feature(df, df_new):
...     df_new['open'] = df['Open']
...     df_new['open_1'] = df['Open'].shift(1)
...     df_new['close_1'] = df['Close'].shift(1)
...     df_new['high_1'] = df['High'].shift(1)
...     df_new['low_1'] = df['Low'].shift(1)
...     df_new['volume_1'] = df['Volume'].shift(1)
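
The shift(1) calls are what turn today's values into "previous day" features, so that each row only uses information available before that day's close. As a small illustration on made-up prices (the numbers below are hypothetical, not DJIA data):

```python
import pandas as pd

# Hypothetical closing prices on four consecutive days
close = pd.Series([10.0, 11.0, 12.0, 13.0],
                  index=pd.date_range('2005-12-01', periods=4, freq='D'),
                  name='Close')

# shift(1) moves every value down one row; the first row has no predecessor
lagged = close.shift(1)

print(pd.DataFrame({'close': close, 'close_1': lagged}))
```

The first row of close_1 is NaN because there is no earlier day to take a value from.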

Then we develop a sub-function that generates six features related to average close prices:

>>> def add_avg_price(df, df_new):
...     df_new['avg_price_5'] = df['Close'].rolling(5).mean().shift(1)
...     df_new['avg_price_30'] = df['Close'].rolling(21).mean().shift(1)
...     df_new['avg_price_365'] = df['Close'].rolling(252).mean().shift(1)
...     df_new['ratio_avg_price_5_30'] = (
...         df_new['avg_price_5'] / df_new['avg_price_30'])
...     df_new['ratio_avg_price_5_365'] = (
...         df_new['avg_price_5'] / df_new['avg_price_365'])
...     df_new['ratio_avg_price_30_365'] = (
...         df_new['avg_price_30'] / df_new['avg_price_365'])
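
To see what rolling(...).mean().shift(1) produces, here is a minimal sketch on a toy series with a window of 3 (chosen only to keep the example short; the function above uses 5, 21, and 252):

```python
import pandas as pd

close = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Mean of the current and two preceding values, then lagged by one row,
# so row i holds the average of rows i-3, i-2, and i-1
avg_3 = close.rolling(3).mean().shift(1)

print(avg_3)
```

The first three rows are NaN: two because the window is not yet full, and one more because of the shift.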

Similarly, a sub-function that generates six features related to average volumes is as follows:

>>> def add_avg_volume(df, df_new):
...     df_new['avg_volume_5'] = df['Volume'].rolling(5).mean().shift(1)
...     df_new['avg_volume_30'] = df['Volume'].rolling(21).mean().shift(1)
...     df_new['avg_volume_365'] = (
...         df['Volume'].rolling(252).mean().shift(1))
...     df_new['ratio_avg_volume_5_30'] = (
...         df_new['avg_volume_5'] / df_new['avg_volume_30'])
...     df_new['ratio_avg_volume_5_365'] = (
...         df_new['avg_volume_5'] / df_new['avg_volume_365'])
...     df_new['ratio_avg_volume_30_365'] = (
...         df_new['avg_volume_30'] / df_new['avg_volume_365'])

As for the standard deviation, we develop the following sub-function for the price-related features:

>>> def add_std_price(df, df_new):
...     df_new['std_price_5'] = df['Close'].rolling(5).std().shift(1)
...     df_new['std_price_30'] = df['Close'].rolling(21).std().shift(1)
...     df_new['std_price_365'] = df['Close'].rolling(252).std().shift(1)
...     df_new['ratio_std_price_5_30'] = (
...         df_new['std_price_5'] / df_new['std_price_30'])
...     df_new['ratio_std_price_5_365'] = (
...         df_new['std_price_5'] / df_new['std_price_365'])
...     df_new['ratio_std_price_30_365'] = (
...         df_new['std_price_30'] / df_new['std_price_365'])
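
One detail worth knowing is that the rolling std() in pandas computes the sample standard deviation (ddof=1) by default. The following sketch, on toy numbers with a window of 3, compares it with the population version (ddof=0); which one to use is a modeling choice, and the function above simply keeps the default:

```python
import pandas as pd

close = pd.Series([1.0, 2.0, 3.0, 5.0])

# Default: sample standard deviation (divides by n - 1)
std_sample = close.rolling(3).std()
# Population standard deviation (divides by n), shown for comparison only
std_population = close.rolling(3).std(ddof=0)

print(pd.DataFrame({'sample': std_sample, 'population': std_population}))
```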

Similarly, a sub-function that generates six volume-based standard deviation features is as follows:

>>> def add_std_volume(df, df_new):
...     df_new['std_volume_5'] = df['Volume'].rolling(5).std().shift(1)
...     df_new['std_volume_30'] = df['Volume'].rolling(21).std().shift(1)
...     df_new['std_volume_365'] = df['Volume'].rolling(252).std().shift(1)
...     df_new['ratio_std_volume_5_30'] = (
...         df_new['std_volume_5'] / df_new['std_volume_30'])
...     df_new['ratio_std_volume_5_365'] = (
...         df_new['std_volume_5'] / df_new['std_volume_365'])
...     df_new['ratio_std_volume_30_365'] = (
...         df_new['std_volume_30'] / df_new['std_volume_365'])

And seven return-based features are generated using the following sub-function:

>>> def add_return_feature(df, df_new):
...     df_new['return_1'] = ((df['Close'] - df['Close'].shift(1)) /
...                           df['Close'].shift(1)).shift(1)
...     df_new['return_5'] = ((df['Close'] - df['Close'].shift(5)) /
...                           df['Close'].shift(5)).shift(1)
...     df_new['return_30'] = ((df['Close'] - df['Close'].shift(21)) /
...                            df['Close'].shift(21)).shift(1)
...     df_new['return_365'] = ((df['Close'] - df['Close'].shift(252)) /
...                             df['Close'].shift(252)).shift(1)
...     df_new['moving_avg_5'] = (
...         df_new['return_1'].rolling(5).mean().shift(1))
...     df_new['moving_avg_30'] = (
...         df_new['return_1'].rolling(21).mean().shift(1))
...     df_new['moving_avg_365'] = (
...         df_new['return_1'].rolling(252).mean().shift(1))
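
The return expressions above are ordinary percentage changes. As a hedged sketch on hypothetical prices, the manual formula can be compared with the built-in pct_change method, which should give the same values up to floating-point rounding:

```python
import numpy as np
import pandas as pd

# Hypothetical closing prices
close = pd.Series([100.0, 110.0, 99.0, 108.9])

# Manual one-day return, as written in add_return_feature (before the lag)
manual = (close - close.shift(1)) / close.shift(1)
# Equivalent built-in computation
builtin = close.pct_change(periods=1)

print(np.allclose(manual, builtin, equal_nan=True))

# Lagging by one row gives the value stored in the return_1 feature
return_1 = builtin.shift(1)
print(return_1)
```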

Finally, we put together the main feature generation function that calls all preceding sub-functions:

>>> def generate_features(df):
...     """
...     Generate features for a stock/index based on historical price
...     and performance
...     @param df: dataframe with columns "Open", "Close", "High",
...                "Low", "Volume", "Adjusted Close"
...     @return: dataframe, data set with new features
...     """
...     df_new = pd.DataFrame()
...     # 6 original features
...     add_original_feature(df, df_new)
...     # 31 generated features
...     add_avg_price(df, df_new)
...     add_avg_volume(df, df_new)
...     add_std_price(df, df_new)
...     add_std_volume(df, df_new)
...     add_return_feature(df, df_new)
...     # the target
...     df_new['close'] = df['Close']
...     df_new = df_new.dropna(axis=0)
...     return df_new

Note that the window sizes here are 5, 21, and 252 rather than 7, 30, and 365 for the weekly, monthly, and yearly windows. This is because there are roughly 252 trading days in a year, 21 in a month, and 5 in a week. The feature names nevertheless keep the calendar-based suffixes _5, _30, and _365.
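
A practical consequence of the 252-day window is that the yearly features are undefined for roughly the first year of data, and dropna removes those rows. The following sketch uses synthetic, steadily increasing prices (hypothetical data, for illustration only) to find the first row at which two of the yearly features become available:

```python
import numpy as np
import pandas as pd

# Synthetic, strictly increasing "prices" with a default 0..299 integer index
close = pd.Series(np.arange(1.0, 301.0))

# Same expressions as in add_avg_price and add_return_feature
avg_price_365 = close.rolling(252).mean().shift(1)
return_1 = ((close - close.shift(1)) / close.shift(1)).shift(1)
moving_avg_365 = return_1.rolling(252).mean().shift(1)

# Position of the first non-NaN value of each feature
print(avg_price_365.first_valid_index())   # 252
print(moving_avg_365.first_valid_index())  # 254
```

Because moving_avg_365 is built on the already lagged return_1 and then lagged again, it needs the longest warm-up, so with these definitions we would expect generate_features to discard about the first 254 rows of the raw data.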

We can apply this feature engineering strategy to the DJIA data queried from 1988 to 2016 as follows:

>>> data_raw = pd.read_csv('19880101_20161231.csv', index_col='Date')
>>> data = generate_features(data_raw)

Take a look at what the data with the new features looks like:

>>> print(data.round(decimals=3).head(5))

The preceding command prints the first five rows of the new dataset, with values rounded to three decimal places.

Since all features and driving factors are ready, we will now focus on regression algorithms that estimate a continuous target variable based on these predictive features.
