Feature engineering

When it comes to a machine learning algorithm, the first question to ask is usually what features are available, that is, what the predictive variables are. The driving factors used to predict future prices of the DJIA (here, the Close prices) include historical and current Open prices as well as historical performance (High, Low, and Volume). Note that current, same-day performance (High, Low, and Volume) should not be included, because we cannot know the highest and lowest prices at which the stock traded, or the total number of shares traded, before the market closes on that day.
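To make the same-day restriction concrete, here is a minimal sketch of pairing each day's Close price (the target) with only the values known before that day's close. The record layout, the feature names such as `close_1`, and the sample numbers are all assumptions made for illustration, not part of the original dataset.

```python
from typing import Dict, List, Tuple

# Hypothetical daily records, oldest first (made-up numbers).
rows: List[Dict[str, float]] = [
    {"open": 100.0, "high": 103.0, "low": 99.0, "close": 102.0, "volume": 1000.0},
    {"open": 102.5, "high": 104.0, "low": 101.0, "close": 101.5, "volume": 1200.0},
    {"open": 101.0, "high": 105.0, "low": 100.5, "close": 104.0, "volume": 900.0},
]


def build_samples(rows: List[Dict[str, float]]) -> List[Tuple[Dict[str, float], float]]:
    """Pair each day's close (target) with features available before that close."""
    samples = []
    for today, prev in zip(rows[1:], rows[:-1]):
        features = {
            "open": today["open"],       # same-day open is known at prediction time
            "open_1": prev["open"],      # everything else comes from the previous day
            "close_1": prev["close"],
            "high_1": prev["high"],
            "low_1": prev["low"],
            "volume_1": prev["volume"],
        }
        samples.append((features, today["close"]))
    return samples


samples = build_samples(rows)
```

The first day has no previous day to draw from, so it yields no sample; three records produce two training pairs.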

Predicting the close price with only these four indicators does not seem promising and might lead to underfitting, so we need to think of ways to add more features and predictive power. In machine learning, feature engineering is the process of creating domain-specific features based on existing features in order to improve the performance of a machine learning algorithm. Feature engineering requires sufficient domain knowledge, and it can be very difficult and time-consuming. In reality, the features used to solve a machine learning problem are usually not directly available and need to be specifically designed and constructed, for example, term frequency or tf-idf features in spam email detection and news classification. Hence, feature engineering is essential in machine learning, and it is usually where we spend the most effort when solving a practical problem.

When making an investment decision, investors usually look at historical prices over a period of time, not just the price the day before. Therefore, in our stock price prediction case, we can compute the average close price over the past week (five days), over the past month, and over the past year as three new features. We can also customize the time window to any size we want, such as the past quarter or the past six months. On top of these three averaged price features, we can generate new features associated with the price trend by computing the ratio between each pair of average prices in the three different time frames, for instance, the ratio between the average price over the past week and that over the past year. Besides prices, volume is another important factor that investors analyze. Similarly, we can generate new volume-based features by computing the average volumes in several different time frames and the ratios between each pair of averaged values.
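The averaging and ratio features can be sketched in plain Python as follows. The helper names `trailing_mean` and `ratio`, the toy prices, and the short windows (2 and 5 days) are assumptions chosen to keep the example readable; the same helpers would apply to volumes and to longer windows for the past month or year.

```python
from typing import List, Optional


def trailing_mean(values: List[float], window: int) -> List[Optional[float]]:
    """Mean of the `window` values strictly before each index.

    Returns None where there is not enough history, so the value at index i
    never uses values[i] itself.
    """
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i < window:
            out.append(None)
        else:
            out.append(sum(values[i - window:i]) / window)
    return out


def ratio(a: List[Optional[float]], b: List[Optional[float]]) -> List[Optional[float]]:
    """Element-wise a / b, or None where either side is missing or b is zero."""
    return [
        x / y if x is not None and y is not None and y != 0 else None
        for x, y in zip(a, b)
    ]


# Made-up close prices, oldest first.
close = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0, 16.0]
avg_2 = trailing_mean(close, 2)   # short window
avg_5 = trailing_mean(close, 5)   # "past week" of five trading days
ratio_2_5 = ratio(avg_2, avg_5)   # above 1 when recent prices exceed the longer average
```

Excluding the current day from each window keeps these features consistent with the rule that same-day information is not available at prediction time.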

Besides historical averaged values in a time window, investors also pay close attention to stock volatility. Volatility describes the degree of variation of prices for a given stock or index over time. In statistical terms, it is basically the standard deviation of the close prices. We can easily generate new sets of features by computing the standard deviation of close prices in a particular time frame, as well as the standard deviation of volumes traded. In a similar manner, the ratios between each pair of standard deviation values can be included in our engineered feature pool.
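A volatility feature of this kind might be computed as below. This is a sketch under two assumptions: the sample standard deviation is used (`statistics.stdev`, which divides by n - 1; the population version `statistics.pstdev` would be an equally reasonable choice), and the window excludes the current day. The helper name `trailing_std` and the prices are illustrative only.

```python
import statistics
from typing import List, Optional


def trailing_std(values: List[float], window: int) -> List[Optional[float]]:
    """Sample standard deviation of the `window` values strictly before each index."""
    if window < 2:
        raise ValueError("the sample standard deviation needs at least two values")
    out: List[Optional[float]] = []
    for i in range(len(values)):
        if i < window:
            out.append(None)  # not enough history yet
        else:
            out.append(statistics.stdev(values[i - window:i]))
    return out


# Made-up close prices, oldest first.
close = [10.0, 12.0, 11.0, 13.0, 12.0, 20.0]
std_3 = trailing_std(close, 3)
```

Ratios between standard deviations over different windows can then be formed element-wise, skipping positions where either value is missing.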

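Last but not least, return is a significant financial metric that investors watch closely. Return is the percentage of gain or loss of close price for a stock/index in a particular period. For example, daily return and annual return are the financial terms that we frequently hear. They are calculated as follows:

$$\text{return}_{i:i-1} = \frac{\text{price}_i - \text{price}_{i-1}}{\text{price}_{i-1}}$$

$$\text{return}_{i:i-365} = \frac{\text{price}_i - \text{price}_{i-365}}{\text{price}_{i-365}}$$

Where $\text{price}_i$ is the price on the $i$th day and $\text{price}_{i-1}$ is the price on the day before. Weekly and monthly return can be computed in a similar way. Based on daily returns, we can produce a moving average over a particular number of days. For instance, given daily returns of the past week $\text{return}_{i:i-1}$, $\text{return}_{i-1:i-2}$, $\text{return}_{i-2:i-3}$, $\text{return}_{i-3:i-4}$, $\text{return}_{i-4:i-5}$, we can calculate the moving average over that week as follows:

$$\text{MovingAvg}_{i,5} = \frac{\text{return}_{i:i-1} + \text{return}_{i-1:i-2} + \text{return}_{i-2:i-3} + \text{return}_{i-3:i-4} + \text{return}_{i-4:i-5}}{5}$$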
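As a minimal sketch, taking the return over a period to be the change in price divided by the starting price, the return and its moving average could be written as follows. The function names `period_return` and `moving_avg_return` and the sample prices are assumptions for illustration.

```python
from typing import Sequence


def period_return(prices: Sequence[float], i: int, periods: int = 1) -> float:
    """Return from day i - periods to day i: (price_i - price_start) / price_start."""
    if periods < 1 or i - periods < 0:
        raise ValueError("not enough history for this return")
    start = prices[i - periods]
    return (prices[i] - start) / start


def moving_avg_return(prices: Sequence[float], i: int, window: int) -> float:
    """Average of the `window` most recent daily returns ending on day i."""
    if window < 1 or i - window < 0:
        raise ValueError("not enough history for this window")
    daily = [period_return(prices, j, 1) for j in range(i - window + 1, i + 1)]
    return sum(daily) / window


# Made-up prices, oldest first: two gains of 10% followed by a loss of 10%.
prices = [100.0, 110.0, 121.0, 108.9]
daily_return = period_return(prices, 1)           # day 0 -> day 1
two_day_return = period_return(prices, 2, 2)      # day 0 -> day 2
avg_daily_return = moving_avg_return(prices, 3, 3)
```

Weekly, monthly, and yearly returns follow by passing a larger `periods` value; how many trading days make up each period is a modeling choice.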

In summary, we can generate the following predictive variables by applying feature engineering techniques:

  • The average close price over the past five days
  • The average close price over the past month
  • The average close price over the past year
  • The ratio between the average price over the past week and that over the past month
  • The ratio between the average price over the past week and that over the past year
  • The ratio between the average price over the past month and that over the past year
  • The average volume over the past five days
  • The average volume over the past month
  • The average volume over the past year
  • The ratio between the average volume over the past week and that over the past month
  • The ratio between the average volume over the past week and that over the past year
  • The ratio between the average volume over the past month and that over the past year
  • The standard deviation of the close prices over the past five days
  • The standard deviation of the close prices over the past month
  • The standard deviation of the close prices over the past year
  • The ratio between the standard deviation of the prices over the past week and that over the past month
  • The ratio between the standard deviation of the prices over the past week and that over the past year
  • The ratio between the standard deviation of the prices over the past month and that over the past year
  • The standard deviation of the volumes over the past five days
  • The standard deviation of the volumes over the past month
  • The standard deviation of the volumes over the past year
  • The ratio between the standard deviation of the volumes over the past week and that over the past month
  • The ratio between the standard deviation of the volumes over the past week and that over the past year
  • The ratio between the standard deviation of the volumes over the past month and that over the past year
  • Daily return of the past day
  • Weekly return of the past week
  • Monthly return of the past month
  • Yearly return of the past year
  • Moving average of the daily returns over the past week
  • Moving average of the daily returns over the past month
  • Moving average of the daily returns over the past year

In total, we are able to generate 31 new features, in addition to the following six original features:

  • Open price
  • Open price on the past day
  • Close price on the past day
  • Highest price on the past day
  • Lowest price on the past day
  • Volume on the past day
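As a quick bookkeeping check on the two lists above, the feature names can be enumerated programmatically. The naming scheme here (for example `avg_price_week` or `open_1`) is purely hypothetical; only the counts are taken from the text.

```python
from itertools import combinations

windows = ["week", "month", "year"]

engineered = []
# Four groups: average and standard deviation, each for price and volume.
for stat, column in [("avg", "price"), ("avg", "volume"), ("std", "price"), ("std", "volume")]:
    base = [f"{stat}_{column}_{w}" for w in windows]
    engineered.extend(base)
    # One ratio for each pair of time frames within the group.
    engineered.extend(f"ratio_{a}_{b}" for a, b in combinations(base, 2))

# Returns over four periods, plus moving averages of daily returns.
engineered.extend(f"return_{p}" for p in ["day", "week", "month", "year"])
engineered.extend(f"moving_avg_return_{w}" for w in windows)

original = ["open", "open_1", "close_1", "high_1", "low_1", "volume_1"]
all_features = original + engineered
```

Each of the four groups contributes three window features and three pairwise ratios, which together with four returns and three moving averages gives the 31 engineered features.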