Getting started with feature engineering

When approaching a machine learning problem, the first question to ask is usually which features, or predictive variables, are available.

The factors used to predict the future close price of the DJIA include historical and current open prices as well as historical performance (high, low, and volume). Note that current, same-day performance (high, low, and volume) shouldn't be included, because we can't know the highest and lowest traded prices or the total number of shares traded before the market closes on that day.

Predicting the close price with only the preceding four indicators doesn't seem promising and might lead to underfitting, so we need to generate more features in order to increase predictive power. To recap, in machine learning, feature engineering is the process of creating domain-specific features from existing features in order to improve the performance of a machine learning algorithm. Feature engineering usually requires sufficient domain knowledge and can be difficult and time-consuming. In practice, the features used to solve a machine learning problem are usually not directly available and need to be specifically designed and constructed, for example, term frequency or tf-idf features in spam email detection and newsgroup classification. Hence, feature engineering is essential in machine learning and is usually where we spend the most effort when solving a practical problem.

When making an investment decision, investors usually look at historical prices over a period of time, not just the price the day before. Therefore, in our stock price prediction case, we can compute the average close price over the past week (five trading days), over the past month, and over the past year as three new features. We can also customize the time window to any size we want, such as the past quarter or the past six months. On top of these three averaged price features, we can generate new features associated with the price trend by computing the ratio between each pair of average prices in the three time frames, for instance, the ratio between the average price over the past week and that over the past year. Besides prices, volume is another important factor that investors analyze. Similarly, we can generate new volume-based features by computing the average volumes in several different time frames and the ratios between each pair of averaged values.
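The averaged features and their ratios described above can be sketched with pandas rolling windows. This is a minimal sketch under stated assumptions, not the chapter's own implementation: the column names `Close` and `Volume`, the helper `add_avg_features`, and the 21-day and 252-day window lengths for a month and a year are all assumptions (only the five-day week comes from the text). Each feature is shifted by one row so that day i only sees data up to day i-1.

```python
import itertools

import numpy as np
import pandas as pd

# Window lengths in trading days. The five-day week comes from the text;
# 21 (month) and 252 (year) are assumed conventions.
WINDOWS = {"5": 5, "21": 21, "252": 252}


def add_avg_features(df, column, prefix):
    """Append rolling averages of `column` and the ratio of each pair.

    Each feature is shifted by one row, so the value on day i is
    computed only from days i-1 and earlier.
    """
    out = df.copy()
    for label, window in WINDOWS.items():
        out[f"avg_{prefix}_{label}"] = df[column].rolling(window).mean().shift(1)
    # Ratios between each pair of averages: 5/21, 5/252, and 21/252.
    for a, b in itertools.combinations(WINDOWS, 2):
        out[f"ratio_avg_{prefix}_{a}_{b}"] = (
            out[f"avg_{prefix}_{a}"] / out[f"avg_{prefix}_{b}"]
        )
    return out


# Synthetic data, for illustration only (not real DJIA prices).
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "Close": 100 + rng.normal(0, 1, 300).cumsum(),
    "Volume": rng.integers(1_000, 5_000, 300).astype(float),
})
features = add_avg_features(data, "Close", "price")
features = add_avg_features(features, "Volume", "volume")
print(features.dropna().head())
```

The first 252 rows contain missing values because the longest window has not filled yet, which is why the example drops them before printing.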

Besides historical averages over a time window, investors also pay close attention to stock volatility. Volatility describes the degree of variation of prices for a given stock or index over time. In statistical terms, it's basically the standard deviation of the close prices. We can easily generate new sets of features by computing the standard deviation of close prices in a particular time frame, as well as the standard deviation of volumes traded. In a similar manner, ratios between each pair of standard deviation values can be included in our engineered feature pool.
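As a rough illustration of the volatility features, the following sketch computes rolling standard deviations and their pairwise ratios. The column names, the helper `add_std_features`, and the 21-day and 252-day windows are assumptions made for the example. Note that pandas computes the sample standard deviation (`ddof=1`) by default.

```python
import itertools

import numpy as np
import pandas as pd

# Window lengths in trading days; 21 and 252 are assumed conventions.
WINDOWS = {"5": 5, "21": 21, "252": 252}


def add_std_features(df, column, prefix):
    """Append rolling standard deviations of `column` and pairwise ratios."""
    out = df.copy()
    for label, window in WINDOWS.items():
        # rolling(...).std() is the sample standard deviation (ddof=1);
        # shift(1) keeps day i's own value out of its feature.
        out[f"std_{prefix}_{label}"] = df[column].rolling(window).std().shift(1)
    for a, b in itertools.combinations(WINDOWS, 2):
        out[f"ratio_std_{prefix}_{a}_{b}"] = (
            out[f"std_{prefix}_{a}"] / out[f"std_{prefix}_{b}"]
        )
    return out


# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
data = pd.DataFrame({
    "Close": 100 + rng.normal(0, 1, 300).cumsum(),
    "Volume": rng.integers(1_000, 5_000, 300).astype(float),
})
volatility = add_std_features(data, "Close", "price")
volatility = add_std_features(volatility, "Volume", "volume")
print(volatility.dropna().head())
```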

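Last but not least, return is a significant financial metric that investors watch closely. Return is the percentage gain or loss in the close price of a stock or index over a particular period. For example, daily return and annual return are financial terms we frequently hear. The daily return is calculated as follows:

$$\text{return}_{i:i-1} = \frac{\text{price}_i - \text{price}_{i-1}}{\text{price}_{i-1}}$$

The annual return has the same form, with the price one year earlier in place of the price on the day before.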

Here, price_i is the price on the i-th day and price_{i-1} is the price on the day before. Weekly and monthly returns can be computed in a similar way. Based on daily returns, we can produce a moving average over a particular number of days. For instance, given the daily returns of the past week, return_{i:i-1}, return_{i-1:i-2}, return_{i-2:i-3}, return_{i-3:i-4}, and return_{i-4:i-5}, we can calculate the moving average over that week as their arithmetic mean:

$$\text{moving average} = \frac{\text{return}_{i:i-1} + \text{return}_{i-1:i-2} + \text{return}_{i-2:i-3} + \text{return}_{i-3:i-4} + \text{return}_{i-4:i-5}}{5}$$
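The return and moving-average calculations can be sketched as follows. The helper `add_return_features`, the `Close` column name, and the 21-day and 252-day periods for a month and a year are assumptions for illustration. As a further assumption, consistent with the earlier point about not using same-day information, every return is shifted by one row so that the feature for day i ends on the close of day i-1.

```python
import numpy as np
import pandas as pd

# Periods in trading days; 21 (month) and 252 (year) are assumed conventions.
RETURN_PERIODS = {"1": 1, "5": 5, "21": 21, "252": 252}
MOVING_AVG_WINDOWS = {"5": 5, "21": 21, "252": 252}


def add_return_features(df, column="Close"):
    """Append k-day returns and moving averages of the daily return."""
    out = df.copy()
    for label, k in RETURN_PERIODS.items():
        # (price_t - price_{t-k}) / price_{t-k}, then shifted so that
        # row i holds the return ending on day i-1.
        k_day_return = df[column] / df[column].shift(k) - 1
        out[f"return_{label}"] = k_day_return.shift(1)
    for label, window in MOVING_AVG_WINDOWS.items():
        # return_1 is already shifted, so no extra shift is needed here.
        out[f"moving_avg_{label}"] = out["return_1"].rolling(window).mean()
    return out


# Synthetic prices growing 10% per day, for illustration only.
prices = pd.DataFrame({"Close": 100 * 1.1 ** np.arange(30)})
returns = add_return_features(prices)
print(returns[["Close", "return_1", "return_5", "moving_avg_5"]].head(8))
```

With only 30 rows, the 252-day columns stay empty; on a real price history they fill in once enough days have accumulated.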

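In summary, applying these feature engineering techniques gives us the following groups of predictive variables:

  • Average close prices over the past week, month, and year, and the ratios between each pair of them
  • Average volumes over several time frames, and the ratios between each pair of them
  • Standard deviations of close prices over several time frames, and the ratios between each pair of them
  • Standard deviations of volumes over several time frames, and the ratios between each pair of them
  • Daily, weekly, monthly, and annual returns
  • Moving averages of the daily returns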

In total, we are able to generate 31 sets of features, along with the following six original features:

  • OpenPrice_i: This feature represents the open price on the current day
  • OpenPrice_{i-1}: This feature represents the open price on the past day
  • ClosePrice_{i-1}: This feature represents the close price on the past day
  • HighPrice_{i-1}: This feature represents the highest price on the past day
  • LowPrice_{i-1}: This feature represents the lowest price on the past day
  • Volume_{i-1}: This feature represents the volume on the past day
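
The six original features can be derived from raw daily data with a one-row shift. This is only a sketch: the input column names (`Open`, `High`, `Low`, `Close`, `Volume`), the output names, and the helper `add_original_features` are assumptions, and the sample values are made up.

```python
import pandas as pd


def add_original_features(df):
    """Build the six original features from raw daily columns."""
    out = pd.DataFrame(index=df.index)
    out["open"] = df["Open"]                # open price on day i
    out["open_1"] = df["Open"].shift(1)     # open price on day i-1
    out["close_1"] = df["Close"].shift(1)   # close price on day i-1
    out["high_1"] = df["High"].shift(1)     # highest price on day i-1
    out["low_1"] = df["Low"].shift(1)       # lowest price on day i-1
    out["volume_1"] = df["Volume"].shift(1) # volume on day i-1
    return out


# Made-up values, for illustration only.
raw = pd.DataFrame({
    "Open": [10.0, 11.0, 12.0],
    "High": [10.5, 11.5, 12.5],
    "Low": [9.5, 10.5, 11.5],
    "Close": [10.2, 11.2, 12.2],
    "Volume": [100.0, 200.0, 300.0],
})
original = add_original_features(raw)
print(original)
```

The first row has no previous day, so its shifted columns are missing and would typically be dropped before training.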
