Bill Gates and his team at Microsoft are to blame for some very bad statistical analysis. Their Excel program has made regression and correlation analysis far too easy to run, and too many people use it without any understanding of the process or the logic behind designing the analysis. Despite this potential monster of Microsoft's creation, regression and correlation are two very helpful analytical tools. However, they do not provide all the answers in any situation and need to be used with careful thought about the data being input, the structure of the analysis, logic, and a healthy dose of cynicism about any analytical conclusions. Both regression and correlation analysis are based on historical data; when major changes occur in how different factors interact, that data can be misleading. Even when these tools are applied to the physical sciences they are notorious for creating false positives and negatives, and this problem logically increases when they are applied to the less consistent social sciences like economics.
Simple regression uses statistics to determine whether movement in one variable (the independent variable) impacts how another variable (the dependent variable) moves. You can think of the independent variable as the cause and the dependent variable as the effect. Correlation, an output generated from regression analysis, measures the strength of the relationship between two variables. More advanced regression models can use multiple independent variables.
If a regression shows that a 1% growth rate in a country's gross domestic product (GDP) (the independent variable) is usually accompanied by a 2% increase in the price of the stock of Sunshine Corp. (the dependent variable), this could be a powerful predictive tool for investing. The relationship could be positively or negatively correlated (e.g., a negative relationship would mean a rise in GDP is accompanied by a decline in the stock price). This is a simple and unrealistic example of how regression can be used for investment decisions.
Regression uses a method known as "ordinary least squares" that generates a formula for a predicted line. The line is made up of the values the regression "predicts" for the dependent variable (plotted on the y axis) at each value of the independent variable (plotted on the x axis). The dots around the line show the actual data points. The closer the dots cluster around the predicted line, the stronger the relationship between the two variables, implying that the independent variable is a strong predictor of the dependent variable.
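To make the mechanics concrete, here is a minimal sketch in Python of an ordinary least squares fit like the one described above. The salespeople and sales figures are invented for illustration; they are not the data behind Figure 7.1.

```python
# Minimal ordinary least squares sketch on hypothetical data.
# The salespeople counts and sales figures are invented for illustration.
import numpy as np

salespeople = np.array([5, 8, 10, 12, 15, 18, 20])  # independent variable (x)
sales = np.array([52, 80, 95, 130, 148, 170, 205])   # dependent variable (y), in $000s

# np.polyfit with degree 1 performs an ordinary least squares fit,
# returning the slope and intercept of the predicted line.
slope, intercept = np.polyfit(salespeople, sales, 1)

predicted = slope * salespeople + intercept  # points along the fitted line
residuals = sales - predicted                # actual values minus predicted values

print(f"Predicted line: sales = {slope:.1f} x salespeople + {intercept:.1f}")
print("Residuals:", np.round(residuals, 1))
```

The residuals computed at the end are the vertical gaps between the actual dots and the predicted line, a concept that comes up again below.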
Figures 7.1 and 7.2 show two regressions. The outcome that the regression predicts is represented by the squares along the line; the diamonds show the actual results. Figure 7.1 shows a hypothetical relationship between the number of salespeople and the level of sales. If a diamond is below the square for the same number of salespeople, it implies that those salespeople should have produced more in sales according to the regression; the difference between the two is called a residual. Figure 7.1 shows strong evidence of dependence on the independent variable, and you could make the case that the formula of the line could be used to predict the sales level for a given number of salespeople. Figure 7.2 shows a regression of the relationship between employees' commute times and sales per employee; this regression shows a much weaker relationship, as there is little clustering around the predicted line.
Regression analysis kicks out lots of other data. Perhaps the most widely used is the coefficient of determination, which helps show how closely a regression fits, or how well the independent variable explains the outcome of the dependent variable. This is also known as R² and measures the proportion of the variance of the dependent variable that the model explains.
R² will be between 0 and 1. You can see that the R² in Figure 7.1 is high, at 0.65, while in Figure 7.2 it is low, at only 0.10. R is the correlation coefficient and measures the strength and direction of the relationship between the independent and dependent variables. Going left to right, the upward-sloping relationship in Figure 7.1 shows a positive relationship (i.e., more salespeople results in more sales), while Figure 7.2 has a downward-sloping negative correlation; in this case the implication is that the longer the commute, the lower the sales per employee.
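For readers who want to see the calculation, this short sketch computes R and R² on the same invented data as the fit above. In a simple one-variable regression, R² is just the square of R.

```python
# Correlation coefficient (R) and coefficient of determination (R^2),
# computed on the same hypothetical data as the fit above.
import numpy as np

salespeople = np.array([5, 8, 10, 12, 15, 18, 20])
sales = np.array([52, 80, 95, 130, 148, 170, 205])

# np.corrcoef returns a 2x2 correlation matrix; the off-diagonal
# entry is R, the correlation between the two variables.
r = np.corrcoef(salespeople, sales)[0, 1]
r_squared = r ** 2  # in simple regression, R^2 is just R squared

print(f"R   = {r:+.2f} (sign shows the direction of the relationship)")
print(f"R^2 = {r_squared:.2f} (share of the variance in sales explained)")
```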
These measures of correlation are often used separately from a regression analysis. For example, you may be thinking of launching a new product, like an electric fork. Before you do this, you may want to see whether in the past there was a correlation between new product launches and increases in total sales revenue, or whether new products just took revenue away from existing products.
These tools can be hugely valuable in analyzing relationships and trying to develop predictive tools. However, they need to be used with common sense. The number of times your golden retriever sniffs a bush on a morning walk may have a high correlation with the performance of U.S. Treasury futures on that day. Even if the correlation is very high, it is hard to develop a theory that your dog is causing some change in the futures market. The famous saying on this topic is that correlation is not causation.
With Excel's regression so easy to run, sometimes there is not enough thought put into how you are going to run the regression and what variables you are going to use. Let us assume you are interested in seeing whether the number of customer service people you employ impacts your sales. Before you run a regression, you should think through some of the potential pitfalls of analyzing this relationship.
Potential problem 1:
Think about your choices of variables. These employees are not salespeople; they are employees servicing customers. Should there be a relationship? Are the customer service people cross-selling or up-selling products in addition to supplying service?
Potential problem 2:
Is the predictive model set up wrong? Is the increase in customer service employees dependent on how much revenue is being generated?
Potential problem 3:
Does this relationship work well up to a certain level of sales, but at some point adding more customer service people results in sales going down, implying a nonlinear relationship where linear regression does not work well?
Potential problem 4:
Is the relationship between the two variables caused by something else?
There is always the possibility that you have the relationship between the dependent and independent variables wrong. In this example, it could be that as sales grow, customer problems increase, and this results in requiring more customer service support. In that case, customer service people may be being added because the company has an increasingly unhappy customer base and the revenue growth is unsustainable. Anytime you see a regression, the logic of the relationship needs to be questioned. Even if the output from the regression looks good, be cynical.
There are numerous other potential problems with using regression and correlation techniques that are often ignored because of how easy Excel has made them to run. Excel can run many of the tests you can use to check for these problems. A common problem in casual regression analysis is a sample size too small to develop a statistically meaningful relationship; unfortunately, small and sometimes meaningless sample sizes do not stop people from throwing data into a regression and talking about the relationships. Another common problem is autocorrelation, when the residuals of the regression are not truly independent of each other but instead follow a pattern over time. There is a statistical test for autocorrelation called the Durbin-Watson test, and there are adjustments to the data that can be undertaken to remove autocorrelation. Heteroscedasticity is a problem when the residuals of a regression model have very unequal variances; this is usually detected by plotting the residuals. Multicollinearity is a problem in multiple regressions, when you are using more than one independent variable to try to predict the dependent variable and the independent variables are highly correlated with each other. Hypothesis testing within set confidence levels is also quite commonly used to check whether a relationship is statistically significant. These are just a few examples of the issues to consider in undertaking regression, and they need to be explored if a regression analysis is really going to be helpful at all. It is also important to remember that regression analysis has issues when a relationship is not linear.
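As a sketch of how a few of these diagnostics can be run outside of Excel, the snippet below fits a regression on invented data and reports the Durbin-Watson statistic, assuming the open-source statsmodels library is available.

```python
# Sketch of basic regression diagnostics using the statsmodels library
# on invented data; none of these numbers come from the chapter.
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

rng = np.random.default_rng(0)
x = np.arange(50, dtype=float)
y = 2.0 * x + rng.normal(0, 5, size=50)  # hypothetical data

model = sm.OLS(y, sm.add_constant(x)).fit()

# Durbin-Watson statistic: values near 2.0 suggest no autocorrelation in
# the residuals; values well below 2 suggest positive autocorrelation.
print("Durbin-Watson:", round(durbin_watson(model.resid), 2))

# Heteroscedasticity is usually detected by eye: plot the residuals
# against the fitted values and look for a funnel shape, e.g. with
# matplotlib: plt.scatter(model.fittedvalues, model.resid)
print("Sample size:", int(model.nobs), "| R^2:", round(model.rsquared, 2))
```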
Earlier in the chapter residuals were discussed. Residuals can be an interesting tool in creating a framework to compare the relative value of various potential investments. Let us assume that you run a regression of annual revenue growth (the independent variable) against the annual move in the stock price for twenty companies. For any company you choose, the residual will be the difference between the predicted value on the fitted line and the dot representing the actual data for the company. A hypothetical example is shown in Figure 7.3. Each diamond represents the revenue growth rate and the stock move for one company; the fitted line, represented by the squares, shows what the regression predicts the gain in the stock price should have been for each company, given its revenue growth. Theoretically, any company dot above the line appears to have seen too large a stock price move and arguably should trade down to its fair value on the fitted line; the opposite is true of any company dot below the line. The numerical residuals (the differences between the fitted line and the dots) are shown in Figure 7.4. The table shows that, based on its revenue growth, the model predicts Alpha stock should have been up 14%, but it increased by about 2% more than that, implying it is overvalued by this methodology. This methodology could be used to develop a list of buys and sells, but it is very simplistic and can be easily manipulated by the companies and data you choose to use. Typically, it would require much more analysis to determine whether there are other factors that explain these residuals.
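A minimal sketch of this kind of residual screen is shown below. All of the companies and numbers are invented for illustration, but the logic mirrors Figures 7.3 and 7.4: fit the line, compute each company's residual, and rank.

```python
# Residual "relative value" screen sketch; all companies and figures
# below are invented for illustration.
import numpy as np

companies  = ["Alpha", "Beta", "Gamma", "Delta"]
rev_growth = np.array([10.0, 4.0, 7.0, 12.0])   # annual revenue growth, %
stock_move = np.array([16.0, 3.0, 11.0, 9.0])   # actual stock price move, %

slope, intercept = np.polyfit(rev_growth, stock_move, 1)
predicted = slope * rev_growth + intercept
residual = stock_move - predicted  # positive = rose more than predicted

# Positive residual -> possibly overvalued by this methodology;
# negative residual -> possibly undervalued.
for name, pred, res in sorted(zip(companies, predicted, residual),
                              key=lambda t: t[2], reverse=True):
    tag = "rich" if res > 0 else "cheap"
    print(f"{name:6s} predicted {pred:+5.1f}%  residual {res:+5.1f}%  ({tag})")
```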
All kinds of investment strategies focus on correlations. Some hedge funds and multi-asset strategies charge high fees and base their strategies on owning negatively correlated assets, assuming that one asset will hedge the other if there is a dislocation. There may also be trading strategies put in place to take advantage of leading or lagging correlations between an index and derivatives on that index. However, there are hellacious tales of these supposed correlations looking brilliant until a period of stress happens and the correlations break down. The infamous Asian currency crisis in 1997, the Russian default crisis in 1998, and the financial crisis of 2008 all led to changes in correlations that hurt some supposedly hedged trades. Economics does not follow the laws of physical science, and regressions and correlations that look good can change quickly when just a few factors change the data. There are myriad reasons correlations break down. Sometimes the correlation calculations may have been based on a period that does not properly cover enough varied types of environments, and sometimes there are changes in the make-up of the data. As an example, it appears that the development of fracking technology, which increased production of oil in the United States, caused a change in the relationship between the price of oil based on the European Brent Crude Oil Index and the U.S. dollar. There was a negative correlation between Brent crude oil prices and the U.S. dollar for a long time, but from 2015 through 2018 the correlation collapsed; money could have been lost if you assumed these negative correlations would be in place indefinitely. The correlations are shown in Figure 7.5. Correlations can show some powerful relationships that are well worth monitoring, but they should never be viewed as irrefutable facts on which to base 100% of your investing decision; relationships in economics, as in secondary school, can be fleeting.
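One practical defense is to watch a correlation through time rather than trusting a single full-period number. The sketch below does this with a rolling window on randomly generated stand-in return series; none of the numbers reflect actual Brent or dollar data.

```python
# Sketch of monitoring a correlation over time with pandas. The two
# series are randomly generated stand-ins, not real market data.
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
dates = pd.date_range("2014-01-01", periods=500, freq="B")
brent = pd.Series(rng.normal(0, 1, 500), index=dates)
dollar = pd.Series(-0.5 * brent + rng.normal(0, 1, 500), index=dates)

# A 60-day rolling window shows how the correlation drifts; a hedge built
# on the full-period number can quietly stop working.
rolling_corr = brent.rolling(60).corr(dollar)
print("Full-period correlation:", round(brent.corr(dollar), 2))
print("Rolling 60-day correlation, last 5 observations:")
print(rolling_corr.tail().round(2))
```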
Let Us Not Forget Standard Deviation
Averages, regressions, and correlations are some of the most common tools used in investment and business analysis; another is standard deviation. This measures how the data points in a data set spread out from the mean. A lower standard deviation occurs when the data points are all closer to the mean. If you assume that the data is in a normal distribution, the standard deviation can be a very valuable tool in developing a confidence level in the data you are analyzing. A normal distribution means that the data is equally distributed around the mean, or it can be thought of as the median and the mean being equal. When a data set is normally distributed, 68.3% of the values will fall within one standard deviation of the mean and 95.5% within two standard deviations of the mean. If you are measuring the number of tires manufactured in a factory per day and the average is 100 and the standard deviation is 20, that means on 68.3% of days the factory will produce between 80 and 120 tires (from the mean minus the standard deviation up to the mean plus the standard deviation).
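The tire-factory arithmetic can be written out directly; this sketch simply restates the numbers from the paragraph above.

```python
# The tire-factory example: normal distribution, mean 100, std dev 20.
mean, std = 100, 20

one_sigma = (mean - std, mean + std)           # covers ~68.3% of days
two_sigma = (mean - 2 * std, mean + 2 * std)   # covers ~95.5% of days

print(f"~68.3% of days: {one_sigma[0]} to {one_sigma[1]} tires")
print(f"~95.5% of days: {two_sigma[0]} to {two_sigma[1]} tires")
```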
When investors examine the volatility of an investment over time they typically use standard deviation. Investors will look at a time series of return data for an investment, and the standard deviation of those returns will be viewed as the level of volatility in the returns of the investment. Traditional portfolio theory associates volatility of returns with the risk of the investment, and a common goal is to maximize return and minimize risk. Though volatility measures a level of unpredictability, it does not actually measure loss, and loss is the true risk when you are investing. Other ways to measure risk should be considered, such as analysis of average drawdowns or value-at-risk calculations. Using the standard deviation of returns as a measure of risk also assumes that you are looking at a market that operates efficiently, meaning there are very frequent and regular transactions from which you get regular price data, such as the stock and currency markets. In less liquid investments, standard deviation may be a much less valuable measure of risk. This can also make comparisons of the volatility of different asset classes a bit difficult if they have different levels of trading liquidity.
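To illustrate the difference between volatility and loss, the sketch below computes both the annualized standard deviation of returns and the maximum drawdown on randomly generated monthly returns. The figures are invented, and the drawdown calculation shown is one common convention, not the only one.

```python
# Contrasting volatility (standard deviation of returns) with a
# loss-based measure (maximum drawdown) on invented monthly returns.
import numpy as np

rng = np.random.default_rng(2)
monthly_returns = rng.normal(0.006, 0.04, 120)  # 10 years of invented data

# Volatility: standard deviation of returns, annualized from monthly data.
annual_vol = monthly_returns.std(ddof=1) * np.sqrt(12)

# Maximum drawdown: the worst peak-to-trough fall in cumulative value,
# which is closer to the loss an investor actually experiences.
wealth = np.cumprod(1 + monthly_returns)
running_peak = np.maximum.accumulate(wealth)
max_drawdown = ((wealth - running_peak) / running_peak).min()

print(f"Annualized volatility: {annual_vol:.1%}")
print(f"Maximum drawdown:      {max_drawdown:.1%}")
```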
Regression and correlation analysis are incredibly helpful tools in understanding investment relationships, but they are tools, not the answer. Given the huge amounts of data available now and the ease of running a regression and correlation, there is more scope for poorly constructed analysis and bad investment decisions. This can lead to developing and overweighting casual relationships between variables that are not truly meaningful. However, there are things you can do to limit the risk of being misled.
–Understand the data you are using as well as possible. Try to analyze how the data was gathered: is it an estimate, is it based on a survey, has it been recently impacted by a major technological or sociological change?
–Remember that all of the data used in regression and correlation is from the past. For example, think about how meaningless historical data on the relationship between job growth and public transportation may be if the region you are studying is growing rapidly in telecommuting jobs.
–Be clear about the question you are actually trying to answer. Do not try to do too much just because there is so much data available. You may have 100 years of inflation data on 100 different products, but if you are trying to see whether rising oil prices in Europe impact the cost of manufactured goods, you will want to be selective about which periods and which data you utilize.
–Do not force the data toward a conclusion you want. Look at relationships other than the one you are trying to prove. Do not add complexity to get the answer you want.
–Do not rely on R² alone; look at other tests. There may be other factors skewing the data, so check for things like autocorrelation.
–Always look at the data visually as well as numerically. Sometimes it is helpful to present the data in visual ways other than just a scatter diagram; use histograms and other layouts.
–Share all the data and the visual representations with other people who can question your work and look at it from different angles.
–Develop your own checklist to avoid misusing these statistical techniques.
Selected Ideas from Parts 1, 2, and 3