Part B - Diagnostic plots

The residual-fit spread plot for the variable stock shows two panels side by side: the fit-mean panel and the residual panel. The fit-mean panel shows the spread of the fitted values around their mean, and the residual panel shows the spread of the residuals. Each residual is the difference between the observed (actual) value and the fitted (predicted) value. In our model, the spread of the fitted values is greater than the spread of the residuals, which means that the model explains more of the variation in stock than it leaves unexplained, and the model can be used.

The flatter, or more horizontal, the residual panel is relative to the fit-mean panel, the smaller the share of variation that the model leaves unexplained:

Figure 2.12.1: Residual fit spread plot for stock
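
A residual-fit spread plot like this one can be requested on its own from PROC REG. The following is a minimal sketch, assuming the same build dataset and MODEL statement that appear in the PROC REG code later in this section; it is not necessarily the code used to produce the figure:

```sas
/* Assumption: dataset and model are the ones used elsewhere in this section */
ODS GRAPHICS ON;

/* ONLY suppresses the default diagnostics panel; RFPLOT requests the
   residual-fit spread plot by itself */
PROC REG DATA=build PLOTS(ONLY)=(RFPLOT);
    MODEL stock = basket_index -- m1_money_supply_index;
RUN;
QUIT;
```

The QUIT statement ends the procedure, because PROC REG otherwise stays active and waits for further statements.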

The Q-Q plot of residuals points to some observations at the lower and upper ends of the quantile range that have larger residuals and don't fit as well as the rest of the data. When the stock price is low, the model under-predicts the stock value in some cases, and when the stock price is at the higher end, the model over-predicts the value in some cases.

The modeler noted this observation in case the bias can be overcome by building an alternative model:

Figure 2.12.2: Q-Q plot for regression model

The residuals are approximately normally distributed, with no particular skewness observed:

Figure 2.13: Distribution of residuals for regression model
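
The visual check of normality can be backed up with numbers. The sketch below is one possible approach, assuming the same build dataset and model as the rest of this section; the output dataset name resid_out and the variable names resid and pred are hypothetical names chosen for illustration:

```sas
/* Draw the Q-Q plot and residual histogram, and save residuals to a dataset */
PROC REG DATA=build PLOTS(ONLY)=(QQPLOT RESIDUALHISTOGRAM);
    MODEL stock = basket_index -- m1_money_supply_index;
    OUTPUT OUT=resid_out R=resid P=pred;   /* resid_out, resid, pred: illustrative names */
RUN;
QUIT;

/* The Moments table reports skewness; the NORMAL option adds tests for normality */
PROC UNIVARIATE DATA=resid_out NORMAL;
    VAR resid;
RUN;
```

A skewness value close to zero would be consistent with the symmetric shape seen in the histogram.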

Another main assumption of regression is that the residuals shouldn't form any pattern. Looking at the residual-by-predicted chart for stock, the residuals do not seem to form any particular pattern. Hence, this assumption also holds true for the model:

Figure 2.14.1: Residuals by predicted for regression model

The residual-by-predicted chart shows that the residuals are larger when the stock price is around $4.80-$4.90. This chart reaffirms the observation made earlier, that the model over-predicts the stock price when dealing with higher observed stock prices. Most of the residuals beyond the $5.50 stock price are negative. It is important for the modeler to decide whether the spread of the residuals is acceptable. This is a subjective call at times, and the acceptable level of spread differs from one business problem to another.

One of the main assumptions of a regression is that the residuals are normally distributed. By looking at the Q-Q plot, we can say that this assumption holds in this case:

Figure 2.14.2: Observed by predicted for regression model

The RStudent and the Cook's D plots can be used to assess the model output. A significant number of observations with an absolute value greater than 2 on the RStudent plot should be of concern to the modeler. The RStudent and Cook's D plots highlight that there are at least a couple of data points that have high leverage and are influencing the overall model fit disproportionately. Remember that these aren't data-quality issues.

We already had a data point, 0.37, that was changed to 4.37, as the earlier observation was a data quality issue. Let us try to identify these two data points by altering our model statement:

Figure 2.15.1: RStudent for regression model

In the RStudent outlier chart, we can see that there are a couple of points with a leverage of more than 0.03 that seem to be extreme observations.

Figure 2.15.2: RStudent, outlier, and Cook's D for regression model

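In Figure 2.15.3, the Cook's D chart doesn't point to any particular data points that need investigation. A common rule of thumb is to investigate observations whose Cook's D exceeds 4/n, where n is the number of observations, or exceeds 1; the cutoff of 2 applies to the absolute value of RStudent rather than to Cook's D.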

Figure 2.15.3: Cook's D for regression model
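
For reference, the two statistics behind these charts can be written as follows. This is the standard textbook form rather than anything taken from the output in this section. The symbols are introduced here for illustration: e_i is the residual of observation i, h_ii is its leverage, s^2 is the mean squared error, s_(i) is the error standard deviation estimated with observation i left out, and p is the number of estimated parameters, including the intercept:

```latex
\[
\text{RStudent}_i = \frac{e_i}{s_{(i)}\sqrt{1 - h_{ii}}},
\qquad
D_i = \frac{e_i^{2}}{p\,s^{2}} \cdot \frac{h_{ii}}{(1 - h_{ii})^{2}}
\]
```

Cook's D grows when an observation combines a large residual with high leverage, which is why it is read alongside the RStudent-by-leverage chart.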

The PROC REG code for identifying high leverage observations is as follows:

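/* Label points by date on the RStudent-by-leverage and Cook's D plots */
PROC REG DATA=build PLOTS(ONLY LABEL)=(RStudentByLeverage CooksD);
    ID date;
    /* The double dash selects every variable stored between the two named ones */
    MODEL stock = basket_index -- m1_money_supply_index;
RUN;
QUIT;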

As we can see from the partial output of the code run to highlight the leveraged data points, the stock prices observed on December 31, the 7th of March and April, and the 5th of July, 2017, have a high influence. Having checked the values of the stock prices and other variables, the modeler did not find any reason why these observations shouldn't be included in the analysis dataset. Hence, these values were retained. No transformation or data treatment was deemed necessary:

Figure 2.16: Outlier labels for regression model
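
The labeled points can also be listed as a table instead of being read off the chart. The following is a minimal sketch, assuming the same build dataset and MODEL statement as the code above. The dataset names influence and flagged, the variable names rstudent, cooksd, and leverage, and the cutoffs (absolute RStudent above 2, Cook's D above 4/n) are illustrative choices, not the author's code:

```sas
/* Save the influence statistics for every observation (names are illustrative) */
PROC REG DATA=build NOPRINT;
    ID date;
    MODEL stock = basket_index -- m1_money_supply_index;
    OUTPUT OUT=influence RSTUDENT=rstudent COOKD=cooksd H=leverage;
RUN;
QUIT;

/* Keep only observations beyond the assumed rule-of-thumb cutoffs.
   NOBS= returns the row count of the dataset, which equals the number of
   observations used in the model only if no rows were dropped for missing values */
DATA flagged;
    SET influence NOBS=n;
    IF ABS(rstudent) > 2 OR cooksd > 4 / n;
RUN;

PROC PRINT DATA=flagged;
    VAR date stock rstudent cooksd leverage;
RUN;
```

Listing the flagged dates next to their stock values makes it easier to check each one against the source data before deciding whether to retain it.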

In this section, we validated that the model abides by two of the underlying principles of regression. We also understood the significance of various charts, and created a slightly modified version of an existing chart, to better understand our model.
