Model selection and fitting

The modeler decided to test which of the variables could be important in predicting the variable stock. PROC CORR was run to measure the strength of the relationship between each predictor (independent/regressor) variable and the outcome (dependent) variable.

The PROC CORR code is as follows:

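PROC CORR DATA=model outp=corr nosimple; /* save the correlations to the data set corr; suppress simple descriptive statistics */
ID Date;
WITH Stock; /* correlate Stock with each variable listed in VAR */
VAR Basket_index -- M1_money_supply_index; /* every variable from Basket_index through M1_money_supply_index, in data set order */
RUN;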

The correlation was run across all eight independent variables. Correlation values lie between -1 and 1. A negative sign denotes that the dependent and independent variables are inversely correlated. The closer the correlation coefficient is to -1 or 1, the stronger the relationship between the two variables:

Figure 2.7: Correlation output

The least correlated variable seems to be the media analytics index. However, at this point, the modeler was careful not to confuse the strength of a correlation with its significance. The modeler planned to test for significance later on, to ascertain whether a variable that is strongly correlated with the dependent variable is also significant in explaining its variation.

One other check planned for later was for multicollinearity. Multicollinearity is said to occur when two or more independent variables are highly correlated with each other. For example, if we are trying to predict the on-time arrival of a train, having the highest speed and having the fewest stops might both show a high correlation with on-time arrival. However, the highest speed and the fewest stops could themselves be highly correlated, as the railways might have given the fewest stops to the fastest trains. Such relationships between independent variables may degrade the quality of the model if both variables are selected in the final model.
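One way such a multicollinearity check might be run in SAS is sketched here. The source does not show how the modeler performs this check, so the approach is an assumption: it reuses the data set and variable range from the PROC CORR step, first correlating the predictors with one another and then requesting variance inflation factors from PROC REG. The cutoff mentioned in the comment is a common rule of thumb, not a value taken from this analysis.

```sas
/* Pairwise correlations among the predictors themselves. With no WITH */
/* statement, every VAR variable is correlated with every other one.   */
PROC CORR DATA=model nosimple;
VAR Basket_index -- M1_money_supply_index;
RUN;

/* VIF and TOL add variance inflation factors and tolerances to the    */
/* parameter estimates table. A VIF above roughly 10 is often read as  */
/* a sign of multicollinearity (rule of thumb only).                   */
PROC REG DATA=model;
MODEL Stock = Basket_index -- M1_money_supply_index / VIF TOL;
RUN;
QUIT;
```

Pairs of predictors with coefficients close to -1 or 1 in the first step, or predictors with large VIF values in the second, would be candidates for closer inspection before the final model is chosen.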

The modeler now decided to build the regression model. Correlation measures the strength of the relationship between two variables. However, in testing correlation, the variables aren't explicitly assigned the roles of dependent and independent. Regression goes a step further than correlation: it assigns those roles and estimates how much the dependent variable changes with each independent variable. The model can also help us determine which variables are significant, and it provides an equation that describes the relationship between all of the variables. Before we look at the regression model of stock, let's delve into the concepts of regression.

Let's think about examples spanning the economy, marketing, and education, to make things simpler. The GDP of an economy can depend on various factors, including the rate of employment, the growth rate of manufacturing and services, any adverse weather events, oil prices, and so on. In marketing, the sales of a new beauty cream can depend on the size of the target market, the brand perception of the company, the advertising budget, and so on. In the education sector, the performance of a student can depend on the quality of teachers, time spent in activities outside of class, family background, and so on. All three outcomes (GDP, sales of a new beauty cream, and the performance of a student) are seemingly dependent on multiple factors, such as the price of oil, the advertising budget, and the quality of teachers.

But not all of the independent variables will have an equal influence on the dependent variable. The extremely poor quality of teachers in a school might not be fully compensated for by a student's parents being highly literate. In this case, the performance of a student (the dependent variable) might be influenced to a different degree by the quality of teachers and by family background. Hence, measuring this relationship becomes important, and that's what regression tries to do. Consider the mathematical equation that users of regression analysis may have come across in the past:

Y = a₁X₁ + b₂X₂ + c₃X₃ + Z

The sales of a new beauty cream can be predicted if we use the target size of the market, the brand perception, and the advertising budget, and assign each of these independent variables a weight by which it is multiplied. The weight is the influence that each independent variable has in determining the dependent variable, Sales. The simplified multiple regression equation for the marketing application is given here:

Sales = Target Size*X₁ + Perception*X₂ + Ad Budget*X₃ + Unknown

In the preceding formula, the following applies:

Sales is the dependent variable, influenced by multiple variables
X₁-X₃ are the measures of influence that each variable has on the dependent variable (Sales)
If 100% of Sales can be determined by adding the three weighted variables in the equation, then Unknown (the intercept) will have the value 0
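To make the arithmetic concrete, the short DATA step below evaluates the equation for a single observation. Every number in it is hypothetical and chosen only for illustration; none comes from the data discussed in this chapter.

```sas
/* Hypothetical values only: one observation and illustrative weights. */
DATA _NULL_;
target_size = 2000;  x1 = 0.5;  /* weight on target size      */
perception  = 70;    x2 = 10;   /* weight on brand perception */
ad_budget   = 50;    x3 = 20;   /* weight on ad budget        */
unknown     = 300;              /* constant (intercept)       */
sales = target_size*x1 + perception*x2 + ad_budget*x3 + unknown;
PUT sales=;  /* writes the computed value of sales to the SAS log */
RUN;
```

With these made-up inputs, the three weighted terms contribute 1,000, 700, and 1,000, and the constant adds 300, so Sales evaluates to 3,000.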

The unknown (Z) is also called the constant, or the intercept. If all independent variables had a value of 0, then Z would be the expected (mean) value of the dependent variable. So far, we have spoken about the independent variables as though they have a positive relationship with the dependent variable. Quite often, this isn't true. The relationship can be inverse/negative:

Sales = Target Size*X₁ + Perception*X₂ + Ad Budget*X₃ - CompetitorCampaign*X₄ + Unknown

Competitor campaigns on similar products are thought to have a negative impact and reduce the sales of the company in the preceding equation. By now, we have an overview of what regression does. But how does the regression equation get formulated? Because a plot on the page is two-dimensional, let us look at a simple regression equation with a single independent variable, rather than multiple regression:

Sales = Ad Budget*X₁ + Constant

Looking at the scatter plot of Sales and Ad Budget, there seems to be a linear relationship between budget and sales. For most of the data points, sales increase as the budget (x-axis) increases:

Figure 2.8: Scatter plot of ad budget and sales

Let's now try to draw a straight line that represents the regression equation in Figure 2.9:

Figure 2.9: Regression line plot - regression fundamentals

Rather than just drawing any line through the scatter plot, the aim is to draw the line in such a way that:

Most points fall on or around the regression line that we draw across the scatter plot
The distances of the individual observations from the regression line are as small as possible
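Ordinary least squares, the method PROC REG uses, makes the second criterion precise: it chooses the line that minimizes the sum of the squared vertical distances between the observations and the line. As a minimal sketch, assuming a hypothetical data set named ad_sales with the variables ad_budget and sales (neither name comes from the source), the fit and a plot of this kind might be requested as follows:

```sas
/* Hypothetical data set and variable names, used only for illustration. */

/* Fit Sales = Ad Budget*X1 + Constant by ordinary least squares. */
PROC REG DATA=ad_sales;
MODEL sales = ad_budget;
RUN;
QUIT;

/* Scatter plot with the fitted line. CLM adds confidence limits for   */
/* the mean response and CLI adds prediction limits for individual     */
/* observations; both use a 95% level by default.                      */
PROC SGPLOT DATA=ad_sales;
REG X=ad_budget Y=sales / CLM CLI;
RUN;
```

The parameter estimates table from PROC REG reports the constant (labeled Intercept) and the weight on ad_budget.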

We have also added a confidence limit around the regression line, showing the scatter of the data points around the line. Most of the data points fall on or around the regression line, within the 95% confidence limit. However, some points near the middle of the line and the top part are far from the regression line. In modeling, we measure the distances of these data points from the regression line to understand the predictive power of the regression equation. The data points where sales are 5,000 million against an ad budget of 56 million don't fall on the regression line. This could happen because, at this inflection point of budget and sales, the impact of the budget on sales may be higher or lower in magnitude than the linear effect that we saw at other points.

There could be some other factors contributing to the sales that have not been captured in the regression equation. Maybe some of the customers were paid a bonus at this point, and sales increased. A bonus variable could be added to capture this: a dummy variable that takes the value 1 at the inflection points of budget and sales where we think it has a role to play, and 0 elsewhere. The regression line could be redrawn after the addition, to see if the dummy variable is significant.
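The dummy-variable idea can be sketched as follows. The data set ad_sales, its variables, and the rule used to flag the affected observations are all hypothetical; in practice, the flag would come from knowledge of when the bonus was actually paid.

```sas
/* Hypothetical names and flagging rule, for illustration only. */
DATA ad_sales_dummy;
SET ad_sales;
/* A comparison evaluates to 1 when true and 0 when false, so bonus is */
/* 1 only for observations with an ad budget of 56.                    */
bonus = (ad_budget = 56);
RUN;

/* Refit with the dummy included. The Pr > |t| value for bonus in the  */
/* parameter estimates table indicates whether it is significant.      */
PROC REG DATA=ad_sales_dummy;
MODEL sales = ad_budget bonus;
RUN;
QUIT;
```

If the dummy turns out not to be significant, it can simply be dropped and the original single-variable equation retained.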

Having explored the basics of regression, let's turn back to the main business problem of forecasting the stock value of our mobile phone manufacturer. Our modeler has just run a regression model on the data. Let's evaluate its results.

The PROC REG code is as follows:

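PROC REG DATA=build plots=diagnostics(unpack); /* request the fit diagnostic plots, each as a separate graph */
ID date;
MODEL stock = basket_index -- m1_money_supply_index; /* regress stock on every variable in this range */
RUN;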
