CHAPTER 8

Qualitative Events and Seasonality in Multiple Regression Models

Chapter 8 Preview

When you have completed reading this chapter you will be able to:

  • Understand the meaning of “seasonality” in business data.
  • Take seasonality into account as a pattern in your models.
  • Account for seasonality by using dummy variables.
  • Understand the use of dummy variables to represent a wide range of events.
  • Modify the Market Share model from Chapter 7 for a nonrecurring event.
  • Use dummy variables to account for seasonality in women’s clothing sales.

Introduction

Most of the variables you may want to use in developing a regression model are readily measurable with values that extend over a wide range and are interval or ratio data. All of the examples you have examined so far have involved variables of this type.

However, on occasion you may want to account for the effect of some event or attribute that has only two (or a few) possible cases. It either “is” or “is not” something. It either “has” or “does not have” some attribute. Some examples include the following: a month either is or is not June; a person either is or is not a woman; in a particular time period, there either was or was not a labor disturbance; a given quarter of the year either is or is not a second quarter; a teacher either has a doctorate or does not have one; a university either does or does not offer an MBA; and so forth.

The manner in which you can account for these noninterval and nonratio data is by using dummy variables. The term “dummy” variable refers to the fact that this class of variable “dummies,” or takes the place of, something you wish to represent that is not easily described by a more common numerical variable. A dummy variable (or several dummy variables) can be used to measure the effects of what may be qualitative attributes. A dummy variable is assigned a value of 1 or 0, depending on whether or not the particular observation has a given attribute. To explain the use of dummy variables, consider two examples.

Another Look at Miller’s Foods’ Market Share Regression Model

The first example is an extension of the multiple linear regression discussed in Chapter 7 dealing with a model of Miller’s Foods’ market share (MS). In that situation, you saw a regression model in which the dependent variable MS was a function of three independent variables: price (P), advertising (AD), and an index of competitors’ advertising (CAD). You saw the results of that regression in Table 7.2 and in Figure 7.1.

However, it so happens that in the second quarter of the third year (quarter 10), another major firm had a fire that significantly reduced its production and sales. As an analyst, you might suspect that this event could have influenced the MS of all firms for that period (including Miller’s Foods). At the very least, it probably introduced some noise (error) into the data used for the regression and may have reduced the explanatory ability of the regression model.

You can use a dummy variable to account for the influence of the fire at the competitor’s facility on Miller’s Foods’ MS and to measure its effect. To do so, you simply create a new variable—call it D—that has a value of 0 for every observation except the tenth quarter and a value of 1 for the tenth quarter. The data from Table 7.1 are reproduced in Table 8.1, but now include the addition of this dummy variable in the last column.
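As a hedged sketch of how such a variable might be built in practice (assuming the data sit in a pandas DataFrame; the column names here are illustrative, not from the original data file):

```python
import pandas as pd

# Twelve quarterly observations; the dummy D is 1 only in quarter 10,
# the period of the fire at the competitor's facility.
df = pd.DataFrame({"Period": range(1, 13)})
df["D"] = (df["Period"] == 10).astype(int)

print(df["D"].tolist())  # 0 everywhere except the tenth entry
```

The comparison `(df["Period"] == 10)` produces True/False values, which `astype(int)` converts to the 1/0 coding a dummy variable requires.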

Table 8.1 Twelve quarters (three years) of Miller’s Foods’ MS multiple regression data with the dummy variable

Period    MS      P       AD     CAD    D
  1       19     5.2       500    11    0
  2       17     5.32      550    11    0
  3       14     5.48      550    12    0
  4       15     5.6       550    12    0
  5       18     5.8       550     9    0
  6       16     6.03      660    10    0
  7       16     6.01      615    10    0
  8       19     5.92      650    10    0
  9       23     5.9       745     9    0
 10       27     5.85      920    10    1
 11       23     5.8     1,053    11    0
 12       21     5.85      950    11    0

Using the data from Table 8.1 in a multiple regression analysis, you get the following equation for MS:

MS = 72.908 – 7.453(P) + .017(AD) – 2.245(CAD) + 4.132(D)

The complete regression results are shown in Table 8.2. You see that the coefficient of the dummy variable representing the fire is 4.132. Because it is positive, it indicates that when there was a fire at a major competitor’s facility, the market share (MS) of Miller’s Foods increased. This certainly makes sense. In addition, you can say that this event accounted for 4.132 percentage points of the 27 percent MS obtained during that second quarter of Year 3.
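To see how the fitted equation reproduces the tenth-quarter observation, you can evaluate it at that quarter’s values from Table 8.1 (P = 5.85, AD = 920, CAD = 10, D = 1). This sketch uses the rounded coefficients as printed, so the result differs slightly from the actual value of 27:

```python
def predict_ms(p, ad, cad, d):
    # Fitted equation from Table 8.2, using the coefficients as printed
    # (so the result carries a little rounding error).
    return 72.908 - 7.453 * p + 0.017 * ad - 2.245 * cad + 4.132 * d

# Quarter 10 values from Table 8.1, with the fire dummy switched on
ms_q10 = predict_ms(5.85, 920, 10, 1)
print(round(ms_q10, 2))  # 26.63, close to the actual MS of 27
```

Setting `d=0` in the same call shows what the model would have predicted for that quarter had the fire not occurred: about 4.13 percentage points less.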

You should compare these regression results with those from Table 7.2. Note particularly the following:

  1. The adjusted R-squared increased from 0.812 to 0.906.
  2. The standard error of the regression (SEE) fell from 1.676 to 1.183.

Table 8.2 Miller’s Foods’ MS regression results including a dummy variable

Regression Statistics
Multiple R              0.970        DW = 1.83
R-square                0.940
Adjusted R-square       0.906
Standard error          1.183
Observations           12

ANOVA           df      SS         MS        F         Sig. F
Regression       4     154.202    38.550    27.541     0.000
Residual         7       9.798     1.400
Total           11     164

               Coefficients   Std. error   t Stat    p-Value   p/2
Intercept        72.908        13.955       5.225     0.001    0.001
P                –7.453         1.939      –3.845     0.006    0.003
AD                0.017         0.002       7.045     0.000    0.000
CAD              –2.245         0.468      –4.802     0.002    0.001
D                 4.132         1.374       3.008     0.020    0.010

Figure 8.1 Miller’s Foods’ MS: actual and predicted with dummy variable. The line representing the estimated MS for Miller’s Foods is derived from the following regression model: MS = 72.908 – 7.453(P) + .017(AD) – 2.245(CAD) + 4.132(D)

You see that the regression with the dummy variable is considerably better than the first model presented in Chapter 7. This is further illustrated by the graph in Figure 8.1, which shows the actual MS and the MS as estimated with the multiple regression model above. Comparing Figure 8.1 with Figure 7.1 provides a good visual representation of how much the model is improved by adding the dummy variable to account for the abnormal influence in the tenth quarter.

Modeling the Seasonality of Women’s Clothing Sales

As a second example of the use of dummy variables, let’s look at how they can be used to account for seasonality. Table 8.3 contains data for women’s clothing sales (WCS) on a monthly basis beginning in January 2000. The full series of sales, from January 2000 to March 2011, is also graphed in Figure 8.2. You see in the graph that December sales are always high due to the heavy holiday sales experienced by most retailers, while January tends to have relatively low sales (when holiday giving turns to bill paying).

Eleven dummy variables can be used to identify and measure the seasonality in the firm’s sales. In Table 8.3, you see that for each February observation the February dummy variable has a value of 1, while the March and April dummies are 0 (as are the dummy variables representing all the remaining months). Note that for each January none of the 11 dummy variables has a value of 1. That is because the first month (i.e., January) is not February, March, or any other month. Having a dummy variable for every month other than January makes January the base month for the model.

The dummy variables for the other months will then determine how much, on average, those months vary from the first month base period. Any of the 12 months could be selected as the base. In this example, January is selected as the base month because it is the lowest month of the year on average, after taking the upward trend in the data into account. Thus, you would expect the February, March, and April dummy variables to all have positive coefficients in the regression model.
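One way to build the 11 monthly dummies is with pandas; this is a hedged sketch (the dates and sample length here are illustrative) in which January is dropped so that it serves as the base month:

```python
import pandas as pd

# Three illustrative monthly dates; a Categorical over all 12 month
# names guarantees a column for every month even in a short sample.
dates = pd.date_range("2000-01-01", periods=3, freq="MS")
month_order = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
               "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
months = pd.Categorical(dates.strftime("%b"), categories=month_order)
dummies = pd.get_dummies(months, drop_first=True).astype(int)  # Jan dropped

print(list(dummies.columns))   # Feb ... Dec: the 11 dummy variables
print(dummies.iloc[0].sum())   # 0: a January row has no dummy set
print(dummies.iloc[1]["Feb"])  # 1: February's own dummy
```

The `drop_first=True` argument is what makes January the base: its column is omitted, so a January observation is coded as all zeros, exactly as in Table 8.3.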

First let’s see what happens if you just estimate a simple regression using personal income (PI) as the only independent variable. The equation that results is:

WCS = 1,187.12 + 0.165(PI)

Table 8.3 Women’s clothing sales (in M$) and 11 seasonal dummy variables. Only the first two years are shown so that you see how the dummy variables are set up

This function is plotted in the top graph of Figure 8.2 along with the raw sales data. As you can see, the estimate goes roughly through the center of the data without coming very close to any month’s actual sales on a consistent basis.

Now if you add the 11 dummy variables to the regression model, the multiple regression model that results is:

WCS = 572.16 + 0.16(PI) + 147.36(Feb) + 765.96(Mar) + 859.48(Apr)
      + 900.81(May) + 631.60(Jun) + 411.26(Jul) + 584.74(Aug)
      + 579.07(Sep) + 701.99(Oct) + 901.60(Nov) + 2,094.95(Dec)

Figure 8.2 Women’s clothing sales data. Compare these two graphs to see how helpful the dummy variables for seasonality can be

Table 8.4 Regression results for WCS with seasonal dummy variables

Regression Statistics
Multiple R              0.963        DW = 0.52
R-square                0.927
Adjusted R-square       0.919
Standard error        163.410
Observations          135

ANOVA           df      SS              MS            F          Sig F
Regression      12      41,101,414      3,425,118     128.269    0.000
Residual       122       3,257,726.8       26,702.68
Total          134      44,359,141

               Coefficients   Std. error   t Stat    p-Value   p/2
Intercept        572.163      113.464       5.043     0.000    0.000
PI                 0.157        0.010      16.061     0.000    0.000
Feb              147.356       66.712       2.209     0.029    0.015
Mar              765.957       66.715      11.481     0.000    0.000
Apr              859.476       68.218      12.599     0.000    0.000
May              900.811       68.212      13.206     0.000    0.000
Jun              631.599       68.211       9.259     0.000    0.000
Jul              411.262       68.211       6.029     0.000    0.000
Aug              584.738       68.212       8.572     0.000    0.000
Sep              579.067       68.213       8.489     0.000    0.000
Oct              701.994       68.216      10.291     0.000    0.000
Nov              901.599       68.219      13.216     0.000    0.000
Dec            2,094.946       68.232      30.703     0.000    0.000

Now look at the lower graph in Figure 8.2, which shows estimated sales based on a model that includes PI and the 11 monthly dummy variables superimposed on the actual sales pattern. In many months, the estimated series performs so well that the two cannot be distinguished from one another. The complete regression results for this model are found in Table 8.4.

The interpretation of the regression coefficients for the seasonal dummy variables is straightforward. Remember that as this model is constructed, the first month (January) is the base period. Thus, the coefficients for the dummy variables can be interpreted as follows:

  • For the February dummy the regression coefficient 147.356 indicates that on average February sales are $147.356 million above January (base period) sales.
  • For the March dummy the regression coefficient 765.957 indicates that on average March sales are $765.957 million above January (base period) sales.
  • For the April dummy the regression coefficient 859.476 indicates that on average April sales are $859.476 million above January (base period) sales.

The remaining eight dummy variables would be interpreted in the same manner.

The regression coefficient for income (0.157) means that WCS are increasing at an average of $0.157 million for every billion dollar increase in PI. Note that WCS are denominated in millions of dollars while income is denominated in billions of dollars. Knowing the units of measurement for each variable is important.
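A sketch of how these coefficients combine, using the values from Table 8.4; the PI figure here is purely hypothetical, chosen only to illustrate the arithmetic and the units:

```python
# Seasonal lifts (in $ millions) over the January base, from Table 8.4.
# Only a few months are shown; January's lift is zero by construction.
seasonal = {"Jan": 0.0, "Feb": 147.356, "Dec": 2094.946}

def predict_wcs(pi_billions, month):
    # PI is in billions of dollars; WCS comes out in millions of dollars.
    return 572.163 + 0.157 * pi_billions + seasonal[month]

# At a hypothetical PI of $10,000 billion: December minus January
# isolates the seasonal lift, and a $1 billion PI rise adds $0.157 million.
print(round(predict_wcs(10_000, "Dec") - predict_wcs(10_000, "Jan"), 3))  # 2094.946
print(round(predict_wcs(10_001, "Jan") - predict_wcs(10_000, "Jan"), 3))  # 0.157
```

Subtracting the two same-PI predictions is exactly how a dummy coefficient should be read: everything else cancels, leaving only the month’s average difference from the base.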

Let’s summarize some of the results of these two regression analyses of WCS:

                                        Without seasonal    With seasonal
                                        dummy variables     dummy variables
R2 and Adjusted R2                          0.167               0.919
F-Statistic                                27.8                128.3
Standard error of regression (SEE)        525.16              163.41

The model that includes the dummy variables to account for seasonality is clearly the superior model based on these diagnostic measures. The better fit of the model can also be seen visually in Figure 8.2. However, the DW statistic (0.52) in the larger model does indicate the presence of positive serial correlation.
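The DW statistic cited here can be computed directly from a model’s residuals. This is a hedged sketch of the standard formula, not tied to the WCS data (which are not fully reproduced in this chapter):

```python
import numpy as np

def durbin_watson(resid):
    # DW = sum of squared successive residual differences divided by the
    # sum of squared residuals. Values near 2 suggest no first-order
    # serial correlation; values near 0 suggest positive correlation.
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

# Slowly drifting residuals mimic positive serial correlation,
# pushing DW toward 0 (as with the WCS model's DW of 0.52).
print(round(durbin_watson([1, 2, 3, 4, 5, 6]), 3))  # 0.055
```

A residual series that alternates sign from one period to the next would instead push DW above 2, signaling negative serial correlation.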

As you have seen, the development of a model such as this to account for seasonality is not difficult. Dummy variables are not only an easy way to account for seasonality; they can also be used effectively to account for many other effects that cannot be handled by ordinary continuous numeric variables.

The model for WCS can be further improved by adding two more variables: (1) the University of Michigan Index of Consumer Sentiment (UMICS), which measures the attitude of consumers regarding the economy; and (2) the women’s unemployment rate (WUR). Both new variables are statistically significant and have the expected signs. You would expect that as consumer confidence climbs, sales would also increase; the positive sign on the UMICS coefficient is consistent with this expectation. Likewise, as women’s unemployment rises you would expect sales of women’s clothing to decrease; the negative sign on the WUR coefficient is consistent with this belief. These results are reported in Table 8.5; note that the adjusted R-square has increased to 0.966. This model still has positive serial correlation, so you would want to be cautious about the t-tests. However, since all the t-ratios are again quite large, this may not be a problem. Also, because serial correlation does not bias the coefficients, they can be correctly interpreted.

Table 8.5 Regression results for WCS with the addition of the UMICS and WUR and seasonal dummy variables

Regression Statistics
Multiple R              0.985        DW = 1.29
R-square                0.970
Adjusted R-square       0.966
Standard error        105.403
Observations          135

ANOVA           df      SS             MS           F          Sig F
Regression      14      43,025,964     3,073,283    276.628    0.000
Residual       120       1,333,176        11,110
Total          134      44,359,141

               Coefficients   Std. error   t Stat    P-value   p/2
Intercept       –493.609      208.160      –2.371     0.019    0.010
PI                 0.240        0.010      24.092     0.000    0.000
WUR              –69.366        8.044      –8.624     0.000    0.000
UMICS              6.581        1.199       5.488     0.000    0.000
Feb              170.311       43.187       3.944     0.000    0.000
Mar              795.541       43.260      18.390     0.000    0.000
Apr              879.704       44.303      19.857     0.000    0.000
May              916.233       44.175      20.741     0.000    0.000
Jun              643.657       44.091      14.599     0.000    0.000
Jul              427.283       44.096       9.690     0.000    0.000
Aug              612.048       44.249      13.832     0.000    0.000
Sep              609.528       44.346      13.745     0.000    0.000
Oct              746.065       44.594      16.730     0.000    0.000
Nov              938.182       44.384      21.138     0.000    0.000
Dec            2,116.818       44.094      48.007     0.000    0.000

In the appendix to this chapter, you will see another complete example of how dummy variables can be used to account for seasonality in data. This example expands on the analysis of occupancy for the Stoke’s Lodge model.

What You Have Learned in Chapter 8

  • You understand the meaning of “seasonality” in business data.
  • You can take seasonality into account as a pattern in your models.
  • You know how to account for seasonality by using dummy variables.
  • You understand the use of dummy variables to represent a wide range of events.
  • You are able to modify the MS model from Chapter 7 for a nonrecurring event.
  • You can use dummy variables to account for seasonality in WCS.

Appendix

Stoke’s Lodge Final Regression Results

Introduction

Recall that the Stoke’s Lodge Monthly Room Occupancy (MRO) series represents the number of rooms occupied per month in a large independent motel. As you can see in Figure 8A.1, there is a downward trend in occupancy and what appears to be seasonality. The owners wanted to evaluate the causes for the decline. During the time period being considered, there was a considerable expansion in the number of casinos in the state, most of which had integrated lodging facilities. Also, gas prices (GP) were increasing. You have seen that multiple regression can be used to evaluate the degree and significance of these factors.

The Data

The dependent variable for this multiple regression analysis is MRO. These values are shown in Figure 8A.1 for the period from January 2002 to May 2010. You see sharp peaks that probably represent seasonality in occupancy. For a hotel in this location, December would be expected to be a slow month.

Figure 8A.1 Stoke’s Lodge room occupancy. This time-series plot of occupancy for Stoke’s Lodge appears to show considerable seasonality and a downward trend

Management had concerns about the effect of rising gas prices (GP) on the willingness of people to drive to this location (there is no good public, train, or air transportation available). They also had concerns about the influence that an increasing number of full service casino operations might have on their business because many of the casinos had integrated lodging facilities. Rather than use the actual number of casinos, the number of casino employees (CEs) is used to measure this influence because the casinos vary a great deal in size.

In Chapter 4, you saw the following model of monthly room occupancy (MRO) as a function of gas price (GP):

MRO = 9322.976 – 1080.448(GP)

You evaluated this model based on a four-step process and found it to be a pretty good model. The model was logical and statistically significant at a 95 percent confidence level. However, the coefficient of determination was only 14.8 percent (R2 = 0.148) and the model had positive serial correlation (DW = 0.691).

In Chapter 6, you saw that the original model could be improved by adding the number of casino employees (CE) in the state. This yielded the following model:

MRO = 15,484.483 – 1,097.686(GP) – 1,013.646(CE)

This time you evaluated the model based on a five-step process because you had two independent variables rather than one, so you needed to consider multicollinearity. Again the model seemed pretty good: it was logical and statistically significant at a 95 percent confidence level. The coefficient of determination increased to 20.1 percent (Adjusted R2 = 0.201). This model also had positive serial correlation (DW = 0.730). You will see that these results can be improved by using seasonal dummy variables.

A graph of the results from the model with just two independent variables is shown in Figure 8A.2. The dotted line representing the predictions follows the general downward movement of the actual occupancy. But what do you think is missing? Why are the residuals (errors) so large for most months? Do you think it could be because the model does not address the issue of seasonality in the data? If your answer is “yes” you are right. And, as with WCS, dummy variables can be used to deal with the seasonality issue.

Figure 8A.2 Actual MRO and MRO predicted (MROP). The predicted values are shown by the dotted line, based on MRO as a function of GP and CE. The coefficient of determination for this model is 20.1 percent

The Hypotheses

When you estimate a regression you know that you have certain hypotheses in mind. In this case you know from Chapters 4 and 6 that you have the following hypotheses related to GP and CE.

H0: β ≥ 0

H1: β < 0

That is, you have a research hypothesis that for both GP and CE the relationship with MRO is inverse. When either GP or CE increase you would expect MRO to decrease, so the alternative (research) hypothesis is that those slopes would be negative.

Based on conversations with Stoke’s Lodge management, you can expect that December is their slowest month of the year on average. You can set up dummy variables for all months except December to test to see if this is true. Thus, for the dummy variables representing January– November the hypotheses would be:

H0: β ≤ 0

H1: β > 0

This is because, by construction, you expect the coefficients of all 11 dummy variables to be positive: each represents a month that should have higher occupancy than December, the slowest (base) month.
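Since regression software reports two-tailed p-values, these one-tailed hypotheses are tested by halving the reported p-value and checking the coefficient’s sign. A small sketch of that decision rule, with inputs taken from Table 8A.2:

```python
def one_tailed_significant(coef, two_tailed_p, expected_sign, alpha=0.05):
    # For a directional hypothesis, the coefficient must have the
    # expected sign AND the halved p-value must beat alpha.
    sign_ok = (coef > 0) if expected_sign > 0 else (coef < 0)
    return sign_ok and (two_tailed_p / 2) < alpha

# May dummy from Table 8A.2: coefficient 3,671.132, p-value 0.000
print(one_tailed_significant(3671.132, 0.000, +1))  # True
# Jan dummy: coefficient 547.215, p-value 0.134, so p/2 = 0.067 > 0.05
print(one_tailed_significant(547.215, 0.134, +1))   # False
```

As the appendix notes, relaxing `alpha` to 0.10 is sometimes reasonable; with that looser standard the January and November dummies would pass.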

The first 15 months of data for this situation are shown in Table 8A.1. The regression results are shown in Table 8A.2.

The Results

From the regression output in Table 8A.2 you see some interesting results. First, the model follows the logic you would expect, with negative coefficients for GP and CE as well as positive coefficients for all 11 seasonal dummy variables. You also see from the F-test that, as a whole, the regression is very significant because the significance level (p-value) for F is 0.000. This means that you can be well over 95 percent confident that there is a significant relationship between MRO and the 13 independent variables in the model.

The t-ratios are all large enough that p/2 is less than the desired level of significance of 0.05 except for January and November. Recall that sometimes it is reasonable for you to drop your confidence level to 90 percent (a significance level of 10 percent, or 0.10). The coefficients for January and November satisfy this less stringent requirement. Look at the “Adjusted R-square” in Table 8A.2. It is 0.830, which tells you that this model explains about 83 percent of the variation in MRO for Stoke’s Lodge. This is much better than the 20.1 percent you saw when seasonality was not considered.

In Table 8A.2, you also see that the DW statistic of 1.883 is now much closer to 2.0, the ideal value. Based on the abbreviated DW table in an appendix to Chapter 4, you would be restricted to using the row for 40 observations and the column for 10 independent variables. Doing so, you would conclude that the test is indeterminate. If you look online you can find more extensive DW tables.1 For n = 100 and k = 13 you would find dl = 1.393 and du = 1.974 in which case the result is still indeterminate.

To evaluate the possibility of multicollinearity, you need to look at the correlation matrix shown in Table 8A.3. There you see that all of the correlations are quite small, so there is no multicollinearity problem with this model. You may wonder why all the correlations for pairs of monthly dummy variables are not the same. This is because the data do not contain an equal number of observations for every month: January through May each appear one more time than the other seven months.

Table 8A.1 The first 15 months of data. Notice the pattern of the values for the seasonal dummy variables. For each month in the “Date” column there is a one in the corresponding column representing that month

Date      MRO      GP     CE    Jan  Feb  Mar  Apr  May  June  July  Aug  Sept  Oct  Nov
Jan-02    6,575    1.35   5.3    1    0    0    0    0    0     0     0    0     0    0
Feb-02    7,614    1.46   5.2    0    1    0    0    0    0     0     0    0     0    0
Mar-02    7,565    1.55   5.2    0    0    1    0    0    0     0     0    0     0    0
Apr-02    7,940    1.45   5.2    0    0    0    1    0    0     0     0    0     0    0
May-02    7,713    1.52   5.1    0    0    0    0    1    0     0     0    0     0    0
Jun-02    9,110    1.81   5.0    0    0    0    0    0    1     0     0    0     0    0
Jul-02   10,408    1.57   5.0    0    0    0    0    0    0     1     0    0     0    0
Aug-02    9,862    1.45   5.0    0    0    0    0    0    0     0     1    0     0    0
Sep-02    9,718    1.60   5.1    0    0    0    0    0    0     0     0    1     0    0
Oct-02    8,354    1.57   5.8    0    0    0    0    0    0     0     0    0     1    0
Nov-02    6,442    1.56   6.6    0    0    0    0    0    0     0     0    0     0    1
Dec-02    6,379    1.46   6.4    0    0    0    0    0    0     0     0    0     0    0
Jan-03    5,585    1.51   6.4    1    0    0    0    0    0     0     0    0     0    0
Feb-03    6,032    1.49   6.4    0    1    0    0    0    0     0     0    0     0    0
Mar-03    8,739    1.43   6.3    0    0    1    0    0    0     0     0    0     0    0

Table 8A.2 Regression results for the full Stoke’s Lodge model

Regression Statistics
Multiple R              0.923        DW = 1.883
R-square                0.852
Adjusted R-square       0.830
Standard error        743.124
Observations          101

ANOVA           df      SS              MS            F         Sig F
Regression      13      277,032,357     21,310,181    38.589    0.000
Residual        87       48,044,350        552,233.9
Total          100      325,076,708

               Coefficients    Standard error   t Stat     p-Value   p/2
Intercept      10,270.912      1,065.996          9.635     0.000    0.000
GP             –1,396.275        117.891        –11.844     0.000    0.000
CE               –453.858        162.408         –2.795     0.006    0.003
Jan               547.215        361.715          1.513     0.134    0.067
Feb             1,303.684        361.737          3.604     0.001    0.000
Mar             2,705.610        362.592          7.462     0.000    0.000
Apr             2,140.244        363.438          5.889     0.000    0.000
May             3,671.132        365.245         10.051     0.000    0.000
June            3,650.990        375.260          9.729     0.000    0.000
July            4,305.826        374.846         11.487     0.000    0.000
Aug             4,500.630        375.350         11.990     0.000    0.000
Sept            3,466.594        374.798          9.249     0.000    0.000
Oct             2,550.378        372.192          6.852     0.000    0.000
Nov               547.899        371.616          1.474     0.144    0.072

Table 8A.3 The correlation matrix for all independent variables. All the correlation coefficients are quite small, so multicollinearity is not a problem

        GP     CE     Jan    Feb    Mar    Apr    May    June   July   Aug    Sept   Oct    Nov
GP      1
CE     –0.02   1
Jan    –0.08   0.02   1
Feb    –0.05   0.02  –0.10   1
Mar     0.01   0.00  –0.10  –0.10   1
Apr     0.06   0.01  –0.10  –0.10  –0.10   1
May     0.13   0.00  –0.10  –0.10  –0.10  –0.10   1
June    0.02  –0.08  –0.09  –0.09  –0.09  –0.09  –0.09   1
July    0.01  –0.07  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09   1
Aug     0.01  –0.08  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09   1
Sept    0.03  –0.06  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09   1
Oct    –0.02   0.04  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09   1
Nov    –0.05   0.10  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09  –0.09   1

The regression equation can be written based on the values in the “Coefficients” column in the regression results shown in Table 8A.2. The equation is:

MRO = 10,270.912 – 1,396.275(GP) – 453.858(CE) + 547.215(Jan) + 1,303.684(Feb) + 2,705.610(Mar) + 2,140.244(Apr) + 3,671.132(May) + 3,650.990(June) + 4,305.826(July) + 4,500.630(Aug) + 3,466.594(Sept) + 2,550.378(Oct) + 547.899(Nov)
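As a check on the arithmetic, the equation can be evaluated at one observation from Table 8A.1 (July 2002: GP = 1.57, CE = 5.0, July dummy = 1). This sketch uses the printed coefficients, so the prediction carries some rounding error:

```python
# Seasonal coefficients from Table 8A.2; only July is needed here.
# A December observation (the base month) would get no seasonal lift.
seasonal = {"July": 4305.826}

def predict_mro(gp, ce, month):
    return 10270.912 - 1396.275 * gp - 453.858 * ce + seasonal.get(month, 0.0)

print(round(predict_mro(1.57, 5.0, "July")))  # 10115, vs. the actual 10,408
```

The residual of roughly 293 rooms for this month is consistent with Figure 8A.3’s observation that some summer peaks are underestimated.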

Figure 8A.3 shows you how well the predictions from this equation fit the actual occupancy data. You see visually that the fit is quite good. There are some summer peaks that appear to be underestimated. However, overall the fit is good, and certainly better than the fit shown in Figure 8A.2 for which seasonal dummy variables were not included in the model.

Figure 8A.3 Actual and predicted values for MRO. The predicted values are shown by the dotted line, based on MRO as a function of GP, CE, and 11 seasonal (monthly) dummy variables
