Chapter 7. Regression with dummy variables

Previous chapters used quantitative data to demonstrate important statistical concepts. However, some of the data financial analysts use is qualitative (see Chapter 2 for a discussion of the distinction between qualitative and quantitative data). Dummy variables, briefly described in Chapter 2, are a way of turning qualitative variables into quantitative variables. Once the variables are quantitative, the correlation and regression techniques described in previous chapters can be used. Formally, a dummy variable is a variable that can take on only two values, 0 or 1. We will demonstrate how regression works when some of the explanatory variables are dummies using the following examples.

Example: The determinants of market capitalization

We have discussed this example in previous chapters. However, an important issue involving this data set was not discussed previously since it involved a dummy explanatory variable. By way of motivating this issue, note that most of the shares traded on the stock market are old shares in existing firms. However, many old firms will issue some new shares in addition to those already trading – what are referred to as "seasoned equity offerings" or SEOs. Furthermore, some firms that have not traded shares on the stock market in the past may decide to issue such shares now (e.g. a computer software firm owned by one individual may decide to "go public" and sell shares in order to raise money for future investment or expansion). Such shares are called "initial public offerings" or IPOs. Some researchers have argued on the basis of empirical evidence that IPOs are undervalued relative to SEOs. Accordingly, in addition to all the company characteristics we have used before, we also have a dummy variable used to investigate this possibility.

To be precise, Excel file EQUITY.XLS contains data on N = 309 US firms in 1996. All variables except the dummy variable are measured in millions of US dollars.

MARKETCAP = the total value of all shares (new and old) outstanding just after the firm issued the new shares. This is calculated as the price per share times the number of shares outstanding.
DEBT = the amount of long-term debt held by the firm.
SALES = total sales of the firm.
INCOME = net income of the firm.
ASSETS = book value of the assets of the firm.
SEO = a dummy variable that equals 1 if the new share issue is an SEO and equals 0 if it is an IPO.

Example: Explaining house prices

In the previous chapter, we worked through an extended example that investigated the factors influencing housing prices in Windsor, Canada. Recall that the explanatory variables we used in that chapter were all quantitative (e.g. lot size of property measured in square feet, the number of bathrooms). However, there are other factors that might influence housing prices that are not directly quantitative. Examples include the presence of: a driveway, air conditioning, a recreation room, a basement, and gas central heating. All these variables are Yes/No qualitative variables (e.g. Yes = the house has a driveway/No = the house does not have a driveway).

In order to carry out a regression analysis using these explanatory variables, we first need to transform them into dummy variables by changing the Yes/No into 1/0. Using the letter D to indicate dummy explanatory variables, we can define:

D₁ = 1 if the house has a driveway (= 0 if not).
D₂ = 1 if the house has a recreation room (= 0 if not).
D₃ = 1 if the house has a basement (= 0 if not).
D₄ = 1 if the house has gas central heating (= 0 if not).
D₅ = 1 if the house has air conditioning (= 0 if not).

For instance, a house with a driveway, basement and gas central heating, but no air conditioning nor recreation room would have values for these variables of D₁ = 1, D₂ = 0, D₃ = 1, D₄ = 1 and D₅ = 0. These variables (and many others) are in data set HPRICE.XLS.
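The Yes/No to 1/0 transformation described above can be sketched in a few lines of Python. This is only an illustration: the records and the field names (`driveway`, `rec_room` and so on) are assumptions made up for the example and are not the column names used in HPRICE.XLS.

```python
# Hypothetical Yes/No records; field names are illustrative assumptions,
# not the actual column names in HPRICE.XLS.
houses = [
    {"driveway": "Yes", "rec_room": "No", "basement": "Yes",
     "gas_heat": "Yes", "air_cond": "No"},
    {"driveway": "No", "rec_room": "Yes", "basement": "No",
     "gas_heat": "No", "air_cond": "Yes"},
]

# Order matches D1..D5 as defined in the text.
FEATURES = ["driveway", "rec_room", "basement", "gas_heat", "air_cond"]


def to_dummies(house):
    """Map each Yes/No answer to 1/0, returning [D1, D2, D3, D4, D5]."""
    return [1 if house[name] == "Yes" else 0 for name in FEATURES]


dummies = [to_dummies(h) for h in houses]
for row in dummies:
    print(row)
```

The first hypothetical record reproduces the house described in the text (driveway, basement and gas central heating, but no air conditioning or recreation room).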

Once qualitative explanatory variables have been transformed into dummy variables, regression can be carried out in the standard way and all the theory and intuition developed in previous chapters can be used.

Why, then, are we allocating an entire chapter to this topic? There are two answers to this question. First, regression with dummy explanatory variables is quite common and the interpretation of coefficient estimates is somewhat different. For this reason it is worthwhile discussing the interpretation in detail. Second, regression with dummy explanatory variables is closely related to another set of techniques called Analysis of Variance (or ANOVA for short). ANOVA is not used that often by financial researchers (although in the field of corporate finance it is sometimes used), but it is an extremely common tool in other social and physical sciences such as sociology, education, medical statistics and epidemiology.

While most computer software packages such as Excel have ANOVA capabilities, the terminology of ANOVA is quite different from that used by financial analysts, so ANOVA may seem confusing and unfamiliar to you (e.g. the Excel Tools/Data Analysis menu has several ANOVA choices referring to "Single factor", "Two-factor with replication", "Two-factor without replication"). What we should note here, however, is that regression with dummy explanatory variables can do anything ANOVA can. In fact, regression with dummy variables is a more general and more powerful tool than ANOVA. For instance, the terms "Single factor ANOVA" or "Two-factor ANOVA" refer to the number of dummy explanatory variables. Excel (and most common computer packages that perform ANOVA) can handle no more than two. However, Excel allows for up to 16 explanatory variables in its multiple regression facilities and, thus, can handle very complicated ANOVA models. In short, if you know how to use and understand regression, then you have no need to learn about ANOVA.

Simple regression with a dummy variable

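We begin by considering a regression model with one dummy explanatory variable, D:

$$Y_i = \alpha + \beta D_i + e_i$$

If we carry out OLS estimation of the above regression model, we obtain the estimates α̂ and β̂.

The straight-line relationship between Y and D gives a fitted value for the ith observation of:

$$\hat{Y}_i = \hat{\alpha} + \hat{\beta} D_i$$

Note, since D_i is either 0 or 1, Ŷ_i = α̂ if D_i = 0 and Ŷ_i = α̂ + β̂ if D_i = 1.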

Example: Explaining house prices (continued)

Table 7.1 gives computer output from a regression of Y = house prices on D = air conditioning dummy using data from HPRICE.XLS. Note that an examination of the P-value or the confidence interval (i.e. Lower 95%, Upper 95%) shows us that β is strongly significant. Furthermore, the coefficient estimates are α̂ = 59,885 and β̂ = 25,996, so the regression indicates that houses with air conditioning tend to be worth $25,996 more than houses without.

Table 7.1. Regression of house prices on air conditioning dummy.

	Coefficient	Standard error	t-stat	P-value	Lower 95%	Upper 95%
Intercept	59884.85	1233.50	48.55	7.1E-200	57461.84	62307.86
D	25995.74	2191.36	11.86	4.9E-29	21691.18	30300.32

However, there is another, closely related, way of thinking about regression results when the explanatory variable is a dummy. In the case of houses without air conditioning, D_i = 0 and hence Ŷ_i = 59,885. In other words, our regression model finds that houses without air conditioning are worth on average $59,885. In the case of houses with air conditioning, D_i = 1 and the regression model finds that Ŷ_i = 59,885 + 25,996 = 85,881. That is, houses with air conditioning are worth on average $85,881.

To provide more intuition, note that if we had not carried out a regression, but simply calculated the average price for houses with air conditioning, we would have found this figure to be $85,881. If we had then calculated the average price for houses without air conditioning, we would have found them to be worth $59,885. That is, we would have found exactly the same results as in the regression analysis.
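The equivalence between the simple regression and the two group averages can be checked on a small example. The sketch below uses a handful of made-up prices (not the HPRICE.XLS data) and the textbook OLS formulas for a simple regression; the function name `ols_simple` is just an illustrative choice.

```python
def ols_simple(x, y):
    """OLS estimates (alpha_hat, beta_hat) for the simple regression y = alpha + beta*x + e."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    beta_hat = sxy / sxx
    alpha_hat = mean_y - beta_hat * mean_x
    return alpha_hat, beta_hat


# Made-up data for illustration only: d = 1 if the house has air conditioning.
d = [0, 0, 0, 1, 1]
price = [50000.0, 60000.0, 70000.0, 80000.0, 90000.0]

alpha_hat, beta_hat = ols_simple(d, price)

# Group averages computed directly, without any regression.
mean_without = sum(p for di, p in zip(d, price) if di == 0) / d.count(0)
mean_with = sum(p for di, p in zip(d, price) if di == 1) / d.count(1)

print(alpha_hat, mean_without)            # intercept vs. average for D = 0
print(alpha_hat + beta_hat, mean_with)    # intercept + slope vs. average for D = 1
```

In this toy data set the intercept matches the average price of houses without air conditioning, and the intercept plus the dummy coefficient matches the average price of houses with it, which is the same pattern the text reports for Table 7.1.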

Remember, however, the discussion of the omitted variables bias in Chapter 6. The simple regression in this example is omitting many important explanatory variables. We definitely cannot use the results of this simple regression to make statements like "Adding an air conditioner to your house will raise its value by $25,996". Since air conditioners cost a few hundred dollars, the previous statement is clearly ridiculous.

Example: The determinants of market capitalization (continued)

Table 7.2 gives computer output from a regression of Y = market capitalization on SEO = the dummy variable which equals 1 for SEOs (= 0 for IPOs) from EQUITY.XLS.

Using similar reasoning as for the house price example, we can say that the companies issuing SEOs do tend to be worth more ($637.78 million more) than IPO companies and that this result is statistically significant (since the P-value is less than 0.05). However, this regression, too, may suffer from omitted variables bias. It is possible that the companies issuing SEOs have a greater market capitalization simply because they tend to be bigger, more established and more profitable than the IPO companies.

Table 7.2. Regression of market capitalization on SEO.

	Coefficient	Standard error	t-stat	P-value	Lower 95%	Upper 95%
Intercept	191.795	253.642	0.756	0.450	−307.301	690.891
SEO	637.780	296.583	2.150	0.032	54.188	1221.371

Multiple regression with dummy variables

Now, consider the multiple regression model with several dummy explanatory variables:

$$Y = \alpha + \beta_1 D_1 + \dots + \beta_k D_k + e$$

OLS estimation of this regression model and statistical analysis of the results can be carried out in the standard way. To aid in interpretation, we return to the house-pricing example.

Example: Explaining house prices (continued)

Consider the case where we have two dummy explanatory variables, D₁ = 1 if the house has a driveway (= 0 if not) and D₂ = 1 if the house has a recreation room (= 0 if not). These dummy variables implicitly classify the houses in the data set into four different groups:

Houses with a driveway and recreation room (D₁ = 1 and D₂ = 1).
Houses with a driveway, but no recreation room (D₁ = 1 and D₂ = 0).
Houses with no driveway, but with a recreation room (D₁ = 0 and D₂ = 1).
Houses with no driveway and no recreation room (D₁ = 0 and D₂ = 0).

Keep this classification in mind as we interpret Table 7.3, which contains results from a regression of house price (Y), on D₁ and D₂.

Table 7.3. Regression of house price on driveway and recreation room dummies.

	Coefficient	Standard error	t-stat	P-value	Lower 95%	Upper 95%
Intercept	47099.08	2837.62	16.60	2.42E-50	41525.02	52673.14
D1	21159.91	3062.44	6.91	1.37E-11	15144.22	27175.60
D2	16023.69	2788.63	5.75	1.52E-08	10545.86	21501.51

Putting in either 0 or 1 values for the dummy variables, we obtain the fitted values for Y for the four categories of houses:

If D₁ = 1 and D₂ = 1, then Ŷ = 47,099 + 21,160 + 16,024 = 84,283.
If D₁ = 1 and D₂ = 0, then Ŷ = 47,099 + 21,160 = 68,259.
If D₁ = 0 and D₂ = 1, then Ŷ = 47,099 + 16,024 = 63,123.
If D₁ = 0 and D₂ = 0, then Ŷ = 47,099.
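The four fitted values follow from substituting 0 or 1 for each dummy in the estimated equation. A short Python sketch of that arithmetic, using the coefficient estimates reported in Table 7.3 (the function and constant names are illustrative choices):

```python
from itertools import product

# Coefficient estimates taken from Table 7.3.
INTERCEPT = 47099.08
B_DRIVEWAY = 21159.91   # coefficient on D1
B_REC_ROOM = 16023.69   # coefficient on D2


def fitted_price(d1, d2):
    """Fitted house price for a given combination of the two dummies."""
    return INTERCEPT + B_DRIVEWAY * d1 + B_REC_ROOM * d2


# Enumerate the four groups of houses implied by two dummy variables.
for d1, d2 in product([1, 0], repeat=2):
    print(f"D1 = {d1}, D2 = {d2}: fitted price = {fitted_price(d1, d2):,.0f}")
```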

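In short, multiple regression with dummy variables may be used to classify the houses into different groups and to find average house prices for each group. Alternatively, results may be presented directly as coefficient estimates. For instance, the estimates in Table 7.3 indicate that houses with a driveway tend to be worth $21,160 more than similar houses without one, and houses with a recreation room tend to be worth $16,024 more than similar houses without one.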

Exercise 7.3

For this question use Y = the price of a house and the dummy variables D₁ = 1 if the house has a driveway (= 0 otherwise) and D₂ = 1 if the house has a recreation room (= 0 otherwise) from the house price example (it can be obtained from HPRICE.XLS). Without using regression techniques, calculate the average price of the four types of houses listed in the previous example (e.g. a house with a driveway and a rec. room, etc.). Hint: What do you obtain if you multiply a dummy variable by Y? How do these average price numbers relate to the regression coefficients and results in the previous example?

Exercise 7.4

For this question use data set HPRICE.XLS and the five dummy variables, D₁ to D₅, listed at the beginning of the chapter (i.e. the dummy variables for whether a house has a driveway, recreation room, basement, gas central heating and air conditioning).

With five dummy variables, how many classes of houses are possible? (e.g. houses with a driveway, recreation room, basement and gas central heating but no air conditioning comprise one class.) What implications does this have for interpreting regression results as in the previous example?
How would you calculate the number of houses in each group using a computer package like Excel? For instance, of the 546 houses in the data set, how many have a driveway, gas central heating and air conditioning, but no recreation room and no basement?
Run a regression of Y = house price on the five dummies.
Discuss the statistical significance of the explanatory variables.
Calculate the average price for a few chosen types of housing (e.g. those with a driveway, recreation room and basement but no gas central heating and no air conditioning).
Which house characteristic tends to raise the price of a house the most?

Multiple regression with both dummy and non-dummy explanatory variables

In the previous discussion, we have assumed that all the explanatory variables are dummies but, in practice, you may often have a mix of different types of explanatory variables. The simplest such case is one where there is one dummy variable (D) and one quantitative explanatory variable (X) in a regression:

$$Y_i = \alpha + \beta_1 D_i + \beta_2 X_i + e_i$$

The interpretation of results from such a regression can be illustrated in the context of an example.

Example: Explaining house prices (continued)

If we regress Y = house price on D = air conditioner dummy and X = lot size, we obtain α̂ = 32,693, β̂₁ = 20,175 and β̂₂ = 5.638.

Here things are not quite so simple since we obtain Ŷ_i = 52,868 + 5.638X_i if D_i = 1 (i.e. the ith house has an air conditioner) and Ŷ_i = 32,693 + 5.638X_i if D_i = 0 (i.e. the house does not have an air conditioner). In other words, there are two different regression lines depending on whether the house has an air conditioner or not. Contrast this point with the discussion in the example above where we had only one dummy explanatory variable. In that case, the regression implied that the average price of the house differed between houses with and without air conditioners. Here we are saying a wholly different regression line exists. In other words, we cannot simply state (as we did in previous examples in this chapter) what the average value of different groups of houses will be.

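We can, however, say that the coefficient on the dummy, 20,175 (the difference between the two intercepts, 52,868 and 32,693), measures the extra value associated with an air conditioner: houses with an air conditioner tend to be worth $20,175 more than houses with the same lot size but no air conditioner.

It is worthwhile to examine more closely the two different regression lines that exist for houses with and without air conditioners. Note that they both have the same slope, 5.638, and differ only in their intercepts. In other words, the two regression lines are parallel.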
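To make the "two lines, one slope" point concrete, the sketch below evaluates both fitted lines at a few lot sizes. The intercepts (32,693 and 52,868) and the slope (5.638) are the values quoted in the text; the dummy coefficient is taken as the difference between the two intercepts, and the lot sizes are arbitrary illustrative values.

```python
# Values quoted in the text for the regression of price on D and lot size.
ALPHA = 32693.0               # intercept of the line for D = 0
BETA_D = 52868.0 - 32693.0    # implied coefficient on the dummy (difference in intercepts)
BETA_X = 5.638                # common slope on lot size


def fitted_price(d, lot_size):
    """Fitted price for a house with dummy value d and the given lot size (square feet)."""
    return ALPHA + BETA_D * d + BETA_X * lot_size


# Arbitrary lot sizes, chosen only for illustration.
lot_sizes = [3000, 5000, 8000]
gaps = [fitted_price(1, x) - fitted_price(0, x) for x in lot_sizes]

for x, gap in zip(lot_sizes, gaps):
    print(f"lot size {x}: with AC {fitted_price(1, x):,.0f}, "
          f"without AC {fitted_price(0, x):,.0f}, gap {gap:,.0f}")
```

Because the lines are parallel, the gap between them is the same at every lot size and equals the dummy coefficient.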

We can extend the previous discussion to the case of many dummy and non-dummy explanatory variables. An example having two dummy and two non-dummy explanatory variables is the following regression model:

$$Y = \alpha + \beta_1 D_1 + \beta_2 D_2 + \beta_3 X_1 + \beta_4 X_2 + e$$

The interpretation of results from this regression model combines elements from all the previous examples in this chapter.

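Example: Explaining house prices (continued)

If we regress Y = house price on D₁ = dummy variable for driveway, D₂ = dummy variable for recreation room, X₁ = lot size and X₂ = number of bedrooms we obtain α̂ = −2,736, β̂₁ = 12,598, β̂₂ = 10,969, β̂₃ = 5.197 and β̂₄ = 10,562. Putting in either 0 or 1 values for the dummy variables, we obtain:

If D₁ = 1 and D₂ = 1, then Ŷ = 20,831 + 5.197X₁ + 10,562X₂. This is the regression line for houses with a driveway and a recreation room.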
If D₁ = 1 and D₂ = 0, then Ŷ = 9,862 + 5.197X₁ + 10,562X₂. This is the regression line for houses with a driveway but no recreation room.
If D₁ = 0 and D₂ = 1, then Ŷ = 8,233 + 5.197X₁ + 10,562X₂. This is the regression line for houses with a recreation room but no driveway.
If D₁ = 0 and D₂ = 0, then Ŷ = −2,736 + 5.197X₁ + 10,562X₂. This is the regression line for houses with no driveway and no recreation room.

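That is, with two dummy variables we have four different regression lines. All of these lines have the same slopes but different intercepts. The coefficients on the dummy variables, 12,598 for the driveway dummy and 10,969 for the recreation room dummy, measure the differences between these intercepts.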
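A small Python sketch can reproduce the intercepts of the four regression lines and evaluate one of them. The intercept for houses with neither feature (−2,736), the driveway coefficient (12,598) and the slopes (5.197 and 10,562) are quoted in the text; the recreation room coefficient is inferred here as 8,233 − (−2,736) = 10,969, and the example house (4,000 square feet, 3 bedrooms) is hypothetical.

```python
from itertools import product

ALPHA = -2736.0     # intercept for houses with no driveway and no recreation room
B_D1 = 12598.0      # driveway dummy coefficient (quoted in the text)
B_D2 = 10969.0      # recreation room coefficient, inferred as 8,233 - (-2,736)
B_X1 = 5.197        # slope on lot size
B_X2 = 10562.0      # slope on number of bedrooms


def group_intercept(d1, d2):
    """Intercept of the regression line for the group defined by (d1, d2)."""
    return ALPHA + B_D1 * d1 + B_D2 * d2


def fitted_price(d1, d2, lot_size, bedrooms):
    """Fitted price: group-specific intercept plus the common slopes."""
    return group_intercept(d1, d2) + B_X1 * lot_size + B_X2 * bedrooms


intercepts = {(d1, d2): group_intercept(d1, d2) for d1, d2 in product([0, 1], repeat=2)}
for group, value in intercepts.items():
    print(group, value)

# Hypothetical house: driveway, no recreation room, 4,000 sq ft lot, 3 bedrooms.
example = fitted_price(1, 0, 4000, 3)
print(round(example))
```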

The following are a few of the types of verbal statements that we can make about the regression results:

"Houses with driveways tend to be worth $12,598 more than similar houses with no driveway".
"If we consider houses with the same number of bedrooms, then adding an extra square foot of lot size will tend to increase the price of a house by $5.20".
"An extra bedroom will tend to add $10,562 to the value of a house, ceteris paribus".

We should stress, however, that all such statements assume that omitted variables bias is not a problem in the regression. Furthermore, statements which imply causality (e.g. "adding an extra square foot of lot size will tend to increase the price of the house by $5.20") are only valid if it is truly the case that the explanatory variable causes the dependent variable (see Chapters 4 and 6 for further discussion of causality in regression).

Exercise 7.5

For this question use data set HPRICE.XLS, the five dummy variables, D₁ to D₅, listed in Exercise 7.4 and the following four non-dummy explanatory variables:

X₁ = the lot size of the property (in square feet)
X₂ = the number of bedrooms
X₃ = the number of bathrooms
X₄ = the number of storeys (excluding the basement).

Run a regression of Y on D₁, ..., D₅, X₁, ..., X₄.
Discuss which variables are statistically significant.
Which of the characteristics measured by the dummies has the largest effect on housing prices?
Choose particular configurations of the dummy variables (e.g. one indicating a house with: a driveway, no recreation room, a basement, no gas central heating and no air conditioner) and write out the formula for the regression line.
Discuss results relating to the non-dummy explanatory variables, paying particular reference to the ceteris paribus conditions.

Example: The determinants of market capitalization (continued)

Table 7.4 presents results from a regression of market capitalization on a regular explanatory variable, ASSETS, and the SEO dummy variable. Somewhat surprisingly, it seems that the book value of assets has little effect on market capitalization (since this variable is statistically insignificant). The SEO dummy variable is positive and significant. Hence, our finding from the simple regression that IPOs did seem to be undervalued is holding up in this multiple regression.

Table 7.4. Regression of market capitalization on ASSETS and SEO.

	Coefficient	Standard error	t-stat	P-value	Lower 95%	Upper 95%
Intercept	191.728	254.047	0.755	0.451	−308.172	691.628
ASSETS	0.0008	0.006	0.153	0.879	−0.01	0.012
SEO	635.79	297.340	2.138	0.033	50.7	1220.881

Interacting dummy and non-dummy variables

We used the dummy variables above in a way that allowed for different intercepts in the regression line, but the slope of the regression line was always the same. We can, however, allow for different slopes by interacting dummy and non-dummy variables. To understand this, consider the following regression model:

$$Y = \alpha + \beta_1 D + \beta_2 X + \beta_3 Z + e$$

D and X are dummy and non-dummy explanatory variables, as above. However, here we have added a new variable Z into the regression and we define Z = DX.

How do we interpret results from a regression of Y on D, X and Z? This question can be answered by noting that Z is either 0 (for observations with D = 0) or X (for observations with D = 1). If, as before, we consider the fitted regression lines with D = 0 and D = 1 we obtain:

If D = 1, then Ŷ = (α̂ + β̂₁) + (β̂₂ + β̂₃)X.
If D = 0, then Ŷ = α̂ + β̂₂X.

In other words, two different regression lines corresponding to D = 0 and D = 1 exist and have different intercepts and slopes. One implication is that the marginal effect of X on Y is different for D = 0 and D = 1. In a written report, you could write up each of the regression lines separately using the terminology and interpretation of Chapters 4 and 6.
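The construction of the interaction variable and the two implied regression lines can be sketched as follows. All numbers here are hypothetical (they are not estimates from HPRICE.XLS), and the function names are illustrative; the point is only to show how Z = DX is built and how the coefficients on D and Z shift the intercept and the slope.

```python
def interaction(d, x):
    """Build Z = D * X observation by observation."""
    return [di * xi for di, xi in zip(d, x)]


def implied_lines(alpha, beta_d, beta_x, beta_z):
    """Return (intercept, slope) of the fitted line in X for D = 0 and for D = 1."""
    return {
        0: (alpha, beta_x),                      # D = 0: Z drops out
        1: (alpha + beta_d, beta_x + beta_z),    # D = 1: Z equals X
    }


# Hypothetical observations: dummy and lot size.
d = [0, 1, 0, 1]
x = [3000, 4500, 5200, 6100]
z = interaction(d, x)
print(z)

# Hypothetical coefficient estimates, for illustration only.
lines = implied_lines(alpha=30000.0, beta_d=15000.0, beta_x=5.0, beta_z=1.5)
print(lines[0])
print(lines[1])
```

With these made-up values the marginal effect of X is 5.0 for the D = 0 group and 6.5 for the D = 1 group, which is the sense in which the interaction term allows the slope to differ across groups.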

Exercise 7.6

For this question use data set HPRICE.XLS, the five dummy variables, D₁ to D₅, and the four non-dummies X₁, ..., X₄ discussed in Exercise 7.5. Experiment with different configurations of these explanatory variables with some interaction terms (e.g. try including 10 explanatory variables: D₁ to D₅ and the four non-dummies X₁, ..., X₄ plus an interaction term D₁X₂, say). Can you find any interaction terms (i.e. Zs) that are statistically significant? Explain in words what your findings are.

Exercise 7.7

For this question use data set EQUITY.XLS containing the SEO dummy, DEBT, SALES, INCOME and ASSETS (see the example at the beginning of this chapter for precise definitions of variables). Construct four new explanatory variables which interact the SEO dummy with each of the four other explanatory variables. Using these nine explanatory variables (i.e. the original five explanatory variables and four interactions), construct and justify a regression model. Begin by running a regression with all explanatory variables, then experiment with dropping out insignificant variables until you find a specification where all explanatory variables are significant (and you are not omitting any significant variables). Write a short report interpreting the results from the regression model you end up with.

What if the dependent variable is a dummy?

Thus far, we have focussed on the case where the explanatory variables can be dummies. However, in some cases the dependent variable may be a dummy. For instance, a researcher in the field of corporate finance might be interested in investigating why some companies go bankrupt and others do not, or why some raise money by issuing equity and others use debt, etc. An empirical analysis might involve collecting data from many different companies. Potential explanatory variables might include company characteristics such as debt, sales, profit, and so on. The dependent variable, however, would be qualitative (e.g. for each company data would be of the form "It went bankrupt"/"It did not go bankrupt" or "The company expanded through debt financing"/"The company did not use debt to finance its expansion") and the researcher would have to create a dummy dependent variable.

The techniques for working with dummy dependent variables^[39] are beyond the scope of this book. However, there are two facts worth noting:

There are some problems with using OLS estimation in this case. However, these problems are not enormous, so that OLS estimation might be adequate in many circumstances.
Nevertheless, there are better estimation methods than OLS. The two main alternatives are termed Logit and Probit. Computer software packages with only basic statistical capabilities (e.g. Excel) do not have the capability to perform these estimation methods. Thus, if you ever need to do extensive work with dummy dependent variable models, you will have to use another software package (e.g. Stata).
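The text does not spell out the problems with OLS here. One commonly cited issue is that a straight line fitted to a 0/1 dependent variable can produce fitted "probabilities" below 0 or above 1, whereas the Logit model passes the linear index through the logistic function, which always lies between 0 and 1. The sketch below illustrates only this one point; the coefficients and the explanatory variable value are hypothetical and are not estimated from any data set.

```python
import math


def linear_probability(alpha, beta, x):
    """Fitted value from a straight-line (OLS-style) model of a 0/1 outcome."""
    return alpha + beta * x


def logit_probability(alpha, beta, x):
    """Logit fitted probability: the logistic function of the linear index."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))


# Hypothetical coefficients and a hypothetical explanatory variable value.
x = 60.0
p_linear = linear_probability(0.1, 0.02, x)    # can fall outside [0, 1]
p_logit = logit_probability(-2.0, 0.05, x)     # always strictly between 0 and 1

print(p_linear)
print(p_logit)
```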

Chapter summary

Dummy variables can take on a value of either 0 or 1. They are often used with qualitative data.
The statistical techniques associated with the use of dummy explanatory variables are exactly the same as with non-dummy explanatory variables.
A regression involving only dummy explanatory variables implicitly classifies the observations into various groups (e.g. houses with air conditioning and those without). Interpretation of results is aided by careful consideration of what the groups are.
A regression involving dummy and non-dummy explanatory variables implicitly classifies the observations into groups and says that each group will have a regression line with a different intercept. All these regression lines have the same slope.
A regression involving dummy, non-dummy and interaction (i.e. dummy times non-dummy variables) explanatory variables implicitly classifies the observations into groups and says that each group will have a different regression line with a different intercept and slope.
If the dependent variable is a dummy, then other techniques which are not covered in this book should be used.

^[39]To introduce some jargon, such models are called "limited dependent variable" models. That is, the dependent variable can take on a limited range of values.
