Chapter 7. Regression with dummy variables

Previous chapters used quantitative data to demonstrate important statistical concepts. However, some of the data financial analysts use is qualitative (see Chapter 2 for a discussion of the distinction between qualitative and quantitative data). Dummy variables, briefly described in Chapter 2, are a way of turning qualitative variables into quantitative variables. Once the variables are quantitative, the correlation and regression techniques described in previous chapters can be used. Formally, a dummy variable is a variable that can take on only two values, 0 or 1. We will demonstrate how regression works when some of the explanatory variables are dummies using the following examples.

Once qualitative explanatory variables have been transformed into dummy variables, regression can be carried out in the standard way and all the theory and intuition developed in previous chapters can be used.

Why, then, are we allocating an entire chapter to this topic? There are two answers to this question. First, regression with dummy explanatory variables is quite common and the interpretation of coefficient estimates is somewhat different. For this reason it is worthwhile discussing the interpretation in detail. Second, regression with dummy explanatory variables is closely related to another set of techniques called Analysis of Variance (or ANOVA for short). ANOVA is not used that often by financial researchers (although in the field of corporate finance it is sometimes used), but it is an extremely common tool in other social and physical sciences such as sociology, education, medical statistics and epidemiology.

While most computer software packages such as Excel have ANOVA capabilities, the terminology of ANOVA is quite different from that used by financial analysts, so ANOVA may seem confusing and unfamiliar to you (e.g. the Excel Tools/Data Analysis menu has several ANOVA choices referring to "Single factor", "Two-factor with replication", "Two-factor without replication"). What we should note here, however, is that regression with dummy explanatory variables can do anything ANOVA can. In fact, regression with dummy variables is a more general and more powerful tool than ANOVA. For instance, the terms "Single factor ANOVA" or "Two-factor ANOVA" refer to the number of dummy explanatory variables. Excel (and most common computer packages that perform ANOVA) can handle no more than two. However, Excel allows for up to 16 explanatory variables in its multiple regression facilities and, thus, can handle very complicated ANOVA models. In short, if you know how to use and understand regression, then you have no need to learn about ANOVA.
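The link between ANOVA and regression on a dummy can be checked numerically in the simplest case. The following is a minimal sketch in plain Python, using made-up numbers (the two groups and their values are assumptions for illustration, not data from this book). It computes the single-factor ANOVA F-statistic for two groups, then the F-statistic from regressing the same observations on a dummy variable; the two should coincide.

```python
from statistics import mean

# Hypothetical observations for two groups (illustrative values only).
group_0 = [50.0, 62.0, 58.0, 55.0]        # observations with D = 0
group_1 = [80.0, 90.0, 85.0, 78.0, 88.0]  # observations with D = 1

y = group_0 + group_1
d = [0] * len(group_0) + [1] * len(group_1)
n = len(y)
y_bar = mean(y)

# Single-factor ANOVA: between-group versus within-group variation.
ss_between = sum(len(g) * (mean(g) - y_bar) ** 2 for g in (group_0, group_1))
ss_within = sum((v - mean(g)) ** 2 for g in (group_0, group_1) for v in g)
f_anova = (ss_between / (2 - 1)) / (ss_within / (n - 2))

# Simple regression of y on the dummy d, using the usual OLS formulas.
d_bar = mean(d)
beta_hat = (sum((di - d_bar) * (yi - y_bar) for di, yi in zip(d, y))
            / sum((di - d_bar) ** 2 for di in d))
alpha_hat = y_bar - beta_hat * d_bar
ss_resid = sum((yi - alpha_hat - beta_hat * di) ** 2 for di, yi in zip(d, y))
ss_total = sum((yi - y_bar) ** 2 for yi in y)
r_squared = 1 - ss_resid / ss_total

# F-statistic for the regression with one explanatory variable.
f_regression = (r_squared / 1) / ((1 - r_squared) / (n - 2))

print(f"ANOVA F = {f_anova:.4f}, regression F = {f_regression:.4f}")
```

With more than two groups the same comparison would need one dummy per additional group, which is where the multiple regression framework becomes the more convenient tool.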

Simple regression with a dummy variable

We begin by considering a regression model with one dummy explanatory variable, D:

$$Y_i = \alpha + \beta D_i + e_i$$

If we carry out OLS estimation of the above regression model, we obtain the estimates $\hat{\alpha}$ and $\hat{\beta}$.

The straight-line relationship between Y and D gives a fitted value for the ith observation of:

$$\hat{Y}_i = \hat{\alpha} + \hat{\beta} D_i$$

Note that, since $D_i$ is either 0 or 1, the fitted value can take only two values: $\hat{Y}_i = \hat{\alpha}$ when $D_i = 0$, and $\hat{Y}_i = \hat{\alpha} + \hat{\beta}$ when $D_i = 1$.
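Because the fitted value takes only two values, the OLS intercept turns out to be the average of Y over the observations with D = 0, and the slope is the difference between the two group averages. The following is a minimal sketch in plain Python that checks this; the house prices and the air-conditioning dummy are made-up numbers, and the helper name `simple_ols` is hypothetical.

```python
from statistics import mean

# Hypothetical data: house prices (in thousands) and an air-conditioning dummy.
prices = [50.0, 62.0, 58.0, 80.0, 90.0, 85.0]
has_ac = [0, 0, 0, 1, 1, 1]

def simple_ols(y, x):
    """Return (alpha_hat, beta_hat) for the regression y = alpha + beta * x + e."""
    x_bar, y_bar = mean(x), mean(y)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    beta_hat = s_xy / s_xx
    alpha_hat = y_bar - beta_hat * x_bar
    return alpha_hat, beta_hat

alpha_hat, beta_hat = simple_ols(prices, has_ac)

# Group averages computed directly, without any regression.
mean_without = mean(p for p, d in zip(prices, has_ac) if d == 0)
mean_with = mean(p for p, d in zip(prices, has_ac) if d == 1)

print(f"alpha_hat = {alpha_hat:.3f}, mean price without AC = {mean_without:.3f}")
print(f"alpha_hat + beta_hat = {alpha_hat + beta_hat:.3f}, "
      f"mean price with AC = {mean_with:.3f}")
```

In this reading, the coefficient on the dummy measures how much higher (or lower) the average of Y is for the group with D = 1.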

Multiple regression with dummy variables

Now, consider the multiple regression model with several dummy explanatory variables:

$$Y_i = \alpha + \beta_1 D_{1i} + \beta_2 D_{2i} + \cdots + \beta_k D_{ki} + e_i$$

OLS estimation of this regression model and statistical analysis of the results can be carried out in the standard way. To aid in interpretation, we return to the house-pricing example.
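To make the group interpretation concrete, here is a minimal sketch in plain Python with two dummies. All numbers are made up, the variable names `has_ac` and `has_driveway` are illustrative assumptions, and the small `ols` helper (which solves the normal equations by Gauss-Jordan elimination) is hypothetical, written out only so that the arithmetic is visible. Two dummies classify the houses into four groups, and the fitted price for each group is a sum of the relevant coefficients.

```python
def ols(columns, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination.

    `columns` is a list of regressor columns, including a column of ones.
    """
    k = len(columns)
    # Augmented matrix [X'X | X'y].
    a = [[sum(ci * cj for ci, cj in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(ci * yi for ci, yi in zip(columns[i], y))]
         for i in range(k)]
    for p in range(k):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        best = max(range(p, k), key=lambda r: abs(a[r][p]))
        a[p], a[best] = a[best], a[p]
        pivot = a[p][p]
        a[p] = [v / pivot for v in a[p]]
        for r in range(k):
            if r != p:
                factor = a[r][p]
                a[r] = [vr - factor * vp for vr, vp in zip(a[r], a[p])]
    return [a[i][k] for i in range(k)]

# Hypothetical data: prices in thousands and two dummy variables.
has_ac = [0, 0, 1, 1, 0, 0, 1, 1]
has_driveway = [0, 0, 0, 0, 1, 1, 1, 1]
price = [48.0, 52.0, 70.0, 74.0, 60.0, 62.0, 83.0, 85.0]
const = [1.0] * len(price)

alpha_hat, beta1_hat, beta2_hat = ols([const, has_ac, has_driveway], price)

# Each combination of dummy values defines one group of houses.
for d1 in (0, 1):
    for d2 in (0, 1):
        fitted = alpha_hat + beta1_hat * d1 + beta2_hat * d2
        print(f"D1={d1}, D2={d2}: fitted price = {fitted:.2f}")
```

The intercept is the fitted price for the group with both dummies equal to 0, and each coefficient is the amount added when the corresponding characteristic is present, holding the other fixed.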

Multiple regression with both dummy and non-dummy explanatory variables

In the previous discussion, we have assumed that all the explanatory variables are dummies but, in practice, you may often have a mix of different types of explanatory variables. The simplest such case is one where there is one dummy variable (D) and one quantitative explanatory variable (X) in a regression:

$$Y_i = \alpha + \beta_1 D_i + \beta_2 X_i + e_i$$

The interpretation of results from such a regression can be illustrated in the context of an example.
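A minimal sketch of this case in plain Python follows. The data are made up (lot size in thousands of square feet, an air-conditioning dummy, price in thousands), and the `ols` helper is a hypothetical stand-in for whatever regression software you use. The point it illustrates is that the dummy shifts the intercept while the slope on X is shared by both groups.

```python
def ols(columns, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(columns)
    # Augmented matrix [X'X | X'y].
    a = [[sum(ci * cj for ci, cj in zip(columns[i], columns[j])) for j in range(k)]
         + [sum(ci * yi for ci, yi in zip(columns[i], y))]
         for i in range(k)]
    for p in range(k):
        # Partial pivoting for numerical stability.
        best = max(range(p, k), key=lambda r: abs(a[r][p]))
        a[p], a[best] = a[best], a[p]
        pivot = a[p][p]
        a[p] = [v / pivot for v in a[p]]
        for r in range(k):
            if r != p:
                factor = a[r][p]
                a[r] = [vr - factor * vp for vr, vp in zip(a[r], a[p])]
    return [a[i][k] for i in range(k)]

# Hypothetical data (illustrative values only).
lot_size = [3.0, 4.0, 5.0, 6.0, 3.5, 4.5, 5.5, 6.5]
has_ac = [0, 0, 0, 0, 1, 1, 1, 1]
price = [45.0, 52.0, 58.0, 66.0, 62.0, 70.0, 75.0, 84.0]
const = [1.0] * len(price)

alpha_hat, beta1_hat, beta2_hat = ols([const, has_ac, lot_size], price)

# Two parallel fitted lines: same slope, different intercepts.
print(f"D = 0: fitted price = {alpha_hat:.2f} + {beta2_hat:.2f} * lot_size")
print(f"D = 1: fitted price = {alpha_hat + beta1_hat:.2f} + {beta2_hat:.2f} * lot_size")
```

Here the coefficient on the dummy is the vertical distance between the two parallel lines, and the coefficient on X is the marginal effect of lot size for houses in either group.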

We can extend the previous discussion to the case of many dummy and non-dummy explanatory variables. An example having two dummy and two non-dummy explanatory variables is the following regression model:

$$Y_i = \alpha + \beta_1 D_{1i} + \beta_2 D_{2i} + \beta_3 X_{1i} + \beta_4 X_{2i} + e_i$$

The interpretation of results from this regression model combines elements from all the previous examples in this chapter.

Interacting dummy and non-dummy variables

We used the dummy variables above in a way that allowed for different intercepts in the regression line, but the slope of the regression line was always the same. We can, however, allow for different slopes by interacting dummy and non-dummy variables. To understand this consider the following regression model:

$$Y_i = \alpha + \beta_1 D_i + \beta_2 X_i + \beta_3 Z_i + e_i$$

D and X are dummy and non-dummy explanatory variables, as above. However, here we have added a new variable Z into the regression and we define Z = DX.

How do we interpret results from a regression of Y on D, X and Z? This question can be answered by noting that Z is either 0 (for observations with D = 0) or X (for observations with D = 1). If, as before, we consider the fitted regression lines with D = 0 and D = 1 we obtain:

  • If D = 1, then $\hat{Y} = (\hat{\alpha} + \hat{\beta}_1) + (\hat{\beta}_2 + \hat{\beta}_3) X$

  • If D = 0, then $\hat{Y} = \hat{\alpha} + \hat{\beta}_2 X$

In other words, two different regression lines corresponding to D = 0 and D = 1 exist and have different intercepts and slopes. One implication is that the marginal effect of X on Y is different for D = 0 and D = 1. In a written report, you could write up each of the regression lines separately using the terminology and interpretation of Chapters 4 and 6.
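One way to see why the two lines can be written up separately is that, when the dummy and its interaction with X are both included, the OLS estimates are the same as those obtained by fitting a separate simple regression to each group. The sketch below, in plain Python with made-up data and a hypothetical `simple_ols` helper, fits the two groups separately, converts the results into the coefficients on the constant, D, X and Z = DX, and then checks the OLS normal equations (residuals uncorrelated with every regressor) for the combined model.

```python
from statistics import mean

# Hypothetical data (illustrative values only).
lot_size = [3.0, 4.0, 5.0, 6.0, 3.5, 4.5, 5.5, 6.5]
has_ac = [0, 0, 0, 0, 1, 1, 1, 1]
price = [45.0, 52.0, 58.0, 66.0, 62.0, 70.0, 75.0, 84.0]

def simple_ols(y, x):
    """Return (intercept, slope) for the regression y = a + b * x + e."""
    x_bar, y_bar = mean(x), mean(y)
    slope = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
             / sum((xi - x_bar) ** 2 for xi in x))
    return y_bar - slope * x_bar, slope

def fit_group(d_value):
    """Fit a separate regression line to the observations with D == d_value."""
    xs = [x for x, d in zip(lot_size, has_ac) if d == d_value]
    ys = [y for y, d in zip(price, has_ac) if d == d_value]
    return simple_ols(ys, xs)

a0, b0 = fit_group(0)  # intercept and slope for D = 0
a1, b1 = fit_group(1)  # intercept and slope for D = 1

# Implied coefficients of the regression of Y on D, X and Z = D * X.
alpha_hat = a0
beta1_hat = a1 - a0   # difference in intercepts
beta2_hat = b0
beta3_hat = b1 - b0   # difference in slopes

# Check the normal equations: residuals must be orthogonal to each regressor.
z = [d * x for d, x in zip(has_ac, lot_size)]
residuals = [y - (alpha_hat + beta1_hat * d + beta2_hat * x + beta3_hat * zi)
             for y, d, x, zi in zip(price, has_ac, lot_size, z)]
regressors = {"const": [1.0] * len(price), "D": has_ac, "X": lot_size, "Z": z}
moments = {name: sum(c * r for c, r in zip(col, residuals))
           for name, col in regressors.items()}

print(f"D = 0 line: {a0:.2f} + {b0:.2f} * X")
print(f"D = 1 line: {a1:.2f} + {b1:.2f} * X")
print("Sum of regressor * residual:", {k: round(v, 10) for k, v in moments.items()})
```

The coefficient on Z is therefore the difference between the two slopes, so its t-statistic or P-value addresses whether the marginal effect of X differs between the groups.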

What if the dependent variable is a dummy?

Thus far, we have focussed on the case where the explanatory variables can be dummies. However, in some cases the dependent variable may be a dummy. For instance, a researcher in the field of corporate finance might be interested in investigating why some companies go bankrupt and others do not, or why some raise money by issuing equity and others use debt, etc. An empirical analysis might involve collecting data from many different companies. Potential explanatory variables might include company characteristics such as debt, sales, profit, and so on. The dependent variable, however, would be qualitative (e.g. for each company data would be of the form "It went bankrupt"/"It did not go bankrupt" or "The company expanded through debt financing"/"The company did not use debt to finance its expansion") and the researcher would have to create a dummy dependent variable.

The techniques for working with dummy dependent variables[39] are beyond the scope of this book. However, there are two facts worth noting:

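  1. There are some problems with using OLS estimation in this case (for instance, the fitted values, which are naturally interpreted as probabilities, can fall below 0 or above 1). However, these problems are not enormous, so that OLS estimation might be adequate in many circumstances.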

  2. Nevertheless, there are better estimation methods than OLS. The two main alternatives are termed Logit and Probit. Software packages with only basic statistical capabilities (e.g. Excel) cannot perform these estimation methods. Thus, if you ever need to do extensive work with dummy dependent variable models, you will have to use another software package (e.g. Stata).
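Although Logit is not covered in this book, a rough idea of what such software does may be helpful. The sketch below is an illustration only, under stated assumptions: the data are made up (a debt ratio and a bankruptcy dummy for ten hypothetical companies), the model is the standard logit form P(Y = 1) = 1 / (1 + exp(-(α + βX))), and the estimates are found by maximum likelihood using Newton-Raphson iterations written out by hand in plain Python. A dedicated statistics package would be the sensible choice for real work.

```python
from math import exp

# Hypothetical data: debt ratio of ten companies and whether each went bankrupt.
debt_ratio = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
bankrupt = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

def probability(alpha, beta, x):
    """Logit model: P(Y = 1 | x) = 1 / (1 + exp(-(alpha + beta * x)))."""
    return 1.0 / (1.0 + exp(-(alpha + beta * x)))

def score(alpha, beta):
    """Gradient of the log-likelihood with respect to (alpha, beta)."""
    errors = [y - probability(alpha, beta, x)
              for x, y in zip(debt_ratio, bankrupt)]
    return sum(errors), sum(e * x for e, x in zip(errors, debt_ratio))

alpha_hat, beta_hat = 0.0, 0.0
for _ in range(25):  # Newton-Raphson iterations
    g_a, g_b = score(alpha_hat, beta_hat)
    probs = [probability(alpha_hat, beta_hat, x) for x in debt_ratio]
    weights = [p * (1 - p) for p in probs]
    # Entries of the 2x2 information matrix (minus the Hessian).
    h_aa = sum(weights)
    h_ab = sum(w * x for w, x in zip(weights, debt_ratio))
    h_bb = sum(w * x * x for w, x in zip(weights, debt_ratio))
    det = h_aa * h_bb - h_ab ** 2
    # Solve the 2x2 system (information matrix) * step = gradient in closed form.
    alpha_hat += (h_bb * g_a - h_ab * g_b) / det
    beta_hat += (h_aa * g_b - h_ab * g_a) / det

# Unlike OLS fitted values, these always lie strictly between 0 and 1.
fitted = [probability(alpha_hat, beta_hat, x) for x in debt_ratio]
for x, p in zip(debt_ratio, fitted):
    print(f"debt ratio {x:.1f}: estimated probability of bankruptcy = {p:.3f}")
```

Note that the logit coefficients are not marginal effects in the way OLS coefficients are; the effect of X on the probability depends on where along the curve it is evaluated.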

Chapter summary

  1. Dummy variables can take on a value of either 0 or 1. They are often used with qualitative data.

  2. The statistical techniques associated with the use of dummy explanatory variables are exactly the same as with non-dummy explanatory variables.

  3. A regression involving only dummy explanatory variables implicitly classifies the observations into various groups (e.g. houses with air conditioning and those without). Interpretation of results is aided by careful consideration of what the groups are.

  4. A regression involving dummy and non-dummy explanatory variables implicitly classifies the observations into groups and says that each group will have a regression line with a different intercept. All these regression lines have the same slope.

  5. A regression involving dummy, non-dummy and interaction (i.e. dummy times non-dummy variables) explanatory variables implicitly classifies the observations into groups and says that each group will have a different regression line with a different intercept and slope.

  6. If the dependent variable is a dummy, then other techniques which are not covered in this book should be used.



[39] To introduce some jargon, such models are called "limited dependent variable" models. That is, the dependent variable can take on a limited range of values.
