6 Regression analysis

Now that we know how to manage and describe data, we can move on to the fun part and analyze data for our research. As this is an introduction to Stata, we will start with a method that is very popular and also a great foundation for the large family of regression-based applications. In this chapter we will learn how to run and interpret a multiple regression. In the next chapter we will furthermore check the most important assumptions that must hold so that our analyses yield valid results.

6.1 Research question

Any good scientific research project must start with a theoretically justified research question that is relevant to the public. Finding a good research question that is adequate for the scope of a term paper (15 pages or 8,000 words can be really short!) is a challenge, yet it is worth spending some time on this task, as your project (and grade) will really benefit from a well-formulated and manageable research question. Usually you start with a certain motivation in mind and proceed with a literature review to check which questions are still open and find out where the interesting research gaps lie. You then develop a theoretical framework, based on the previous literature and the general theoretical foundations of your field.

When this is done, you should formulate testable hypotheses that can be answered with the dataset you can access. This is the second crucial step, as it is easy to lose focus and end up with hypotheses that are somewhat related to the research question, but vague and unclear. You really want to pinpoint a hypothesis, as your project will benefit from a clear and precise formulation. For a general overview of this process, refer to King et al. (1995): 3–28. Also try to formulate your research questions and hypotheses in a causal fashion, even when only observational data is available for analysis. As pointed out recently, almost all research is ultimately interested in causality, and therefore it is a good idea to spell this out explicitly (Hernán, 2018).

We imagine all this is done, so we can start with the analyses. Of course, as this book is short, we will deal with ad-hoc hypotheses to practice the methods. It should be clear that for any real seminar paper you should invest much more time in the mentioned aspects.

As we want to continue using our dataset of working women we will test the effects of certain variables on wage. We start by formulating testable hypotheses:

  1. Union members will earn more than non-union members (H1).
  2. People with more education will earn more than people with less education (H2).
  3. The higher the total work experience, the higher the wage (H3).

As you might have noticed, our hypotheses use three different kinds of variable scalings: binary (H1), ordinal (H2) and metric (H3). Our dependent variable, that is the variable we want to explain, is metric (wage). When you want to use a linear (multiple) regression analysis, this must always be the case.36

6.2 What is a regression?

Linear regression37 is a widely used statistical method to model the relationship between precisely one dependent variable (DV; the variable you want to explain or predict) and one or more independent (explanatory) variables (IVs). By including relevant control variables, researchers furthermore hope to identify the causal relationship between variables (as outlined in chapter five).

A regression produces an equation to describe the relationship between the variables in the model mathematically. Assuming we only have one IV, this would look something like

DV = β0 + β1 · IV + ϵ

where β0 is the constant (also called intercept), β1 the regression coefficient of the IV and ϵ is the error term. The error term “collects” the effect of all omitted variables that have an independent influence on your dependent variable, but are, as the term omitted describes, not captured in your model. Let’s take an example. We want to regress income on motivation to see whether income can be explained by motivation. Our equation would look like this (with made up numbers):

Income = 400 + 20 · motivation

This would mean that a person with a motivation of zero (whatever this means depends on your coding system) would earn 400, and every additional point on the motivation scale would increase the income by 20. When your motivation is 10, you would receive an income of 400 + 20 · 10 = 600.

Normally you would also include the error term in the equation, which captures all the influence that cannot be explained by the model. For example, we assume that income cannot be explained by motivation alone, as skill or intelligence seem to be further important factors that are not considered. Thus our prediction will have a certain error. Remember that in a real-world setting every model will have (hopefully only) a minor error, as all factors can never be totally accounted for.

6.3 Binary independent variable

In our first model we want to inspect the relationship between wage and being a union member. Remember that our independent variable is coded binary, where the numerical value 1 is given to people who are members. If not already open, load the data and run the regression:

sysuse nlsw88, clear              //Open dataset
regress wage i.union              //Run regression

or click Statistics → Linear models and related → Linear regression. Let’s take a look at the command first. regress is the regression command in Stata, followed by the dependent variable (the one we want to explain). Note that a regression can only have one dependent variable, but one or more independent variables, which follow directly after. We use factor variable notation to tell Stata how to deal with a binary variable. Binary, nominal or ordinal variables receive the prefix i. (optional for binary variables), which helps Stata to run the correct model. Continuous or metric variables receive the prefix c. (which is optional, but often helpful for getting a quick overview of the model).

Let’s take a look at the output.

The upper left corner of the table shows the decomposition of the explained variance. These numbers are used to calculate some other statistics, like R-squared, which is depicted on the right-hand side. Usually you do not have to care about this part of the table, as there are better indicators to assess the model; the more interesting statistics are found on the right-hand side.

Number of obs is the number of cases used in your model. As listwise deletion is the standard, only cases will be used which have complete information on every variable in the model. For example, if you use ten IVs in your model and one person only has information about nine of these (and the last one has a missing value), then this person is not used in calculating the model.
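
If you want to check in advance how many cases have complete information, a minimal sketch (using the two variables from the model above) could look like this:

misstable summarize wage union        //Overview of variables with missing values
count if !missing(wage, union)        //Cases with complete information on both variables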

F(1, 1876) is used to calculate the Prob > F value. This is an omnibus test which checks whether your model, in general, explains the variance of the dependent variable. If Prob > F is not significant (larger than 0.05), your independent variable(s) might not be related to your dependent variable at all, and you should probably refine your model. As long as this number is low, your model seems fine (as in our case here).

R-squared is the percentage of the overall variance that is explained by your model. This value is quite low and tells us that when we want to predict wages, the variable union alone is not sufficient to reach satisfying results. Usually it is not a good idea to assess the quality of a model using only the explained variance, yet it gives you a rough impression of the model fit. Keep in mind that you can still test causal mechanisms, even if R-squared is quite low. You can calculate this statistic by hand using the information on the left (751/32,613 = 0.023).
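
You can also let Stata do this calculation, as regress stores these numbers after estimation; a minimal sketch, to be typed directly after the regression command:

display e(mss)/(e(mss) + e(rss))      //Model sum of squares divided by total sum of squares
display e(r2)                         //The same value, R-squared as stored by Stata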

Adj R-squared is the adjusted R-squared, which is corrected to account for some problems that R-squared introduces. R-squared will always become larger the more controls you include, even if you introduce “nonsensical” independent variables to the model. To “punish” this, adjusted R-squared corrects for the number of explaining variables used.

Root MSE is the square root of the Mean Square Error. This value can be interpreted as follows: if you were to predict the wage of a person, using only the information in the model (that is information about the union status), you would, on average, make an error of about 4.12 $. But keep in mind that this number depends on your model and should not be compared across different models.

The more interesting numbers are in the lower part, where the coefficients and significance levels are shown. We can formulate our regression equation, which would be

Wage = 7.2 + 1.47 · Union

7.2 is the constant (or intercept), while 1.47 is the coefficient of our independent variable. As union can only take two values (0 and 1), there are only two possible results. Non-union members will have an average wage of 7.2 while union-members will have an average value of 7.2 + 1.47 = 8.67. The p-value (P>|t|) of this coefficient is below 0.05, so we know the result is significant. Thus we conclude that there is a real effect of union membership on wages, and the result is positive. Please note that this result might be spurious, as there are no control variables in the model. Formulated differently, the effect may disappear when we introduce more explanatory variables to the model. We will deal with this in model 3.
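
Instead of computing these two averages by hand, you can reproduce them from the stored coefficients; a minimal sketch, run directly after the regression:

display _b[_cons]                     //Average wage of non-union members (7.2)
display _b[_cons] + _b[1.union]       //Average wage of union members (8.67)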

6.4 Ordinal independent variable

After discussing the most simple case, we move on to a model with an ordinal independent variable. As there is no real ordinal variable in the dataset that could be used directly, we have to create one.38 We will use the metric variable years of schooling (grade) and transform it into three categories (low, medium and high education). After this is done, we use a crosstab to check whether we made any mistakes.

recode grade (0/10 = 1 "low education") ///
   (11/13 = 2 "medium education") ///
   (14/18 = 3 "high education"), generate(education)
tabulate grade education                            //Check results

Lowering the measurement level of a variable should always be justified on theoretical grounds, as information is lost in the process, yet it can be a valuable technique to adapt available variables to theoretical concepts. After this is done, we run our model using the newly created variable.

regress wage i.education

Stata’s factor variable notation makes our life much easier. Usually any ordinal or nominal variable has to be recoded into dummy variables before it can be used in a regression. For example, we would have to recode our variable education into two binary variables, “medium_education” and “high_education”. One category (in this case, low education) would be our reference category. Luckily we can skip this step by using a Stata shortcut.
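
For illustration, the manual recoding could look roughly like this (the dummy names are the ones made up above); the regression on these dummies should give the same coefficients as the factor-variable version:

generate medium_education = (education == 2) if !missing(education)
generate high_education = (education == 3) if !missing(education)
regress wage medium_education high_education        //Same results as regress wage i.education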

As the category low education is our reference, we find this effect in the constant. A person with low education will earn 4.88 on average. A person with medium education will make 1.90 more than that (4.88 + 1.90), a person with high education 5.21 more (4.88 + 5.21). Both coefficients are highly significant, telling us that, in comparison to low education, the two other groups show a significant difference. We would thus conclude that education has a positive effect on wage, and that education pays off financially. Also note that our R-squared is higher than in the first model, telling us that education can explain more variation in wage than union membership.

Stata will always use the category with the lowest numerical value as the reference category. You can display the reference category explicitly by typing

set showbaselevels on, perm

before running your regression command.

When you want to change your category of reference, recode the independent variable or use an option for the factor variable notation.

regress wage ib3.education

Here category three (high education) is the reference. You will notice that all coefficients change. This must happen, since all relative comparisons change as well. Someone with low education will have a wage of 10.09 − 5.21 = 4.88, which is exactly the same value as calculated above. You see that changing the reference categories of independent variables does not change the results; you should choose them in a fashion that helps you understand the results.
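
If you want to verify this with the stored coefficients of the last model, a minimal sketch:

display _b[_cons] + _b[1.education]   //10.09 - 5.21 = 4.88, the average wage for low education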

What about ANOVAs?

Analysis of variance (ANOVA) is a quite popular method in psychology, and some readers might wonder why it is not included in the book. The answer is that ANOVAs are very similar to OLS regressions and internally Stata uses regression algorithms when calculating ANOVAs. Furthermore, you can always “simulate” an ANOVA with a regression model, while this is not possible the other way round. My personal advice is that, although many courses start with ANOVAs, you can directly use regression models and get the same results while keeping all your options open when it comes to advanced analyses, which can be built on your initial model.

But let’s start simply. Usually an ANOVA is used to determine whether there are any differences in a metric variable between three or more groups. In our example, we test whether blood pressure differs between three groups which are defined by the age of the participants. You can think of an ANOVA as a more general version of a t-test (see page 69), which tells you whether there are any differences in the mean outcome for an arbitrary number of groups. When the result is significant, you know that at least two groups show a statistically significant difference in their means. We can run an example using another dataset (so make sure to save your current results before typing the next commands!):

sysuse bpwide, clear                    //Open new example dataset
oneway bp_before agegrp, tabulate       //Run ANOVA

or click Statistics → Linear models and related → ANOVA/MANOVA → One-way ANOVA.

The command runs the ANOVA, where your dependent variable (which has to be continuous) is blood pressure and the groups are defined by the second variable, agegrp. The option tabulate displays the means and standard deviations for each group. You can also obtain all pairwise contrasts when you include the option bonferroni.

In the upper part of the table you see the means for each group, which already tell us that at least one difference seems quite large (151.7 vs. 162.6). The interesting part, in the next section of the output, is the result under Prob > F, which is smaller than 0.05 and therefore indicates that the result is significant. We conclude that at least two groups show a statistically significant difference in their means. Bartlett’s test indicates a non-significant result (Prob>chi2 = 0.409), which is good; otherwise we would have to conclude that the variances between the groups are unequal, which would violate the assumptions of the ANOVA.
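
If you also want to know which specific pairs of groups differ, you can rerun the command with the bonferroni option mentioned above; a minimal sketch:

oneway bp_before agegrp, tabulate bonferroni        //ANOVA with pairwise group comparisons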

You can easily come to the same conclusions using a regression model. Just type:

regress bp_before i.agegrp

Using the i. prefix tells Stata to treat agegrp as a categorical variable. At the top of the output you see the F-statistic and Prob > F, which are identical to the ones displayed by the ANOVA (11.23 and 0.0000). You can also see more detailed results in the lower parts of the output. The first category (30–45) is used as the reference and is therefore not shown in the table. We learn that the third group (60+) displays a highly significant result (P>|t| is smaller than 0.05 here), so we conclude that this group differs from the reference group.

What if you want to test if age-group 2 (46–59) is statistically different from age-group 3 (60+)? You have several possibilities: firstly you could change the category of reference in the regression model (see page 89). Secondly, you can use the test command to test this numerically:

test 2.agegrp = 3.agegrp           //Output omitted

You can find this test under Statistics → Postestimation → Test, contrasts, and comparisons of parameter estimates and click Create. As the result (0.0019) is significant (Prob > F is smaller than 0.05) you know that the two group means are different from each other.

In summary, you should keep in mind that ANOVAs and linear regressions are very similar from a technical point of view, while regressions are more versatile and powerful for advanced analyses, hence, the emphasis on these models in the rest of the book.

6.5 Metric independent variable

Lastly, we want to use a metric (continuous) explanatory variable. We type

regress wage c.ttl_exp

The table shows a highly significant effect, with a numerical value of 0.33, and 3.61 for the constant. Thus a person with zero years of total work experience would earn 3.61$ on average, and with each additional year she would receive 0.33$ more. Therefore, a person with five years of work experience would earn 3.61 + 5 · 0.33 = 5.26$. Note that the numbers look like this because work experience is coded in years; if we used months instead, the coefficients would be different, but the actual effects would stay the same. We conclude that the effect of work experience on wage is positive, which makes sense intuitively, as experience should be beneficial for workers.
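
Instead of computing this prediction by hand, you can use the stored coefficients; a minimal sketch, typed directly after the regression:

display _b[_cons] + 5*_b[ttl_exp]     //Predicted wage for five years of work experience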

In chapter five we learned that controlling for the correct variables is essential when we want to recover causal effects. We will do this by including some more variables: union status, place of residence (South vs. other) and years of education (as a metric variable). Therefore, our final (full) model is estimated with the following command41:

regress wage c.ttl_exp i.union i.south c.grade

Note that the coefficient of work experience became slightly smaller, yet it is still highly significant. R-squared also increased drastically, as our model with four independent variables is able to explain much more of the variance in wages. One could interpret our results as follows: “the effect of work experience on wage is highly significant, and each additional year of experience will result in a wage increase of 0.27$ on average, when controlling for union membership, region and years of education”. Usually it is not necessary to explain the effect of every single control variable, as you are interested in one specific effect. Note that we call a model with more than one independent variable a multiple regression.

It is worth taking some time to understand the correct interpretation of the result. The positive effect of work experience is independent of all other variables in the model, which are union status, region and education. Or to formulate it differently: every extra year of work experience increases the wage by 0.27$ when holding all other variables in the model constant (ceteris paribus interpretation). When you build your model within the framework of causal analysis and have selected the correct variables to control for (closing all back-door paths), you could even state that work experience is a cause of income (which is probably wrong in our example, as we have not done all the important steps and have created an ad-hoc model to serve as a general example).

In chapter ten we will continue to interpret and visualize the effects that we have estimated using regressions. Finally, it is time to come back to our hypotheses and see whether our results support them or not.

  • H1 claims that union members will earn more than non-union members. As the coefficient of the variable union is positive and the p-value highly significant (page 85), we can state: union members do, on average, earn more money than non-union members. We therefore accept hypothesis one.42
  • H2 claims that more educated people will earn more money. We can state: people with higher education do earn more money on average, as, in contrast to the lowest category of education, both other coefficients show a positive and highly significant result (page 88). We therefore accept hypothesis two.
  • H3 claims that people with more work experience earn more money. We can state: as the coefficient for work experience is positive and highly significant, people with more work experience do earn more money on average, after controlling for union membership, region and education (page 92). We therefore accept hypothesis three.

Confidence intervals

Confidence intervals are a common type of interval estimation for expressing the uncertainty in a statistic. In the regression output so far you have seen that Stata also reports a confidence interval for each coefficient. The problem is that we (mostly) work with samples from a much larger population, which means that all statistics we calculate are probably not identical to the results we would get if we could use all cases that exist. For example, our sample consists of working women between 34 and 46 years of age. We want to know their average work experience, which yields 12.53 years (summarize ttl_exp). Suppose we did not only have a sample, but interviewed every single woman in the USA between 34 and 46. We would then probably get a different result, and not 12.53. A confidence interval gives us a measure of how much trust we can put in our statistic. To compute it, just type

ci means ttl_exp

or click Statistics → Summaries, tables, and tests → Summary and descriptive statistics → Confidence intervals. The standard is a 95% interval. The standard error of the mean is 0.097, and the calculated confidence interval is [12.34; 12.73]. Be careful with the interpretation, as many people get it wrong and it is even printed incorrectly in journal articles (Hoekstra, Morey, Rouder, & Wagenmakers, 2014)! A correct interpretation would be: “If we were to redraw the sample over and over, 95% of the time, the confidence intervals contain the true mean.”43 Of course, our sample must be a random sample of the population for this statement to be valid. When we know that our sample is biased, say we only interviewed people from New York, then the entire statistic is biased.

To understand the interpretation, remember that there must be a true value for our statistic, which we would know if we had interviewed not a sample, but every single person. Imagine we went out and collected a sample, not only once, but 100 times, independently. Then in 95 of these 100 samples the calculated confidence interval would contain the true value; in five of the samples it would not. Also remember that a confidence interval gets larger when we increase the level. That is why a 99% confidence interval for work experience would be [12.28; 12.79] and thus broader than the one calculated before. To see why this is true, consider the extreme case of a 100% confidence interval: as this must always include the true value, it has to range from zero to infinity!
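
You can reproduce the 99% interval mentioned above by adjusting the level option; a minimal sketch:

ci means ttl_exp, level(99)           //99% confidence interval [12.28; 12.79]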

TL;DR44: Never use this interpretation: “the probability that the true value is contained in the interval is 95%.”

6.6 Interaction effects*

Until now we have assumed that all effects have the same strength for all persons. For example, the effect of work experience on wage is 0.27, no matter whether you are a union member or not, whether you are from the South or not, or whatever your education is. We call this the average effect of experience, and often this is good enough. But sometimes we think, based on our theoretical reasoning, that there are subgroups which are affected quite differently by some variables. For example, we could argue that there is an interaction effect between being a union member and having a college education with respect to wages. Stated differently, we expect that the possession of a college degree moderates how union membership affects income. In this example union membership is the main effect, while college education is the moderating (interacting) variable. Finally, it is recommended that you draw a causal graph of the model, which could look like this (Figure 6.1):

Figure 6.1: An interaction effect.

Interaction effects might sound complicated at first, but they are very common in data analysis, and it is very useful to take some time and make sure that you really understand what they mean. Thanks to factor variable notation it is very easy to calculate these effects in Stata. Generally, a three-stage procedure is recommended: your first model only includes the main effect. The second model includes the main effect, all control variables and also the interacting variable, but without the interaction effect itself. Finally, the third model adds the interaction effect. As we want to keep it simple and only specify total work experience as a control variable, we would run these three models:

regress wage i.union                         //M1 (Output omitted)
regress wage i.union i.collgrad c.ttl_exp    //M2 (Output omitted)
regress wage i.union##i.collgrad c.ttl_exp   //M3

By typing two hash signs (##) between i.union and i.collgrad you tell Stata to calculate the main effect of union membership, the main effect of college education and, additionally, the interaction effect between both. When you type just a single hash sign (#), Stata will only calculate the interaction effect, which is usually not what we want. Again, the order of the independent variables is arbitrary. Also note that this notation is symmetric: Stata does not know which variable is the “main” one and which is the moderator as, from a mathematical point of view, this cannot be distinguished. It is up to you to define and interpret the results in the way you desire, just as we did in Figure 6.1.

In the following I will show different options for interpreting and visualizing the results. Which option you find most comfortable is up to you.

6.6.1 The classic way

First, we will use the output of the model and interpret results directly, which can be slightly challenging. We see that the coefficient of union is positive and highly significant (1.27), telling us that union-members earn more on average than non-members. Exactly the same is true for college education (3.40). Finally, we see that the interaction effect (−0.936) is negative and also significant (p-value smaller than 0.05). We can now calculate the overall effect of union-membership for two groups, the people with college education and those without.

Effect(union | no college) = 1.27 + (−0.936) · college education = 1.27 + 0 = 1.27
Effect(union | college) = 1.27 + (−0.936) · college education = 1.27 − 0.936 = 0.334

Here we calculate the effect of union membership for both groups, using the classic way. Our conclusion is that the effect is much stronger for people without college, which means that people with less education profit more from union membership. As you have noticed, this procedure requires you to calculate effects by hand, which rapidly becomes more complex as soon as more variables or further interactions are present. Therefore, we will kindly ask Stata to do this for us.
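
One way to let Stata do this particular calculation is the lincom command, which combines coefficients and also reports a standard error; a minimal sketch, run directly after model M3:

lincom 1.union + 1.union#1.collgrad   //Effect of union membership for people with college education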

6.6.2 Marginal effects

Using margins we can easily calculate the marginal effects of union membership, that is, the effects for the two groups (people with college education vs. people without college education). Therefore, we run

margins, dydx(union) at(collgrad=(0 1))

If you want to use point-and-click, go to Statistics → Postestimation → Marginal analysis → Marginal means and marginal effects.

This command tells Stata to calculate the marginal effect of union membership (the dydx option), separately for people with college education (collgrad = 1) and people without (collgrad = 0). Note that we run this command directly after the model and do not have to specify anything else, because Stata takes the factor-variable notation into account and includes the interaction. The results are what we calculated manually (differences are due to rounding), so we come to the same conclusions, minus the trouble of calculating everything ourselves. Also keep in mind that, strictly speaking, union membership only has a significant effect for the people without college education, as this p-value is very low and below 0.05. The p-value for the other group is much higher (0.357), telling us that this coefficient is probably not statistically different from zero; therefore, there is no real effect left for this group.

6.6.3 Predicted values

Another option I want to introduce here does not emphasize effects of certain variables, but rather uses all information in the model (therefore, also the data from the control variables) to predict the overall outcomes for certain groups (which implicitly also tells us something about effects or differences between groups). Again, we can use margins here:

margins, at(union = (0 1) collgrad=(0 1))

Stata directly calculates the expected wages and also takes the effect of work experience into account. For example, a person who is not in a union and also holds no college degree would earn 6.49$ on average. When you want to compare results between groups, make sure you get the correct comparisons: in this case, we would compare group 1 vs. 3 (no college education) and group 2 vs. 4 (college education) to assess the effect of union membership on wages. You can also get a visual output by typing

marginsplot

Again, we come to the conclusion that the effect of union-membership is moderated by degree of education. Union-membership has a quite strong effect for people without college education, but a much lower effect for highly educated people.

margins is a very powerful command that we will use in chapter ten to visualize our results. If you still feel insecure about interactions, don’t be discouraged, as margins makes it simple to get informative results, even when there are many interactions present in your model. If our model were slightly more complex, even experts would not calculate these effects by hand, but would use Stata to get nice graphics that can be interpreted visually.

6.6.4 Separate analyses by subgroups

A final technique for dealing with interactions is to run separate regressions for each subpopulation, which is defined by your interaction variable. In the example above, our interacting variable is college education, which means we have two groups: people who have college education and people without college education. We can remove the interacting variable from our regression model and instead run two models, the first for people with college education, the second for people without.

bysort collgrad: regress wage i.union c.ttl_exp

To understand what bysort does, see how we can get the exact same results with the if qualifier:

regress wage i.union c.ttl_exp if collgrad == 0     //Output omitted
regress wage i.union c.ttl_exp if collgrad == 1     //Output omitted

By comparing the coefficients of union, you can check whether there is any interaction. When the coefficients are quite similar, we would conclude that there is no interaction at all. In our case the results underline that there are differences (1.27 for people without college, 0.33 for people with college). You will notice that these results are very close to what we calculated above as “marginal effects”.

This split-technique is preferred when you have many variables in the model and expect many interaction effects. For example, when you expect interactions not only between union and college, but also between total work experience and college, you would normally have to specify the second effect in another interaction term. When you run separate regressions by groups, the model implicitly accounts for any possible interaction between your grouping variable and any other explanatory variable used in the model. Basically, this design can be helpful when you expect your groups to differ strongly from each other in a large number of effects. If necessary, you could even specify explicit interactions within this split-design to account for higher orders of interaction (if there is any theoretical expectation of this). The main downside of this split approach is that your two groups are no longer in the same model, and therefore you cannot compare coefficients easily. For example, it is no longer possible to tell whether the coefficient of experience in model one (0.289) is statistically different from the same coefficient in model two (0.32).
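
If you nevertheless want such a comparison within a single model, one possibility (a sketch, assuming you want to allow every effect to differ by college education) is to interact the grouping variable with all explanatory variables, which mimics the split-design while keeping both groups in one model:

regress wage i.collgrad##(i.union c.ttl_exp)        //Fully interacted model
test 1.collgrad#c.ttl_exp                           //Does the experience effect differ between the groups?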

To summarize this section, interaction effects are often highly interesting and central to many research questions. Possible examples are: that a drug has different effects on men and women, a training program affects young and old workers differently, or a newly introduced tax influences spending of large and small firms differently. Whenever you want to test interactions, you should have a clear theoretical framework in mind that predicts different outcomes. Thanks to Stata, actually calculating and interpreting interactions is as easy as can be.

6.7 Standardized regression coefficients*

Standardized regression coefficients, or beta coefficients, are quite popular in some disciplines. They basically have two applications: firstly, they make it possible to compare different studies which try to explain the same phenomenon but use different units of measurement (for example, one study measures time in days, while the other measures it in months). Secondly, they make it possible to compare effect sizes for variables with different units within one study (for example, when you want to know what affects life satisfaction more: income or the number of close friends). The idea is as follows: you z-standardize (see the formula in the footnote on page 49) your dependent variable and all (metric) independent variables and use these modified variables in your regression. Stata makes this easy:

regress wage c.ttl_exp, beta

The output shows the normal coefficient (0.331) and the standardized coefficient (0.265). The interpretation is as follows: when the work experience of a woman increases by one standard deviation, the wage increases by 0.265 standard deviations. The result is highly significant.

If you would like to see how this works in detail, you can reproduce the results on your own:

quietly sum wage                              //Summarize to obtain r(mean) and r(sd)
generate zwage = (wage-r(mean))/r(sd)         //z-standardize the dependent variable
quietly sum ttl_exp
generate zttl_exp = (ttl_exp-r(mean))/r(sd)   //z-standardize the independent variable
regress zwage c.zttl_exp                      //Regression with standardized variables

You see that the results are identical. When you want to learn more about standardized coefficients in Stata, have a look at the paper by Doug Hemken.46
