In the previous section, we saw how to perform simple regression analysis in R. Multiple regression is more complex to compute, but most of what we have already seen applies to it as well.
In what follows, we will use a dataset of 40 cases generated from a covariance matrix obtained from a subsample of real data we collected on burnout components, work satisfaction, work-family conflict, and organizational commitment in hospitals. There are six attributes in the dataset that we will analyze here; all are self-assessments made by nurses:
Commit
: Commitment to their hospital (the criterion here)

Exhaus
: Emotional exhaustion (one of the three components of burnout)

Depers
: Depersonalization (one of the three components of burnout)

Accompl
: Accomplishment (one of the three components of burnout)

WorkSat
: Work satisfaction

WFC
: Work-family conflict

Our goal here is to understand how burnout dimensions and work satisfaction affect the commitment of nurses to their hospital.
We start by generating the data and examining the correlation table and significance. Make sure the matcov.txt
file is in your working directory before running this code:
library(psych)
install.packages("MASS"); library(MASS)
matcov = unlist(read.csv("matcov.txt", header=F))
covs = matrix(matcov, 6, 6)
means = c(4.47,14.95,4.87,36.08,5,1.88)
set.seed(987)
nurses = data.frame(mvrnorm(n=40, means, covs))
colnames(nurses) = c("Commit","Exhaus","Depers","Accompl",
   "WorkSat","WFC")
corr.test(nurses)
The output is provided here:
The values with a probability value lower than 0.05
are significant by common standards. We can see, for instance, that, in this subsample, commitment is significantly correlated with exhaustion, work satisfaction, and work-family conflict, but not with depersonalization and accomplishment. We can also see that the predictors are intercorrelated—that is, they share part of their variance. We will examine whether this constitutes a problem for a regression analysis later.
Let's plot the variables to check whether the relationships indeed seem linear:
plot(nurses)
Here, we will only comment on the scatterplots in which commitment is included. There is visibly a negative linear association between commitment and exhaustion, and between commitment and work-family conflict, and a visibly positive linear relationship between commitment and work satisfaction. Notice that other relations are also visible on the plots, such as the relation between work-family conflict and exhaustion. From these scatterplots, nothing in the data seems problematic for the relationships we are exploring.
We will include some more regression diagnostics, but you are also encouraged to read about the assumptions of linear regression (for instance, at http://www.basic.northwestern.edu/statguidefiles/mulreg_ass_viol.html), and check whether the data fulfils these assumptions. If the assumptions of regression are violated, it is possible (actually probable) that the results are unreliable. Readers can also read about the detection of multivariate outliers here: https://stat.ethz.ch/education/semesters/ss2012/ams/slides/v2.2.pdf.
Sadly, there is not enough space to check all these here.
We want to know if there is a relationship between our predictors and the criterion. We first want to know whether the three burnout dimensions predict commitment to the hospital.
We create the model by using the formula syntax as an argument in the lm() function. What is on the left of the tilde (~) sign is the criterion; what is on the right are the predictors, separated by plus (+) signs:
model1 = lm(Commit ~ Exhaus + Depers + Accompl, data = nurses)
Let's examine the coefficients and their significance in the summary of the model:
summary(model1)
The following output shows that exhaustion and accomplishment are predictors of commitment to the hospital (look at the p-values under Pr(>|t|), or refer to the * markers): exhaustion negatively (more emotionally exhausted people are less committed) and accomplishment positively (more accomplished people are more committed):
We can also see that the p-value for the F-statistic is significant (bottom of the output), and that 55 percent of the variance (see Multiple R-squared) is predicted. The adjusted R-squared takes the number of predictors into account in the calculation of its value. It is recommended that you specify which value you use when reporting the results, or report both. Here, the Adjusted R-squared is just a bit lower than the Multiple R-squared, meaning that the results are not much affected by the number of predictors.
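The relationship between the two values can be made explicit. Here is a minimal sketch (the adj_r2 function name is ours, and the figures plugged in are rounded, not copied from the output above): adjusted R-squared shrinks multiple R-squared according to the sample size n and the number of predictors k:

```r
# Hedged sketch: adjusted R-squared computed from multiple R-squared.
# r2 is the multiple R-squared, n the sample size, k the number of predictors.
adj_r2 = function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

# With R-squared = 0.55, n = 40 and k = 3 (approximate figures for model1),
# the shrinkage is modest:
adj_r2(0.55, 40, 3)  # 0.5125
```

With few predictors relative to the sample size, the correction stays small, which is exactly what we observe in the summary of model1.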
We have seen that it is important that residuals are normally distributed. We can check this visually by plotting a histogram of the residuals, as in the following line of code:
hist(resid(model1), main="Histogram of residuals", xlab="Residuals")
From the preceding output, we might suspect a slight deviation from normality.
We can test this suspicion with the Shapiro-Wilk test, using the following code:
shapiro.test(resid(model1))
The output follows and shows that the residuals do not significantly depart from normality:
        Shapiro-Wilk normality test

data:  resid(model1)
W = 0.9776, p-value = 0.6001
Let's examine the Q-Q plot (type plot(model1) and click on the R Graphics window until the appropriate plot appears). This leads to a similar conclusion but, as in the preceding iris dataset example, some observations might prove problematic. We briefly discuss robust regression, which does not assume a normal distribution of residuals, at the end of the chapter. For now, we proceed with the following regression diagnostic.
We also want to check whether there is a problem of variance inflation in our analysis, that is, whether the predictors are strongly correlated with each other (multicollinear). For this purpose, we will rely on the vif() function of the HH package. The function takes the lm formula as an argument:
install.packages("HH"); library(HH)
vif(Commit ~ Exhaus + Depers + Accompl, data = nurses)
The output follows:
There are several rules of thumb to assess this. One is to consider vif values higher than 10 to be problematic; another is to consider a predictor as problematic if the square root of its vif value is higher than 2. Neither is the case here; we therefore consider our data to be non-multicollinear.
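The square-root rule can be automated. The following is a small sketch; the flag_collinear function and the vif values passed to it are hypothetical, for illustration only, not taken from the output above:

```r
# Hedged sketch: flag predictors whose sqrt(VIF) exceeds a threshold.
flag_collinear = function(vifs, threshold = 2) {
  sqrt(vifs) > threshold
}

# Hypothetical VIF values for illustration only
flag_collinear(c(Exhaus = 1.8, Depers = 2.1, Accompl = 1.3))
```

In practice, you would pass the named vector returned by vif() to such a helper instead of inspecting the values by eye.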
Let's now examine whether including work satisfaction permits us to predict an additional part of the variance. We will first ask R to fit a second model, and then compare model1 and model2 using the anova() function:
model2 = lm(Commit ~ Exhaus + Depers + Accompl + WorkSat, data = nurses)
anova(model1, model2)
The following output shows that the second model indeed predicts additional variance in comparison to model1 (see the significance of the F-statistic for the comparison, under Pr(>F)):
Here is the analysis of variance table:
We will now examine the second model, as the additional variance predicted is significantly different from 0:
summary(model2)
The output is provided here:
This model predicts 70 percent of variance in commitment, which is pretty good. We can see that work satisfaction is a significant predictor of commitment to the hospital, that the unique contribution of accomplishment is no longer significant (there is therefore a potential mediation), and that the contribution of exhaustion has been reduced when including work satisfaction in the model (there is therefore a potential partial mediation). This might be because of a mediation of the relationship between the two burnout components and commitment by job satisfaction. Let's test this relationship:
model3 = lm(WorkSat ~ Exhaus + Depers + Accompl, data = nurses)
summary(model3)
Let's examine the output:
We can notice that 51 percent of the variance of job satisfaction is predicted by the burnout components. All three burnout components are significantly related to work satisfaction (p < .05), negatively for emotional exhaustion and depersonalization and positively for personal accomplishment.
In order to ascertain mediation, we need to perform Sobel tests. The bda package contains the necessary function, called mediation.test(). Let's see whether the effect of exhaustion on commitment is mediated by work satisfaction:
install.packages("bda"); library(bda)
mediation.test(nurses$WorkSat, nurses$Exhaus, nurses$Commit)
In the following output, under Sobel, we can see that the p.value is significant. The presence of work satisfaction in the model decreases the effect of exhaustion, and work satisfaction remains significant even though exhaustion is present in the model. Because the Sobel test is significant, we can confirm that there is indeed a partial mediation of the effect of exhaustion on commitment by work satisfaction. In other words, exhaustion decreases work satisfaction, which in turn increases commitment.
The value resulting from the Sobel test follows a z distribution. In order to obtain this value, the slope coefficient of the mediator regressed on the predictor (a) is multiplied by the slope coefficient of the criterion regressed on the mediator (b). This value is then divided by the square root of: b squared multiplied by the squared standard error of a, plus a squared multiplied by the squared standard error of b. The formula is as follows:

z = (a * b) / sqrt(b^2 * SEa^2 + a^2 * SEb^2)
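The computation just described can be sketched as a small helper. The sobel_z function name and the example coefficients below are ours, for illustration only; in practice, a and its standard error come from the summary of the regression of the mediator on the predictor, and b and its standard error from the summary of the regression of the criterion on the mediator:

```r
# Hedged sketch of the Sobel z value:
# a, se_a: slope and standard error from the mediator-on-predictor regression
# b, se_b: slope and standard error from the criterion-on-mediator regression
sobel_z = function(a, se_a, b, se_b) {
  (a * b) / sqrt(b^2 * se_a^2 + a^2 * se_b^2)
}

# Illustrative (made-up) coefficients
sobel_z(a = -0.5, se_a = 0.1, b = 0.4, se_b = 0.1)  # roughly -3.12
```

The resulting z value can be compared to the standard normal distribution; its two-tailed p-value is what mediation.test() reports under Sobel.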
Showing this is important, as very often, analysts include dozens or hundreds of predictors in their models without taking into consideration that the included predictors could themselves be related to each other. Readers are therefore advised to check for meaningful relationships between the attributes they intend to include as predictors in regression analyses before drawing conclusions on the final model!
A particularly interesting use of regression is to examine how well a model predicts new data. This is easily achieved in R. We will first build the dataset named nurses2
in the same way we did for the first dataset:
matcov2 = unlist(read.csv("matcov2.txt", header = F))
covs2 = matrix(matcov2, 6, 6)
means2 = c(4.279, 13.152, 5.156, 39.28, 5.153, 1.875)
set.seed(987)
nurses2 = data.frame(mvrnorm(n=40, means2, covs2))
colnames(nurses2) = c("Commit","Exhaus","Depers","Accompl",
   "WorkSat","WFC")
To apply the model to new data, we rely on the predict.lm() function:
predicted = predict.lm(model1, nurses2)
This results in a vector of predicted values for the criterion, which we call predicted.
As we also have the real values for the commitment of individuals to the hospital in the second dataset, we can examine the correlation between the two:
cor.test(predicted, nurses2$Commit)
The following output shows that the correlation between the predicted values and the real values is 0.5766194. This value is significant and might seem pretty good at first sight:
        Pearson's product-moment correlation

data:  predicted and nurses2$Commit
t = 4.3506, df = 38, p-value = 9.848e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3231561 0.7528925
sample estimates:
      cor
0.5766194
Let's square this value to know how much of the variance in the commitment of the individuals of the second sample is predicted by the model:
0.5766194^2 * 100
The output is 33.24899. This means only 33 percent of the variance in commitment is predicted by the model, compared to 55 percent in the training data!
Now, we can also compute the residuals:
residuals_test = nurses2$Commit - predicted
We are now able to compute the F value for our model. We have seen that the F value is used to assess the overall significance of the model. In our case, it is obtained as follows:

F = MSModel / MSError = (SSModel / DFModel) / (SSError / DFError)

where SSModel is the sum of the squared deviations of the predicted values from the mean of the observed values, SSError is the sum of the squared residuals, DFModel is the number of predictors, and DFError is the number of observations minus the number of predictors minus 1.
The following function does just that:
ComputeF = function(predicted, observed, npred) {
   DFModel = npred # the number of predictors
   DFError = length(observed) - DFModel - 1
   SSModel = sum((predicted - mean(observed))^2)
   SSError = sum((observed - predicted)^2)
   MSModel = SSModel / DFModel
   MSError = SSError / DFError
   F = MSModel / MSError
   F
}
Let's first try it with the original model to check whether the function works correctly. The output is 14.67, which is the same as the value displayed when we requested the summary of model1 (model1[5] contains the fitted values):
ComputeF(unlist(model1[5]), nurses$Commit, 3)
Now let's try to see if the new data is predicted well enough by the model:
ComputeF(predicted, nurses2$Commit, 3)
The outputted F value is 10.4842.
We can test this value using the following line of code. The output shows that the critical F value at the 0.05 level of the F distribution for our model is 2.866266:
qf(.95, df1=3, df2=36)
We can therefore trust that our model significantly predicts new data, as the F value we obtained is higher than this critical value.
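The same conclusion can be reached by computing the p-value directly from the F distribution instead of comparing against the critical value. A minimal sketch, using the F value and degrees of freedom obtained above:

```r
# Hedged sketch: p-value for the F value obtained on the test data,
# with 3 and 36 degrees of freedom (3 predictors, 40 - 3 - 1 residual df)
p = pf(10.4842, df1 = 3, df2 = 36, lower.tail = FALSE)
p < 0.05  # TRUE: the model significantly predicts the new data
```

Reporting the exact p-value is often preferable to a simple comparison against the critical value, as it conveys how far the result is from the threshold.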