Analyzing data in R: correlation and regression

In the previous section, we saw how to perform simple regression analysis in R. We also saw that multiple regression is more complex to compute, but we discussed that most of what we have already seen applies to multiple regression as well.

First steps in the data analysis

In what follows, we will use a dataset of 40 cases generated from a covariance matrix obtained from a subsample of real data we collected in hospitals, concerning burnout components, work satisfaction, work-family conflict, and organizational commitment. There are six attributes in the dataset that we will analyze here; all are self-assessments made by nurses:

  • Commit: Commitment to their hospital (criterion here)
  • Exhaus: Emotional exhaustion (one of the three components of burnout)
  • Depers: Depersonalization (one of the three components of burnout)
  • Accompl: Accomplishment (one of the three components of burnout)
  • WorkSat: Work satisfaction
  • WFC: Work-family conflict

Our goal here is to understand how burnout dimensions and work satisfaction affect commitment of nurses to their hospital.

We start by generating the data and examining the correlation table and significance. Make sure the matcov.txt file is in your working directory before running this code:

library(psych)   # install.packages("psych") first if needed; provides corr.test()
install.packages("MASS"); library(MASS)   # provides mvrnorm()
# read the covariance matrix and reshape it into a 6 x 6 matrix
matcov = unlist(read.csv("matcov.txt", header = F))
covs = matrix(matcov, 6, 6)
means = c(4.47, 14.95, 4.87, 36.08, 5, 1.88)
# generate 40 multivariate normal cases from the means and covariances
set.seed(987)
nurses = data.frame(mvrnorm(n = 40, means, covs))
colnames(nurses) = c("Commit", "Exhaus", "Depers", "Accompl",
   "WorkSat", "WFC")
# correlation matrix with significance tests
corr.test(nurses)

The output is provided here:

[Output: correlation matrix and p-values from corr.test()]

The values with a probability value lower than 0.05 are significant by common standards. We can see, for instance, that, in this subsample, commitment is significantly correlated with exhaustion, work satisfaction, and work-family conflict, but not with depersonalization and accomplishment. We can also see that the predictors are intercorrelated—that is, they share part of their variance. We will examine whether this constitutes a problem for a regression analysis later.
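If you want to work with these values programmatically, corr.test() returns the correlations and p-values as matrices (a quick sketch; in psych, the values above the diagonal of $p are adjusted for multiple tests by default):

ct = corr.test(nurses)
round(ct$r, 2)   # correlation matrix
round(ct$p, 3)   # p-values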

Let's plot the data to check whether the relationships indeed seem linear:

plot(nurses)

Scatterplots of the attributes in the nurses dataset

Here, we will only comment on the scatterplots that include commitment. We can see a visible negative linear association between commitment and both exhaustion and work-family conflict, and a visible positive linear relationship between commitment and work satisfaction. Notice that other relationships are also apparent on the plots, such as the one between work-family conflict and exhaustion. From these scatterplots, nothing in the data seems problematic for the relationships we are exploring.

We will include some more regression diagnostics, but you are also encouraged to read about the assumptions of linear regression (for instance, at http://www.basic.northwestern.edu/statguidefiles/mulreg_ass_viol.html), and check whether the data fulfils these assumptions. If the assumptions of regression are violated, it is possible (actually probable) that the results are unreliable. Readers can also read about the detection of multivariate outliers here: https://stat.ethz.ch/education/semesters/ss2012/ams/slides/v2.2.pdf.

Sadly, there is not enough space to check all these here.

Performing the regression

We want to know whether there is a relationship between our predictors and the criterion. First, we will examine whether the three burnout dimensions predict commitment to the hospital.

We create the model by using the formula syntax as an argument in the lm() function. What is on the left of the tilde (~) sign is the criterion; on the right are the predictors, separated by plus (+) signs:

model1 = lm(Commit ~ Exhaus + Depers + Accompl, data = nurses)

Let's examine the coefficients and their significance in the summary of the model:

summary(model1)

The following output shows that exhaustion and accomplishment are predictors of commitment to the hospital (look at the p-values under Pr(>|t|), or refer to the asterisks): exhaustion negatively (more emotionally exhausted people are less committed) and accomplishment positively (more accomplished people are more committed):

[Output: summary of model1]

We can also see that the p-value for the F-statistic is significant (bottom of the output), and that 55 percent of the variance (see Multiple R-squared) is predicted. The adjusted R-squared takes the number of predictors into account in the calculation of its value. It is recommended that you specify which value you use when reporting the results, or you can report both values. Here, the Adjusted R-squared is just a bit lower than the Multiple R-squared, meaning that the results are not much affected by the number of predictors.
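If you want to report these values without reading them off the console, they can be extracted from the summary object; a minimal sketch:

s1 = summary(model1)
s1$r.squared       # Multiple R-squared
s1$adj.r.squared   # Adjusted R-squared
coef(s1)           # coefficient table, including Pr(>|t|)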

Checking for the normality of residuals

We have seen that it is important that the residuals are normally distributed. We can check this visually by plotting a histogram of the residuals, as in the following code:

hist(resid(model1), main="Histogram of residuals", 
   xlab="Residuals")

From the preceding output, we might suspect a slight deviation from normality.


Histogram of the residuals

We can test this suspicion with the Shapiro-Wilk test, using the following code:

shapiro.test(resid(model1))

The output follows and shows that the residuals do not significantly depart from normality:

        Shapiro-Wilk normality test
data:  resid(model1)
W = 0.9776, p-value = 0.6001

Let's examine the Q-Q plot (type plot(model1) and click on the R Graphics window until the appropriate plot appears). This leads to a similar conclusion but, as in the preceding iris dataset example, some observations might prove problematic. We briefly discuss robust regression at the end of the chapter; robust regression does not assume a normal distribution of residuals. For now, we proceed with the following regression diagnostic.
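If you prefer to display the Q-Q plot directly instead of clicking through the diagnostic plots, the which argument of plot() selects it; a quick sketch:

plot(model1, which = 2)   # the second diagnostic plot is the normal Q-Q plot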

Checking for variance inflation

We also want to check whether there is a problem of variance inflation in our analysis, that is, whether the predictors are strongly intercorrelated (multicollinear). For this purpose, we will rely on the vif() function of the HH package. The function takes the lm() formula as an argument:

install.packages("HH"); library(HH)
vif(Commit ~ Exhaus + Depers + Accompl, data = nurses)

The output follows:

[Output: VIF values for the three predictors]

There are several rules of thumb to assess this. One is to consider VIF values higher than 10 to be problematic; another is to consider a predictor as problematic if the square root of its VIF value is higher than 2. Neither is the case here; we therefore consider our data to be non-multicollinear.
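For intuition about what vif() computes: the VIF of a predictor is 1 / (1 - R²), where R² is obtained by regressing that predictor on the other predictors. A minimal sketch for exhaustion:

# R-squared from regressing Exhaus on the two other predictors
r2_exhaus = summary(lm(Exhaus ~ Depers + Accompl, data = nurses))$r.squared
1 / (1 - r2_exhaus)   # should match the vif() output for Exhaus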

Examining potential mediations and comparing models

Let's now examine whether including work satisfaction permits us to predict an additional part of the variance. We will first ask R to fit a second model, and then compare model1 and model2 using the anova() function:

model2 = lm(Commit ~ Exhaus + Depers + Accompl + WorkSat, 
   data = nurses)
anova(model1,model2)

The following output, an analysis of variance table, shows that the second model indeed predicts additional variance in comparison to model1 (see the significance of the F-statistic for the comparison, under Pr(>F)):

[Output: ANOVA table comparing model1 and model2]

We will now examine the second model, as the additional variance predicted is significantly different from 0:

summary(model2)

The output is provided here:

[Output: summary of model2]

This model predicts 70 percent of the variance in commitment, which is pretty good. We can see that work satisfaction is a significant predictor of commitment to the hospital, that the unique contribution of accomplishment is no longer significant (there is therefore a potential mediation), and that the contribution of exhaustion is reduced when work satisfaction is included in the model (there is therefore a potential partial mediation). This might be because the relationship between these two burnout components and commitment is mediated by work satisfaction. Let's test this relationship:

model3 = lm(WorkSat ~ Exhaus + Depers + Accompl, data = nurses)
summary(model3)

Let's examine the output:

[Output: summary of model3]

We can notice that 51 percent of the variance of work satisfaction is predicted by the burnout components. All three burnout components are significantly related to work satisfaction (p < .05): negatively for emotional exhaustion and depersonalization, and positively for personal accomplishment.

In order to ascertain mediation, we need to perform Sobel tests. The bda package contains the necessary function, called mediation.test(). Let's see whether the effect of exhaustion on commitment is mediated by work satisfaction:

install.packages("bda"); library(bda)
mediation.test(nurses$WorkSat,nurses$Exhaus,nurses$Commit)

In the following output, under Sobel, we can see that the p.value is significant. Since the presence of work satisfaction in the model decreases the effect of exhaustion, since work satisfaction remains significant even though exhaustion is present in the model, and since the Sobel test is significant, we can confirm that there is indeed a partial mediation of the effect of exhaustion on commitment by work satisfaction. In other words, exhaustion decreases work satisfaction, and in turn, work satisfaction increases commitment.

[Output: Sobel test results from mediation.test()]

The value resulting from the Sobel test follows a z distribution. In order to obtain this value, the slope coefficient from regressing the mediator on the predictor (a) is multiplied by the slope coefficient from regressing the criterion on the mediator (b). This product is then divided by the square root of: b squared multiplied by the squared standard error of a, plus a squared multiplied by the squared standard error of b. The formula is as follows:

z = (a × b) / √(b² × SE(a)² + a² × SE(b)²)
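To make the formula concrete, here is a minimal sketch that computes the Sobel z by hand for our mediation, assuming, as is standard, that a comes from regressing the mediator on the predictor, and b from regressing the criterion on the mediator while controlling for the predictor:

# a: effect of the predictor (Exhaus) on the mediator (WorkSat)
fit_a = lm(WorkSat ~ Exhaus, data = nurses)
# b: effect of the mediator on the criterion (Commit), controlling for the predictor
fit_b = lm(Commit ~ WorkSat + Exhaus, data = nurses)
a = coef(fit_a)["Exhaus"]
sa = coef(summary(fit_a))["Exhaus", "Std. Error"]
b = coef(fit_b)["WorkSat"]
sb = coef(summary(fit_b))["WorkSat", "Std. Error"]
z = (a * b) / sqrt(b^2 * sa^2 + a^2 * sb^2)
unname(z)            # Sobel z; should match the mediation.test() output
2 * pnorm(-abs(z))   # two-tailed p-value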

Showing this is important, as very often, analysts include dozens or hundreds of predictors in their models without taking into consideration that the included predictors could themselves be related to each other. Readers are therefore advised to check for meaningful relationships between the attributes they intend to include as predictors in regression analyses before drawing conclusions on the final model!
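A quick way to perform such a check is to inspect the intercorrelations of the intended predictors before modeling; a minimal sketch on our data:

# correlations among the predictors only (dropping the criterion in column 1)
round(cor(nurses[, -1]), 2)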

Predicting new data

A particularly interesting use of regression is to examine how well a model predicts new data. This is easily achieved in R. We will first build a dataset named nurses2 in the same way we did for the first one (make sure the matcov2.txt file is in your working directory):

# read the second covariance matrix and generate 40 new cases from it
matcov2 = unlist(read.csv("matcov2.txt", header = F))
covs2 = matrix(matcov2, 6, 6)
means2 = c(4.279, 13.152, 5.156, 39.28, 5.153, 1.875)
set.seed(987)
nurses2 = data.frame(mvrnorm(n = 40, means2, covs2))
colnames(nurses2) = c("Commit", "Exhaus", "Depers", "Accompl",
   "WorkSat", "WFC")

To predict the new data, we rely on the predict.lm() function:

predicted = predict.lm(model1, nurses2)

This assigns the vector of predicted values for the criterion to an object, which we call predicted.

As we also have the real values for the commitment of individuals to the hospital in the second dataset, we can examine the correlation between the two:

cor.test(predicted, nurses2$Commit)

The following output shows that the correlation between the predicted values and the real values is 0.5766194. This value is significant and might seem pretty good at first sight:

        Pearson's product-moment correlation

data:  predicted and nurses2$Commit
t = 4.3506, df = 38, p-value = 9.848e-05
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.3231561 0.7528925
sample estimates:
      cor 
0.5766194 

Let's square this value to know how much of the variance in the commitment of the individuals of the second sample is predicted by the model:

0.5766194^2 * 100

The output is 33.24899. This means only 33 percent of the variance in commitment is predicted by the model, compared to 55 percent in the training data!
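To visualize this shrinkage, we can plot the predicted values against the observed ones (a quick sketch; a perfect model would place all points on a straight line):

plot(predicted, nurses2$Commit,
   xlab = "Predicted commitment", ylab = "Observed commitment")
abline(lm(nurses2$Commit ~ predicted))   # best-fit line through the cloud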

Now, we can also compute the residuals:

residuals_test = nurses2$Commit - predicted

We are now able to compute the F value for our model. We have seen that the F value is used to assess the overall significance of the model. In our case, the F value is obtained as follows:

  1. First, we need to know the number of degrees of freedom for the model; this is equal to the number of predictors we have, which is 3. We also need the degrees of freedom for the error; this is the number of observations minus the degrees of freedom of the model, minus 1.
  2. We then compute the sum of squares for the model as the sum of squared differences between the predicted values and the mean of the criterion. The sum of squares for the error is obtained as the sum of the squared differences between the observed and the predicted values.
  3. We then compute the mean squares for the model as the sum of squares for the model divided by the degrees of freedom for the model. We compute the mean squares for the error as the sum of squares for the error divided by the degrees of freedom for the error.
  4. Finally, we obtain the F-statistic by dividing the mean squares for the model by the mean squares for the error.

The following function does just that:

ComputeF = function(predicted, observed, npred) {
   DFModel = npred                            # degrees of freedom of the model
   DFError = length(observed) - DFModel - 1   # degrees of freedom of the error
   SSModel = sum((predicted - mean(observed))^2)  # sum of squares for the model
   SSError = sum((observed - predicted)^2)        # sum of squares for the error
   MSModel = SSModel / DFModel    # mean squares for the model
   MSError = SSError / DFError    # mean squares for the error
   F = MSModel / MSError
   F
}

Let's first try it with the original model to check whether the function works fine. We pass it the fitted (predicted) values, which fitted() extracts from the model, along with the observed values and the number of predictors. The output is 14.67, which is the same as the one outputted when we requested the summary of model1:

ComputeF(fitted(model1), nurses$Commit, 3)

Now let's try to see if the new data is predicted well enough by the model:

ComputeF(predicted, nurses2$Commit, 3)

The outputted F value is 10.4842.

We can compare this value to the critical value of the F distribution using the following line of code. The output shows that the critical F value at the 0.05 level, for 3 and 36 degrees of freedom, is 2.866266:

qf(.95, df1=3, df2=36)

As the F value of 10.4842 is well above this threshold, we can trust that our model significantly predicts new data.
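Equivalently, we could compute the p-value associated with our F value directly; a quick sketch:

pf(10.4842, df1 = 3, df2 = 36, lower.tail = FALSE)   # probability of an F this large under the null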
