Linear regression is a traditional statistical model. Regressions are meant to show how two or more variables are related and/or to make predictions. Taking their limitations into account, regressions may answer questions such as: how does the National Product respond to government expenditure in the short run? Or, what revenue should be expected for next year?
Of course, there are drawbacks. An obvious one is that linear regression only captures linear relations. Plotting the variables beforehand may give you hints about linearity; sometimes, a data transformation can turn things around. Also note that a relation does not necessarily imply causation.
A strong relation (correlation) could also result from coincidence or from a spurious relation (also known as a third factor or common cause). The latter does not halt your regression as long as your intention is only to draw forecasts; nonetheless, true causation is always better.
Coincidences are hardly of any use. To check whether you have a coincidence on your hands, try to predict values from a dataset not yet used to fit the regression. The better the results you get from validation and test datasets, the lower the chance of a coincidence; (sample) size matters.
Now, let's get practical. Data will come from the car package, so first things first, let's make sure that car is already installed:
if (!require(car)) { install.packages('car') }
The dataset to be used is car::UN. To get a glimpse of the dataset, try the following code:
library(car)
dim(UN)
# [1] 207 7
head(UN)
The last command outputs a small 6 x 7 table. Rows are named after countries. For the upcoming analysis, only two columns will be used: column 7, infant mortality (deaths per 1,000 live births), and column 4, per capita gross domestic product (GDP) in US dollars. As you can see, there are some NA (not available) values. It is better to filter the data first so we can see the dimensions of the reliable data alone:
dt <- UN[!is.na(UN[,4])
         & !is.na(UN[,7]), c(7,4)]
dim(dt)
# [1] 193 2
The filtered dataset is stored in the dt object. !is.na() was used to keep only available values; the & operator ensured that only observations simultaneously available for columns 4 and 7 were retained. We ended up with 193 of the original 207 observations. It would be reckless not to check whether the relation looks linear. A simple plot is very useful:
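As an aside, base R ships a helper that expresses the same filter more compactly; a sketch assuming car is loaded and that columns 7 and 4 hold infant mortality and GDP per capita, as above:

```r
library(car)

# keep only rows where both columns of interest are non-NA
dt <- UN[complete.cases(UN[, c(7, 4)]), c(7, 4)]
dim(dt)   # should match the dimensions reported above
```

complete.cases() returns TRUE for rows with no missing values across the chosen columns, so it scales better than chaining !is.na() terms when more columns are involved.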
plot(y = dt[,1], x = dt[,2],
     ylab = 'infant deaths/1,000 live births',
     xlab = 'GDP per capita (US dollars)')
As a result, we got the following diagram:
Using base R plotting, we crafted a visualization that is good enough for preliminary analysis, though not so suitable for publication. Based on Figure 6.1, you could hardly say that the two variables hold a linear relation. As mentioned before, transformations can be useful, so the next thing to try is a data transformation.
Transformations must at least be meaningful. Economists are used to applying the logarithmic transformation to GDP series; for instance, differences of log GDP approximate growth rates:
plot(y = dt[,1], x = log(dt[,2]),
     ylab = 'infant deaths/1,000 live births',
     xlab = 'log of GDP per capita')
The result can be seen in the following diagram:
The latter diagram shows a much more linear relation than the former. To make a point, try applying the log() function four times in a row. I assure you that you will end up with a much more linear, and pointless, relation:
# four logs in a row
plot(y = dt[,1], x = log(log(log(log(dt[,2])))),
     ylab = 'infant deaths/1,000 live births',
     xlab = 'hardly meaningful variable')
The same applies to differencing. Since our observations are single snapshots from several different countries, simply running diff() would be meaningless and pointless. After settling on a helpful transformation, let's store the transformed variable before dealing with data partitioning:
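For contrast, diff() is meaningful when observations are ordered in time; a toy sketch with a hypothetical GDP series (not the UN data):

```r
# hypothetical annual GDP levels for a single country
gdp_series <- c(100, 104, 110, 115)

# log-differences approximate period-over-period growth rates
growth <- diff(log(gdp_series))
round(growth, 3)
# [1] 0.039 0.056 0.044
```

Each element is log(GDP[t]) - log(GDP[t-1]), which is close to the percentage growth for small changes; this only makes sense because consecutive entries belong to the same series.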
dt$log_gdp <- log(dt$ppgdp)
A new variable called log_gdp was created and stored in the dt data frame. The sample() function can be used to split it into estimation and test datasets:
set.seed(5)
n <- sample(dim(dt)[1], size = 40)
dt_est <- dt[-n,]
dt_tst <- dt[n,]
Our sampling method is quite simple; it relies on pseudo-random numbers to split the original observations between estimation and test datasets. The sample() function draws 40 numbers from 1 to 193, without replacement. Setting the seed is necessary if you want exactly the same results as mine. All 40 numbers are stored in an object named n.
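The role of the seed can be checked directly; a quick sketch:

```r
set.seed(5)
a <- sample(193, size = 40)   # first draw of 40 indices from 1:193

set.seed(5)
b <- sample(193, size = 40)   # resetting the seed repeats the draw

identical(a, b)   # TRUE: the same seed reproduces the same sample
```

Without the set.seed() calls, each run would produce a different split, and your coefficients and error measures would differ slightly from the ones shown here.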
The last couple of commands ask for every row of the dt data frame except the ones indexed by n, which gives our estimation dataset (dt_est). The test dataset (dt_tst) holds only the rows indexed by n. We can finally run our regression using lm():
reg <- lm(infantMortality ~ log_gdp, data = dt_est)
We regressed infantMortality (Y, the dependent variable) against log_gdp (X, the independent variable).
It means that we are trying to explain (or forecast) infant mortality using the log of GDP per capita. Our equation can be expressed as Y = a + bX + e, where a and b are estimated parameters and e is the error (also referred to as the residual). To see the parameters a (intercept) and b (slope), we can simply call reg in the console, or we can get more detailed information by calling summary(reg). The following is what the latter would output:
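For intuition about a and b, the ordinary least squares estimates can be computed by hand, since for a single regressor b = cov(X, Y)/var(X) and a = mean(Y) - b*mean(X); a toy sketch on made-up numbers (not the UN data):

```r
# made-up toy data
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

b <- cov(x, y) / var(x)      # slope estimate
a <- mean(y) - b * mean(x)   # intercept estimate

fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(a, b))   # TRUE: lm() matches the formulas
```

lm() does the same job through a more general (and numerically safer) matrix computation, but the two-variable case reduces to exactly these formulas.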
Call:
lm(formula = infantMortality ~ log_gdp, data = dt_est)

Residuals:
    Min      1Q  Median      3Q     Max
-53.761 -17.312   0.001  11.368 118.285

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  171.135     10.001   17.11   <2e-16 ***
log_gdp      -17.025      1.294  -13.15   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 26.35 on 151 degrees of freedom
Multiple R-squared:  0.534,    Adjusted R-squared:  0.5309
F-statistic: 173 on 1 and 151 DF,  p-value: < 2.2e-16
You can see test statistics (t and p values) for the individual parameters and for the parameters as a group; R-squared is another traditional measure. Yet another very popular error measure is the mean squared error (MSE). It is very simple to calculate; in fact, the following shows two ways of doing it:
mean(residuals(reg)^2)
mean(reg$residuals^2)
# both output the same number
# [1] 685.1862
Judging from all the results we've seen so far, our regression doesn't look great. The individual parameters and the regression as a whole are statistically significant; that is, hypothesis tests reject the null hypotheses that each parameter equals zero and that all parameters jointly equal zero (F-test) at less than the 1% significance level. That was the good part; the bad part is the R-squared, which is low when a greater value is desired. As for the MSE, we always seek lower values.
A brief aside may be opened here. Good metrics are relative; they depend on the other available options and on your goals as well. R-squared and statistical significance are important if you are after causation. If your goal is mere prediction and you do not intend to base policy on the model, out-of-sample performance may matter even more than statistical significance and/or R-squared. To get the out-of-sample MSE using our test data, we can rely on the following:
out <- predict(reg, newdata = dt_tst)
mean((dt_tst[,1] - out)^2)
# [1] 888.1686
The preceding code makes a prediction using the estimated regression (reg) and data not used to calibrate the model (dt_tst). Next, the predicted values are subtracted from the actual observed values, (dt_tst[,1] - out), which yields the residuals. The squared residuals are passed to the mean() function, giving us the out-of-sample MSE.
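A small helper function makes the in-sample versus out-of-sample comparison explicit; a sketch assuming the reg, dt_est, and dt_tst objects from above (the mse() name is our own, not a base R function):

```r
# hypothetical helper: mean squared error of predictions
mse <- function(actual, predicted) mean((actual - predicted)^2)

mse(dt_est$infantMortality, fitted(reg))                      # in-sample MSE
mse(dt_tst$infantMortality, predict(reg, newdata = dt_tst))   # out-of-sample MSE
```

Wrapping the formula in a function avoids retyping it and makes it harder to accidentally square before subtracting.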
As expected, our out-of-sample MSE is bigger than the in-sample MSE. Set a goal of narrowing the gap between the in-sample and out-of-sample MSE, and always be suspicious if the out-of-sample error shows a better fit; you may have picked too small a sample for the job.
We could have done things a little differently and skipped some intermediate objects. The code would become a little less readable, but also much shorter. Additionally, with a much bigger dataset, we would see non-negligible efficiency gains. The next section gives more details about this.