Tricks for lm

There are a couple of ways to perform the same task. Sometimes in this book I may do it the more efficient way, and sometimes not. There are mainly two reasons that will keep me from choosing the more efficient route:

  1. I want to show a more step-by-step approach so that newcomers won't feel discouraged
  2. I am not that smart

There is always room for improvement, and I encourage the reader to look for it every single time. It's okay if you don't succeed, or if you find people with more skills than you. Next, you can see the code that I would have written if it weren't for reason #1:

# set.seed(5)                          # run earlier; these two lines
# n <- sample(dim(dt)[1], size = 40)   # produced the test-row indices, n
reg <- lm(infantMortality ~ log(ppgdp), data = dt[-n,])  # fit on calibration rows only
out <- predict(reg, newdata = dt[n,])                    # predict the held-out rows

Instead of creating a whole new variable in the dataset, the preceding code applies the transformation directly inside the formula passed to lm(). This direct transformation also works on the dependent variable. Another trick is to use brackets to select the calibration and test/validation sets; this way, we don't duplicate the dataset in order to split it.
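As a quick illustration of both tricks, here is a minimal sketch, assuming dt and n from the previous block are still in your workspace, that transforms the dependent variable too and then scores the predictions on the held-out rows:

# Transform the response inside the formula as well
reg_log <- lm(log(infantMortality) ~ log(ppgdp), data = dt[-n,])

# Predictions come back on the log scale; exp() undoes the transformation
pred <- exp(predict(reg_log, newdata = dt[n,]))

# Root mean squared error on the test/validation rows
sqrt(mean((dt$infantMortality[n] - pred)^2))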

You may have noticed already that the lm() function fits an intercept by default. In order to prevent lm() from doing this, add a 0 (zero) to the right-hand side of the formula; for example, lm(infantMortality ~ log(ppgdp) + 0, data = dt[-n,]) would do it.
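As a quick check, the following sketch (again assuming dt and n are available) shows two equivalent pieces of formula syntax for dropping the intercept, adding 0 or subtracting 1:

# Two equivalent ways to drop the intercept
no_int1 <- lm(infantMortality ~ log(ppgdp) + 0, data = dt[-n,])
no_int2 <- lm(infantMortality ~ log(ppgdp) - 1, data = dt[-n,])

coef(no_int1)  # only the slope remains; no (Intercept) term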

Here, we used a single variable to explain infant mortality; we could have used many more if only we had them. This kind of regression, which uses two or more independent variables, is called multiple linear regression. Sorry to spoil it for you, but running a multiple linear regression with R is extremely easy.

The only thing you need to do is add more variable names to the right-hand side of your formula. Imagine that we had far more variables in our original dataset; these would be something like life.expectance, gov.expe, and literacy.rate. We could simply do this:

# A formula can be stored in a variable and handed to lm() later
expr <- infantMortality ~ log(ppgdp) + life.expectance + gov.expe + literacy.rate
m_res <- lm(expr, data = dt[-n,])

Don't try to run this at your end. We don't actually have these variables, so the regression won't work. This rather hypothetical example is designed to show how multiple linear regression is done using the lm() function.
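If you would like something that actually runs, here is a minimal sketch of the same pattern using R's built-in mtcars dataset; the formula below (mpg explained by wt, hp, and qsec) is my illustrative choice, not part of our running example:

# A runnable analogue of the hypothetical formula, on built-in data
expr2 <- mpg ~ wt + hp + qsec      # store the formula in a variable
m_cars <- lm(expr2, data = mtcars) # fit the multiple linear regression
summary(m_cars)                    # coefficients, R-squared, F-statistic

Before closing this section, I want to show you a couple of functions that are very useful when it comes to linear regression: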

  • glm(): It's about the same as lm() but fits generalized linear models instead. The generalized part handles response variables that ordinary least squares serves poorly, such as binary or count outcomes, through a choice of error distribution and link function; this also relaxes the constant-variance (homoscedasticity) assumption of ordinary linear regression (see the sketch after this list).
  • anova(): This sole function, when given a fitted linear model (or generalized linear model), will produce the popular analysis of variance, also known as the ANOVA table (also shown in the sketch below).
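Here is a minimal sketch of both functions in action, again on the built-in mtcars dataset; the particular models (am explained by wt and hp, mpg explained by wt and hp) are illustrative assumptions of mine:

# glm(): with family = binomial, the linear model becomes a logistic
# regression, which classifies the 0/1 transmission variable, am
log_reg <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(log_reg)

# anova(): feed it a fitted (generalized) linear model to get the table
fit <- lm(mpg ~ wt + hp, data = mtcars)
anova(fit)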

Linear regression is a very wide topic, and still it can be seen as the building block for several other models. The current section gave more attention to how to do it with R. Further study of the topic could help both to prevent big flaws in your analysis and to reach for more complex techniques. Sticking with this argument, here is a list of topics I'd like the reader to consider:

  • Logistic regression: This is a model that can be used to handle classification problems.
  • Experimental statistics: It all comes down to experimental statistics: how tests are designed and carried out, what they mean (how to interpret them), and how to collect data.
  • Problems related to linear regression: There are several problems that could ruin a linear regression; the list ranges from heteroscedasticity to misspecification. Some of them are bad for forecasting, others could disturb relationship analysis, and a few could hurt both or hardly matter at all. Above all, be careful about fitting on only a few observations and avoid nonsense relations.
  • ARIMA and GARCH models: Both are models designed to work with time series. The latter is very useful when it comes to analyzing variance (volatility) or handling things such as high-frequency trading.

This section focused on the practical aspects of linear regression using R. It's a rather traditional statistical method used to relate two or more variables, not a younger machine learning model. Yet it's very useful, and its simplicity makes it a feasible option for simpler problems, given that the results can be easily interpreted and don't require much work (or time) to craft.

While doing data science, stay loyal to the force continuum: bigger guns for big problems, smaller guns for small problems.

Moving forward to the next section, we will be exploring tree-based models, which are a good choice for slightly more complicated problems.
