Making predictions with regression algorithms

Since the dawn of time, human beings have tried to foresee the future. How many rainy days will there be in the next week? What will be the harvest for next season? These types of problems are regression problems. The goal is to predict the value of a continuous response variable. This is also a supervised learning task.

Regression analysis is a statistical process for studying the relationship between a set of independent variables (explanatory variables) and a dependent variable (response variable). Through this technique, it is possible to understand how the value of the response variable changes as the explanatory variables vary.

Consider a group of motorcycles about which some information has been collected: the number of years of use, the number of kilometers traveled in a year, and the number of falls. Through these techniques, we can find that, on average, as the number of kilometers traveled increases, the number of falls also increases, while as the years of usage increase (and with them the rider's experience), the number of falls tends to decrease.

A regression analysis can be conducted for dual purposes:

  • Explanatory: To understand and weigh the effects of the independent variable on the dependent variable according to a particular theoretical model

  • Predictive: To find a linear combination of the independent variables that predicts the value assumed by the dependent variable as accurately as possible

To find the relationship between the variables, we can choose to describe the behavior of the observations by means of a mathematical function that, by interpolating the data, represents their tendency and preserves their main information. The linear regression method consists precisely of identifying a line capable of representing the distribution of points in a two-dimensional plane. As is easy to imagine, if the points corresponding to the observations lie near the line, then the chosen model will describe the link between the variables effectively.

In theory, there are an infinite number of lines that may interpolate the observations. In practice, there is only one mathematical model that optimizes the representation of the data. In the case of a linear mathematical relationship, the observations of the variable y can be obtained by a linear function of observations of the variable x. For each observation, we will have:

y = α + β * x

In this formula, x is the explanatory variable and y is the response variable. Parameters α and β, which represent respectively the intercept with the y axis and the slope of the line, must be estimated based on the observations collected for the two variables included in the model. The following graph shows an example of a linear regression line.

The values of the ordinate can be predicted from those on the abscissa with good approximation; in fact, all the observations lie very close to the regression line.
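To make this concrete, here is a minimal sketch of estimating α and β by least squares with NumPy; the synthetic data, and the true intercept (1.5) and slope (2.0) used to generate it, are assumptions made purely for illustration:

    import numpy as np

    # Synthetic observations: y is a linear function of x plus noise
    # (the "true" intercept 1.5 and slope 2.0 are made up for this example)
    rng = np.random.default_rng(42)
    x = np.linspace(0, 10, 50)
    y = 1.5 + 2.0 * x + rng.normal(scale=1.0, size=x.size)

    # Fit y = alpha + beta * x by least squares;
    # np.polyfit returns coefficients from the highest degree down
    beta, alpha = np.polyfit(x, y, deg=1)

    print(f"intercept alpha = {alpha:.3f}")  # should be close to 1.5
    print(f"slope     beta  = {beta:.3f}")   # should be close to 2.0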

Of particular interest is the slope β, that is, the change in the mean response for each unit increment of the explanatory variable. What does a change in this coefficient imply? If the slope is positive, the regression line rises from left to right (as shown in the preceding graph); if the slope is negative, the line falls from left to right. When the slope is zero, the explanatory variable has no effect on the value of the response. However, it is not just the sign of β that establishes the strength of the relationship between the variables; its magnitude also matters. In the case of a positive slope, the mean response is higher when the explanatory variable is higher; in the case of a negative slope, the mean response is lower when the explanatory variable is higher.
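As a quick numeric check, with hypothetical fitted values for α and β, increasing the explanatory variable by one unit changes the predicted mean response by exactly β:

    alpha, beta = 1.5, 2.0  # hypothetical fitted intercept and slope

    # The predicted mean response at x and at x + 1 differs by exactly beta
    for x in (4.0, 5.0):
        print(x, alpha + beta * x)  # prints 9.5, then 11.5; the difference is beta = 2.0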

Regression techniques are the workhorses of machine learning. The example of simple linear regression that we have just seen is the simplest case treated by this class of algorithms; in fact, much more complex problems can be tackled with these techniques. This is because the name regression refers to a large family of machine learning algorithms.

The most popular regression algorithms are:

  • Ordinary least squares regression
  • Linear regression
  • Logistic regression
  • Stepwise regression
  • Multivariate adaptive regression splines

Each algorithm allows us to solve a specific class of problems; it is worth remembering that in any case, the goal is to predict the value of a continuous response variable.
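As a minimal sketch of the first algorithm on this list, ordinary least squares regression can be fit in a few lines with scikit-learn's LinearRegression (assuming scikit-learn is available; the toy data here is invented for illustration):

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Toy data: two explanatory variables and one continuous response
    X = np.array([[1.0, 2.0],
                  [2.0, 1.0],
                  [3.0, 4.0],
                  [4.0, 3.0],
                  [5.0, 5.0]])
    y = np.array([3.1, 3.9, 7.2, 8.8, 11.0])

    model = LinearRegression()  # solves the ordinary least squares problem
    model.fit(X, y)

    print("intercept:", model.intercept_)
    print("coefficients:", model.coef_)
    print("prediction for [6, 6]:", model.predict([[6.0, 6.0]]))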
