How it works...

We declared two Scala arrays and parallelized them into two RDDs, one holding the x values and one holding the y values. We then used the zip() method from the RDD API to produce a paired (that is, zipped) RDD in which each member is an (x, y) pair. Finally, we calculated the mean, sum, and so on, and applied the closed form formula as described to find the intercept and slope of the regression line.
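The steps just described can be sketched as follows. This is a minimal illustration rather than the recipe's exact code: the object name ZipRegressionSketch, the fit() helper, the local master setting, and the four data points are all assumptions made for the example.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession

object ZipRegressionSketch {
  // Returns (intercept, slope) of the least-squares line y = intercept + slope * x.
  def fit(xRdd: RDD[Double], yRdd: RDD[Double]): (Double, Double) = {
    // zip() requires both RDDs to have the same number of partitions and
    // the same number of elements in each partition.
    val pairs: RDD[(Double, Double)] = xRdd.zip(yRdd)

    val xMean = xRdd.mean()
    val yMean = yRdd.mean()

    // Sum of cross-deviations and sum of squared x-deviations.
    val sxy = pairs.map { case (x, y) => (x - xMean) * (y - yMean) }.sum()
    val sxx = xRdd.map(x => (x - xMean) * (x - xMean)).sum()

    val slope = sxy / sxx
    val intercept = yMean - slope * xMean
    (intercept, slope)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ZipRegressionSketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Made-up points that lie on y = 1 + 2x (not the recipe's data).
    val x = Array(1.0, 2.0, 3.0, 4.0)
    val y = Array(3.0, 5.0, 7.0, 9.0)

    // Same length and same slice count, so zip() lines the elements up.
    val (intercept, slope) = fit(sc.parallelize(x, 2), sc.parallelize(y, 2))
    println(s"intercept=$intercept slope=$slope")

    spark.stop()
  }
}
```

Passing the same number of slices to both parallelize() calls keeps the partitioning of the two RDDs aligned, which zip() depends on.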

In Spark 2.0, the alternative is to use the GLM API out of the box. It is worth mentioning that GLM supports at most 4,096 features (parameters) when it solves the problem in closed form with the normal equations.
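For comparison, a GLM-based fit might look like the following sketch. It uses GeneralizedLinearRegression from spark.ml with a Gaussian family and identity link, which corresponds to ordinary linear regression. The object name GlmSketch, the fit() helper, and the sample points are assumptions made for illustration and do not come from the recipe.

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.regression.GeneralizedLinearRegression
import org.apache.spark.sql.SparkSession

object GlmSketch {
  // Fits y = intercept + slope * x with the GLM API; points are (x, y) pairs.
  def fit(spark: SparkSession, points: Seq[(Double, Double)]): (Double, Double) = {
    // spark.ml estimators expect a "label" column and a vector "features" column.
    val rows = points.map { case (x, y) => (y, Vectors.dense(x)) }
    val data = spark.createDataFrame(rows).toDF("label", "features")

    val glr = new GeneralizedLinearRegression()
      .setFamily("gaussian") // normally distributed errors
      .setLink("identity")   // no transformation of the mean
      .setRegParam(0.0)      // no regularization, to match plain least squares

    val model = glr.fit(data)
    (model.intercept, model.coefficients(0))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("GlmSketch")
      .getOrCreate()

    // Made-up points that lie on y = 1 + 2x (not the recipe's data).
    val points = Seq((1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0))
    val (intercept, slope) = fit(spark, points)
    println(s"intercept=$intercept slope=$slope")

    spark.stop()
  }
}
```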

We used a closed form formula to demonstrate that the regression line associated with a set of points (Y1, X1), ..., (Yn, Xn) is simply the line that minimizes the sum of squared errors. In simple regression, the line is described by the following:

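  • Slope of the regression line: $b = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$
  • Offset of the regression line: $a = \bar{Y} - b\,\bar{X}$
  • The equation for the regression line: $Y = a + bX$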
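As a brief sketch of where the closed form comes from, write the offset as a and the slope as b (this notation is an assumption made here). The sum of squared errors is minimized by setting both of its partial derivatives to zero and solving the two resulting equations:

```latex
\begin{aligned}
S(a, b) &= \sum_{i=1}^{n} (Y_i - a - b X_i)^2 \\
\frac{\partial S}{\partial a} &= -2 \sum_{i=1}^{n} (Y_i - a - b X_i) = 0 \\
\frac{\partial S}{\partial b} &= -2 \sum_{i=1}^{n} X_i \,(Y_i - a - b X_i) = 0 \\
\Rightarrow\quad b &= \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2},
\qquad a = \bar{Y} - b\,\bar{X}
\end{aligned}
```

The first condition says the residuals sum to zero, which is why the fitted line always passes through the point of means.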

Put differently, the regression line is the best-fit line. For a set of points (dependent variable, independent variable), many lines can pass through or near these points and capture the general linear relationship, but only one of them minimizes the sum of squared errors of the fit.
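A quick way to see this is to compare the sum of squared errors (SSE) of the fitted line against nearby lines. The sketch below uses plain Scala collections rather than RDDs to stay self-contained. The object name SseCheck, its helpers, and the sample points are hypothetical and are not taken from the recipe.

```scala
object SseCheck {
  // Sum of squared errors of the line y = intercept + slope * x over (x, y) points.
  def sse(points: Seq[(Double, Double)], intercept: Double, slope: Double): Double =
    points.map { case (x, y) =>
      val residual = y - (intercept + slope * x)
      residual * residual
    }.sum

  // Closed-form least-squares fit; returns (intercept, slope).
  def fit(points: Seq[(Double, Double)]): (Double, Double) = {
    val n = points.size
    val xMean = points.map(_._1).sum / n
    val yMean = points.map(_._2).sum / n
    val sxy = points.map { case (x, y) => (x - xMean) * (y - yMean) }.sum
    val sxx = points.map { case (x, _) => (x - xMean) * (x - xMean) }.sum
    val slope = sxy / sxx
    (yMean - slope * xMean, slope)
  }

  def main(args: Array[String]): Unit = {
    // Made-up (x, y) points with some scatter around a line.
    val points = Seq((1.0, 2.0), (2.0, 4.5), (3.0, 5.0), (4.0, 8.5))
    val (a, b) = fit(points)
    val best = sse(points, a, b)

    // Nudging the intercept and slope away from the fit raises the SSE.
    for (da <- Seq(-0.5, 0.5); db <- Seq(-0.2, 0.2)) {
      val other = sse(points, a + da, b + db)
      println(f"intercept=${a + da}%.3f slope=${b + db}%.3f sse=$other%.4f (fit sse=$best%.4f)")
    }
  }
}
```

Because the SSE is a convex quadratic function of the intercept and slope, every perturbed line in the loop has a larger error than the fitted one.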

For the example, we presented the line Y = 1.21 + 0.9153145 * X. The following figure shows this line, whose slope and offset we computed with the closed form formula. It represents our best linear model (slope = 0.9153145, intercept = 1.21) for the given data:

The data points plotted in the preceding figure are as follows:

(Y, X)
(5.0, 1.0) (8.0, 4.0) (10.0, 11.0) (15.0, 25.0) (21.0, 18.0) (27.0, 33.0) (30.0, 20.0) (38.0, 30.0) (45.0, 43.0) (50.0, 55.0) (64.0, 57.0)