Regression

Regression is similar to interpolation. In this case, we assume that the data is imprecise, and we require an object of predetermined structure to fit the data as closely as possible. The most basic example is univariate polynomial regression to a sequence of points. We obtain that with the polyfit command, which we discussed briefly in the Univariate polynomials section of this chapter. For instance, if we want to compute the least-squares regression line for a sequence of 10 uniformly spaced points x in the interval [0, 1] and their values under the function sin(πx/2), we will issue the following commands:

>>> import numpy
>>> import matplotlib.pyplot as plt
>>> x=numpy.linspace(0,1,10)
>>> y=numpy.sin(x*numpy.pi/2)
>>> line=numpy.polyfit(x,y,deg=1)
>>> plt.plot(x,y,'.',x,numpy.polyval(line,x),'r')
>>> plt.show()
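
The array line holds the coefficients of the fitted polynomial, highest degree first. As a quick check (added here for illustration, reusing the x, y, and line arrays from above), we can compute the residual sum of squares that polyfit minimizes; the smaller this value, the closer the fit:

>>> residuals = y - numpy.polyval(line, x)   # differences between data and fitted line
>>> print(numpy.sum(residuals**2))           # residual sum of squares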

This gives the following plot that shows linear regression with polyfit:

[Plot: linear regression with polyfit]

Curve fitting is also possible with splines if we use the parameters wisely. For example, in the case of the univariate spline fitting that we introduced before, we can play with the weights, the smoothing factor, the degree of the smoothing spline, and so on. If we want to fit a parabolic (degree-2) spline to the same data as in the previous example, we could issue the following commands:

>>> import numpy
>>> import scipy.interpolate
>>> import matplotlib.pyplot as plt
>>> x=numpy.linspace(0,1,10)
>>> y=numpy.sin(x*numpy.pi/2)
>>> spline=scipy.interpolate.UnivariateSpline(x,y,k=2)
>>> xn=numpy.linspace(0,1,100)
>>> plt.plot(x,y,'.', xn, spline(xn))
>>> plt.show()
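
UnivariateSpline also accepts a vector of weights (w) and a smoothing factor (s); with s=0 the spline interpolates the data exactly, while larger values produce a smoother curve that need not pass through the points. A minimal variation on the previous session, with illustrative parameter values chosen here rather than taken from the original text, could be:

>>> weights = numpy.ones_like(x)   # uniform weights for all data points
>>> spline_s = scipy.interpolate.UnivariateSpline(x, y, w=weights, k=2, s=0.001)
>>> plt.plot(x, y, '.', xn, spline_s(xn))
>>> plt.show()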

This gives the following graph that shows curve fitting with splines:

[Plot: curve fitting with a degree-2 smoothing spline]

For regression from the point of view of curve fitting, there is a generic routine: curve_fit in the scipy.optimize module. This routine minimizes the sum of squared residuals using the Levenberg-Marquardt algorithm and offers a best fit for functions of any kind (not only polynomials or splines). The syntax is simple:

curve_fit(f, xdata, ydata, p0=None, sigma=None, **kw)

The f parameter is a callable that represents the model function we seek; its first argument is the independent variable and the remaining arguments are the parameters to be fitted. The xdata and ydata parameters are arrays of the same length that contain the x and y coordinates of the points to be fitted. The sequence p0 holds an initial guess for the parameter values, and sigma is an optional vector of standard deviations of the data that, if given, is used to weight the least-squares problem.

We will show its usage with an example. We start by generating some points on a section of a sine wave with amplitude A=18, angular frequency w=3π, and phase h=0.5, and then corrupt the data in the array y with some small random noise:

>>> import numpy
>>> A=18; w=3*numpy.pi; h=0.5
>>> x=numpy.linspace(0,1,100); y=A*numpy.sin(w*x+h)
>>> y += 4*((0.5-numpy.random.rand(100))*numpy.exp(2*numpy.random.rand(100)**2))

We want to estimate the values of A, w, and h from the corrupted data, which technically means finding a best-fitting curve from the family of sine waves. We start by gathering the three parameters in a list, initialized to some guess values, for example, A = 20, w = 2π, and h = 1. We also construct a callable expression for the target function (target_function):

>>> import scipy.optimize
>>> p0 = [20, 2*numpy.pi, 1]
>>> target_function = lambda x,AA,ww,hh: AA*numpy.sin(ww*x+hh)

We feed these, together with the fitting data, to curve_fit in order to find the required values:

>>> pF,pVar = scipy.optimize.curve_fit(target_function, x, y, p0)

Printing pF from any of our runs should give an accurate estimate of the three requested values:

>>> print (pF)

The output for the preceding command is as follows:

[ 18.13799397   9.32232504   0.54808516]
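
The second output, pVar, is the estimated covariance matrix of the fitted parameters. As a small addition for illustration (reusing pVar from the call above), the square roots of its diagonal entries give approximate standard errors for A, w, and h:

>>> perr = numpy.sqrt(numpy.diag(pVar))   # approximate one-sigma errors for A, w, h
>>> print(perr)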

From pF, we see that A was estimated at about 18.14 (the true value being 18), w at about 9.32 (close to 3π ≈ 9.42), and h at about 0.55 (the true value being 0.5). The following plots show the original data (in blue, in the left-hand graph), the corrupted data (in red, in both graphs), and the computed sine wave (in black, in the right-hand graph):

[Plots: original data (blue, left), corrupted data (red, both), fitted sine wave (black, right)]

The full code that produces these plots is too long to be included here; it can be found in the IPython Notebook that accompanies this chapter (the intermediate plots it produces are not shown here).
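
As a rough illustration only (this is a minimal sketch, not the notebook code, and it reuses x, y, A, w, h, target_function, and pF from the session above), a similar pair of plots could be produced as follows:

>>> import matplotlib.pyplot as plt
>>> fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
>>> ax1.plot(x, A*numpy.sin(w*x+h), 'b', x, y, 'r.')         # original curve and corrupted data
>>> ax2.plot(x, y, 'r.', x, target_function(x, *pF), 'k')    # corrupted data and fitted sine wave
>>> plt.show()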
