Application - regression and curve fitting

Since we are talking about applications of linear algebra, let's draw on real-world cases, beginning with linear regression. Suppose we are curious about the relationship between a person's age and their sleep quality. We'll use data available online from the Great British Sleep Survey 2012 (https://www.sleepio.com/2012report/).

There were 20,814 people who took the survey, ranging in age from under 20 to over 60 years old, and they rated their sleep quality with scores from 4 to 6.

In this exercise, we will use a population of just 100 and simulate ages and sleep scores that follow the same distribution as the survey results. We want to know: as people's age increases, does their sleep quality (score) increase or decrease? As you may have guessed, this is a hidden linear regression exercise. Once we draw the regression line of sleep score against age, the slope of the line will give us the answer.

But before we talk about which NumPy function to use and how to use it, let's create the dataset. From the survey, we know that 7% of participants were under 20, 24% were between 21 and 30, 21% between 31 and 40, 19% between 41 and 50, 17% between 51 and 60, and 12% over 60. So we first create a groups list to represent the number of people in each age group and use numpy.random.randint() to simulate realistic ages across our population of 100, giving us the age variable. We also know the mean sleep score for each age group, which we store in a list called scores: [5.5, 5.7, 5.4, 4.9, 4.6, 4.4], ordered from the youngest age group to the oldest. Here we use the np.random.rand() function to add a small amount of uniform noise (scaled by 0.01) around each group's mean score to simulate the score distribution (of course, if you have a good dataset of your own that you want to play with, it would be better to load it with the numpy.genfromtxt() function we introduced in the previous chapter):

In [83]: groups = [7, 24, 21, 19, 17, 12] 
 
In [84]: age = np.concatenate([np.random.randint((ind + 1)*10, (ind + 2)*10, group) for ind, group in enumerate(groups)]) 
 
In [85]: age 
Out[85]: 
array( 
[11, 15, 12, 17, 17, 18, 12, 26, 29, 24, 28, 25, 27, 25, 26, 24, 23,  27, 26, 24, 27, 20, 28, 20, 22, 21, 23, 25, 27, 24, 25, 35, 39, 33, 35, 30, 32, 32, 36, 38, 31, 35, 38, 31, 37, 36, 39, 30, 36, 33, 36, 37, 45, 41, 44, 48, 45, 40, 44, 42, 47, 46, 47, 42, 42, 42, 44, 40, 40, 47, 47, 57, 56, 53, 53, 57, 54, 55, 53, 52, 54, 57, 53, 58, 58, 54, 57, 55, 64, 67, 60, 63, 68, 65, 66, 63, 67, 64, 68, 66] 
) 
In [86]: scores = [5.5, 5.7, 5.4, 4.9, 4.6, 4.4] 
In [87]: sim_scores = np.concatenate([.01 * np.random.rand(group) + scores[ind] for ind, group in enumerate(groups)] ) 
 
In [88]: sim_scores 
Out[88]: 
array([ 
5.5089,  5.5015,  5.5024,  5.5   ,  5.5033,  5.5019,  5.5012, 
5.7068,  5.703 ,  5.702 ,  5.7002,  5.7084,  5.7004,  5.7036, 
5.7055,  5.7024,  5.7099,  5.7009,  5.7013,  5.7093,  5.7076, 
5.7029,  5.702 ,  5.7067,  5.7007,  5.7004,  5.7   ,  5.7017, 
5.702 ,  5.7031,  5.7087,  5.4079,  5.4082,  5.4083,  5.4025, 
5.4008,  5.4069,  5.402 ,  5.4071,  5.4059,  5.4037,  5.4004, 
5.4024,  5.4058,  5.403 ,  5.4041,  5.4075,  5.4062,  5.4014, 
5.4089,  5.4003,  5.4058,  4.909 ,  4.9062,  4.9097,  4.9014, 
4.9097,  4.9023,  4.9   ,  4.9002,  4.903 ,  4.9062,  4.9026, 
4.9094,  4.9099,  4.9071,  4.9058,  4.9067,  4.9005,  4.9016, 
4.9093,  4.6041,  4.6031,  4.6016,  4.6021,  4.6079,  4.6046, 
4.6055,  4.609 ,  4.6052,  4.6005,  4.6017,  4.6091,  4.6073, 
4.6029,  4.6012,  4.6062,  4.6098,  4.4014,  4.4043,  4.4013, 
4.4091,  4.4087,  4.4087,  4.4027,  4.4017,  4.4067,  4.4003, 
4.4021,  4.4061]) 

Now we have the age and sleep scores, each variable with 100 observations. Next, we will calculate the regression line y = mx + c, where y represents sleeping_score and x represents age. The NumPy function for the least-squares regression line is numpy.linalg.lstsq(), and it takes the coefficient matrix and the dependent variable values as inputs. So the first thing we need to do is pack the age variable into a coefficient matrix, which we call AGE:

In [89]: AGE = np.vstack([age, np.ones(len(age))]).T
In [90]: m, c = np.linalg.lstsq(AGE, sim_scores)[0]
In [91]: m
Out[91]: -0.029435313781
In [92]: c
Out[92]: 6.30307651938

Now we have the slope m and the intercept c. Our regression line is y = -0.0294x + 6.3031, which shows that, as people grow older, there is a slight decrease in their sleep scores/quality, as you can see in the following graph:
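If you want to reproduce the graph yourself, here is a minimal matplotlib sketch; the random seed, the noise scale, and the styling choices are our own, and rcond=None simply silences a deprecation warning in recent NumPy versions:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # render off-screen so the script runs anywhere
import matplotlib.pyplot as plt

np.random.seed(0)              # illustrative seed for reproducibility
groups = [7, 24, 21, 19, 17, 12]
scores = [5.5, 5.7, 5.4, 4.9, 4.6, 4.4]

# Rebuild the simulated dataset exactly as in the session above
age = np.concatenate([np.random.randint((i + 1) * 10, (i + 2) * 10, g)
                      for i, g in enumerate(groups)])
sim_scores = np.concatenate([.01 * np.random.rand(g) + scores[i]
                             for i, g in enumerate(groups)])

# Least-squares fit: slope m and intercept c
AGE = np.vstack([age, np.ones(len(age))]).T
m, c = np.linalg.lstsq(AGE, sim_scores, rcond=None)[0]

plt.scatter(age, sim_scores, label="simulated scores")
plt.plot(age, m * age + c, "r", label="regression line")
plt.xlabel("age")
plt.ylabel("sleeping score")
plt.legend()
plt.savefig("sleep_regression.png")
```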

[Graph: scatter plot of age versus sleeping score with the fitted regression line]

You may think that the regression line equation looks familiar. Remember the first linear equation we solved in the matrix section? Yes, you can also use numpy.linalg.lstsq() to solve the Ax = b equation; in fact, it is the fourth way to solve it in this chapter. Try it by yourself; the usage is very similar to numpy.linalg.solve().
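As a quick sketch of that exercise (the matrix and vector here are illustrative, not the ones from the earlier section), lstsq() on a square, non-singular system returns the same answer as solve():

```python
import numpy as np

# A small illustrative system A x = b:
#   3x +  y = 9
#    x + 2y = 8
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([9.0, 8.0])

# numpy.linalg.solve() requires a square, non-singular matrix
x_solve = np.linalg.solve(A, b)

# numpy.linalg.lstsq() returns (solution, residuals, rank, singular values);
# rcond=None silences a deprecation warning in recent NumPy versions
x_lstsq = np.linalg.lstsq(A, b, rcond=None)[0]

print(x_solve)   # [2. 3.]
print(x_lstsq)   # [2. 3.]
```

The advantage of lstsq() is that it also works when A is not square, such as our tall 100x2 AGE matrix, where it returns the least-squares solution instead.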

However, not every question can be answered by drawing a straight regression line; consider house prices by year. That is clearly not a linear relationship, and is more likely quadratic or cubic. So how do we solve such a problem? Let's use the statistical data from the House Price Indices (Office for National Statistics, http://ons.gov.uk/ons/taxonomy/index.html?nscl=House+Price+Indices#tab-data-tables) and pick the years 2004 to 2013. We have the average house price (in GBP) adjusted for inflation, and we want to estimate the average price for the next year.

Before we go for the solution, let's analyze the question first. Underneath, this is a polynomial curve-fitting problem: we want to find the best-fit polynomial for our data. But which NumPy function should we choose? Before answering that, let's create two variables: price, the price in each year, and year, the year index:

In [93]: year = np.arange(1,11) 
In [94]: price = np.array([129000, 133000, 138000, 144000, 142000, 141000, 150000, 135000, 134000, 139000]) 
In [95]: year 
Out[95]: array([ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10]) 

Now that we have the year and price data, let's assume their relationship is quadratic. Our goal is to find the polynomial y = ax² + bx + c that represents the relationship (a typical least-squares approach), where y represents the price in a given year. Here we will use numpy.polyfit() to help us find the coefficients of this polynomial:

In [97]: a, b, c = np.polyfit(year, price, 2) 
In [98]: a 
Out[98]: -549.242424242 
In [99]: b 
Out[99]: 6641.66666667 
In [100]: c 
Out[100]: 123116.666667 
In [101]: a*11**2 + b*11 + c 
Out[101]: 129716.66666666642 

We have all the coefficients for the polynomial from numpy.polyfit(), which takes three input parameters: the first is the independent variable, year; the second is the dependent variable, price; and the last is the degree of the polynomial, which in this case is 2. Now we just plug in year = 11 (the 11th year, counting from 2004) to calculate the estimated price. You can see the result in the following graph:
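The same prediction can be written more compactly with numpy.polyval(), which evaluates a polynomial (with coefficients ordered from the highest degree down, exactly as polyfit() returns them) at a given point:

```python
import numpy as np

year = np.arange(1, 11)
price = np.array([129000, 133000, 138000, 144000, 142000,
                  141000, 150000, 135000, 134000, 139000])

# polyfit() returns [a, b, c], highest degree first
coeffs = np.polyfit(year, price, 2)

# polyval() evaluates a*11**2 + b*11 + c for us
est = np.polyval(coeffs, 11)
print(est)   # about 129716.67, matching the manual calculation above
```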

[Graph: house prices by year with the fitted quadratic curve and the estimated price for year 11]

There are many more linear algebra applications that NumPy can handle, such as interpolation and extrapolation, but we can't cover them all in this chapter. We hope this chapter is a good starting point for using NumPy to solve linear and polynomial problems.
