9.2 Fitting the Model: The Least Squares Approach

After the straight-line model has been hypothesized to relate the mean E(y) to the independent variable x, the next step is to collect data and to estimate the (unknown) population parameters, the y-intercept $β_{0}$ and the slope $β_{1}$.

Consider the experiment described in Example 9.1 (p. 503), where we want to determine the relationship between the percentage (x) of a certain drug in the bloodstream and the length of time (y) it takes to react to a stimulus. Data were collected for five subjects, and the results are shown in Table 9.1. (The number of measurements and the measurements themselves are unrealistically simple in order to avoid arithmetic confusion in this introductory example.) This set of data will be used to demonstrate the five-step procedure of regression modeling given in the previous section. In the current section, we hypothesize the deterministic component of the model and estimate its unknown parameters (steps 1 and 2). The model’s assumptions and the random-error component (step 3) are the subjects of Section 9.3, whereas Sections 9.4 and 9.5 assess the utility of the model (step 4). Finally, we use the model for prediction and estimation (step 5) in Section 9.6.

Table 9.1 Reaction Time versus Drug Percentage

Subject	Percent `x` of Drug	Reaction Time `y` (seconds)
1	1	1
2	2	1
3	3	2
4	4	2
5	5	4

Data Set: STIMULUS

Step 1 Hypothesize the deterministic component of the probabilistic model. As stated before, we will consider only straight-line models in this chapter. Thus, the complete model relating mean response time E(y) to drug percentage x is given by

$E (y) = β_{0} + β_{1} x$

Step 2 Use sample data to estimate unknown parameters in the model. This step is the subject of this section; namely, how can we best use the information in the sample of five observations in Table 9.1 to estimate the unknown y-intercept $β_{0}$ and slope $β_{1}$?

To determine whether a linear relationship between y and x is plausible, it is helpful to plot the sample data in a scatterplot (or scattergram). Recall (Section 2.8) that a scatterplot locates each data point on a graph, as shown in Figure 9.3 for the five data points of Table 9.1. Note that the scatterplot suggests a general tendency for y to increase as x increases. If you place a ruler on the scatterplot, you will see that a line may be drawn through three of the five points, as shown in Figure 9.4. To obtain the equation of this visually fitted line, note that the line intersects the y-axis at $y = -1$, so the y-intercept is $-1$. Also, y increases exactly one unit for every one-unit increase in x, indicating that the slope is $+1$. Therefore, the equation is

Visual straight line fitted to the data in Figure 9.3

$\tilde{y} = -1 + 1 (x) = -1 + x$

where $\tilde{y}$ is used to denote the y that is predicted from the visual model.

One way to decide quantitatively how well a straight line fits a set of data is to note the extent to which the data points deviate from the line. For example, to evaluate the model in Figure 9.4, we calculate the magnitude of the deviations (i.e., the differences between the observed and the predicted values of y). These deviations, or errors of prediction, are the vertical distances between observed and predicted values (see Figure 9.4).* The observed and predicted values of y, their differences, and their squared differences are shown in Table 9.2. Note that the sum of errors equals 0 and the sum of squares of the errors (SSE), which places a greater emphasis on large deviations of the points from the line, is equal to 2.

`x`	`y`	$\tilde{y} = -1 + x$	$(y - \tilde{y})$	${(y - \tilde{y})}^{2}$
1	1	0	$(1 - 0) = 1$	1
2	1	1	$(1 - 1) = 0$	0
3	2	2	$(2 - 2) = 0$	0
4	2	3	$(2 - 3) = -1$	1
5	4	4	$(4 - 4) = 0$	0
			Sum of errors $= 0$	Sum of squared errors (SSE) $= 2$
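The table's arithmetic can be reproduced with a few lines of Python. This is only an illustrative sketch: the variable and function names (`visual_line`, `errors`, `sse`) are our own and do not come from the text.

```python
# Data from Table 9.1: percent of drug (x) and reaction time in seconds (y).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]


def visual_line(x_value):
    """Visually fitted line from Figure 9.4: y-tilde = -1 + x."""
    return -1 + x_value


# Predicted values, errors of prediction, and squared errors, row by row.
predicted = [visual_line(xi) for xi in x]
errors = [yi - pi for yi, pi in zip(y, predicted)]
squared_errors = [e ** 2 for e in errors]

sum_of_errors = sum(errors)
sse = sum(squared_errors)

# Print one tab-separated row per data point, then the two column totals.
for row in zip(x, y, predicted, errors, squared_errors):
    print(*row, sep="\t")
print("Sum of errors:", sum_of_errors)
print("SSE:", sse)
```

The two totals printed at the end correspond to the last row of the table: a sum of errors of 0 and an SSE of 2.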

You can see by shifting the ruler around the graph that it is possible to find many lines for which the sum of errors is equal to 0, but it can be shown that there is one (and only one) line for which the SSE is a minimum. This line is called the least squares line, the regression line, or the least squares prediction equation. The methodology used to obtain that line is called the method of least squares.

Now Work Exercise 9.16a–d

To find the least squares prediction equation for a set of data, assume that we have a sample of n data points consisting of pairs of values of x and y, say, $(x_{1}, y_{1}), (x_{2}, y_{2}), \dots, (x_{n}, y_{n})$. For example, the $n = 5$ data points shown in Table 9.2 are (1, 1), (2, 1), (3, 2), (4, 2), and (5, 4). The fitted line, which we will calculate on the basis of the five data points, is written as

$\hat{y} = {\hat{β}}_{0} + {\hat{β}}_{1} x$

The “hats” indicate that the symbols below them are estimates: $\hat{y}$ (y-hat) is an estimator of the mean value of y, E(y), and is a predictor of some future value of y; and ${\hat{β}}_{0}$ and ${\hat{β}}_{1}$ are estimators of $β_{0}$ and $β_{1}$, respectively.

For a given data point, say the point $(x_{i}, y_{i})$, the observed value of y is $y_{i}$ and the predicted value of y would be obtained by substituting $x_{i}$ into the prediction equation:

${\hat{y}}_{i} = {\hat{β}}_{0} + {\hat{β}}_{1} x_{i}$

The deviation of the ith value of y from its predicted value is

$(y_{i} - {\hat{y}}_{i}) = [y_{i} - ({\hat{β}}_{0} + {\hat{β}}_{1} x_{i})]$

Then the sum of the squares of the deviations of the y-values about their predicted values for all the n data points is

$SSE = \sum {[y_{i} - ({\hat{β}}_{0} + {\hat{β}}_{1} x_{i})]}^{2}$

The quantities ${\hat{β}}_{0}$ and ${\hat{β}}_{1}$ that make the SSE a minimum are called the least squares estimates of the population parameters $β_{0}$ and $β_{1}$, and the prediction equation $\hat{y} = {\hat{β}}_{0} + {\hat{β}}_{1} x$ is called the least squares line.

The least squares line $\hat{y} = {\hat{β}}_{0} + {\hat{β}}_{1} x$ is the line that has the following two properties:

1. The sum of the errors equals 0; that is, the mean error of prediction $= 0$.
2. The sum of squared errors (SSE) is smaller than that for any other straight-line model.

The values of ${\hat{β}}_{0}$ and ${\hat{β}}_{1}$ that minimize the SSE are given by the formulas in the following box (proof omitted):*

Formulas for the Least Squares Estimates

Slope: ${\hat{β}}_{1} = \frac{{SS}_{x y}}{{SS}_{x x}}$

y-intercept: ${\hat{β}}_{0} = \overline{y} - {\hat{β}}_{1} \overline{x}$

where

$\begin{array}{rcl} {SS}_{x y} & = & \sum (x_{i} - \bar{x}) (y_{i} - \bar{y}) = \sum x_{i} y_{i} - \frac{(\sum x_{i}) (\sum y_{i})}{n} \\ {SS}_{x x} & = & \sum {(x_{i} - \bar{x})}^{2} = \sum x_{i}^{2} - \frac{{(\sum x_{i})}^{2}}{n} \\ n & = & \text{sample size} \end{array}$
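The formulas in the box translate directly into code. The following is a minimal sketch in plain Python using the shortcut forms of $SS_{xy}$ and $SS_{xx}$. The function name `least_squares_fit` and the small data set in the usage lines are hypothetical and chosen only for illustration.

```python
def least_squares_fit(x, y):
    """Return (b0, b1): the least squares y-intercept and slope."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")

    sum_x, sum_y = sum(x), sum(y)

    # Shortcut forms of SS_xy and SS_xx from the formula box.
    ss_xy = sum(xi * yi for xi, yi in zip(x, y)) - sum_x * sum_y / n
    ss_xx = sum(xi ** 2 for xi in x) - sum_x ** 2 / n
    if ss_xx == 0:
        raise ValueError("all x values are equal; the slope is undefined")

    b1 = ss_xy / ss_xx                 # slope = SS_xy / SS_xx
    b0 = sum_y / n - b1 * sum_x / n    # intercept = y-bar - slope * x-bar
    return b0, b1


# Hypothetical data that lie exactly on the line y = 1 + 2x.
b0, b1 = least_squares_fit([0, 1, 2, 3], [1, 3, 5, 7])
print(b0, b1)
```

Because the hypothetical points fall exactly on a line, the fit recovers that line's intercept (1) and slope (2). With real data the points scatter about the fitted line, as in the example that follows.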

STIMULUS Example 9.2 Applying the Method of Least Squares—Drug Reaction Data

Problem

Refer to Example 9.1 and the reaction data presented in Table 9.1. Consider the straight-line model $E (y) = β_{0} + β_{1} x$, where $y =$ reaction time (in seconds) and $x =$ percent of drug received.
1. Use the method of least squares to estimate the values of $β_{0}$ and $β_{1}$.
2. Predict the reaction time when $x = 2\%$.
3. Find the SSE for the analysis.
4. Give practical interpretations of ${\hat{β}}_{0}$ and ${\hat{β}}_{1}$.

Solution

We used a spreadsheet program (e.g., Excel) to make the preliminary computations for finding the least squares line. The Excel spreadsheet is shown in Figure 9.5. Using the values in the spreadsheet, we find:

$\begin{array}{rcl} \bar{x} & = & \frac{\sum x}{5} = \frac{15}{5} = 3 \\ \bar{y} & = & \frac{\sum y}{5} = \frac{10}{5} = 2 \end{array}$

Figure 9.5

Excel Spreadsheet with Calculations

$\begin{array}{rcl} {SS}_{x y} & = & \sum (x - \bar{x}) (y - \bar{y}) = \sum (x - 3) (y - 2) = 7 \\ {SS}_{x x} & = & \sum {(x - \bar{x})}^{2} = \sum {(x - 3)}^{2} = 10 \end{array}$

Then the slope of the least squares line is

${\hat{β}}_{1} = \frac{{SS}_{x y}}{{SS}_{x x}} = \frac{7}{10} = .7$

and the y-intercept is

${\hat{β}}_{0} = \bar{y} - {\hat{β}}_{1} \bar{x} = 2 - (.7) (3) = 2 - 2.1 = -.1$

The least squares line is thus

$\hat{y} = {\hat{β}}_{0} + {\hat{β}}_{1} x = -.1 + .7 x$

The graph of this line is shown in Figure 9.6.

Figure 9.6

The line $\hat{y} = -.1 + .7 x$ fitted to the data

The predicted value of y for a given value of x can be obtained by substituting into the formula for the least squares line. Thus, when $x = 2$, we predict y to be

$\hat{y} = -.1 + .7 x = -.1 + .7 (2) = 1.3$

We show how to find a prediction interval for y in Section 9.6.

The observed and predicted values of y, the deviations of the y values about their predicted values, and the squares of these deviations are shown in the Excel spreadsheet, Figure 9.7. Note that the sum of the squares of the deviations, SSE, is 1.10 and (as we would expect) this is less than the $SSE = 2.0$ obtained in Table 9.2 for the visually fitted line.

The estimated y-intercept, ${\hat{β}}_{0} = -.1$, appears to imply that the estimated mean reaction time is equal to $-.1$ second when the percent x of drug is equal to 0%. Since negative reaction times are not possible, this seems to make the model nonsensical. However, the model parameters should be interpreted only within the sampled range of the independent variable (in this case, for amounts of drug in the bloodstream between 1% and 5%). Thus, the y-intercept, which is by definition at $x = 0$ (0% drug), is not within the range of the sampled values of x and is not subject to meaningful interpretation.

Figure 9.7

Excel Spreadsheet Comparing Observed and Predicted Values

The slope of the least squares line, ${\hat{β}}_{1} = .7$, implies that for every unit increase in x, the mean value of y is estimated to increase by .7 unit. In terms of this example, for every 1% increase in the amount of drug in the bloodstream, the mean reaction time is estimated to increase by .7 second over the sampled range of drug amounts from 1% to 5%. Thus, the model does not imply that increasing the drug amount from 5% to 10% will result in an increase in mean reaction time of 3.5 seconds because the range of x in the sample does not extend to 10% $(x = 10)$. In fact, 10% might be such a high concentration that the drug would kill the subject! Be careful to interpret the estimated parameters only within the sampled range of x.
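The hand calculations in this solution can be checked with a short script. The sketch below uses plain Python and the definitional forms of $SS_{xy}$ and $SS_{xx}$. The helper name `sse` is our own, and the script stands in for the Excel spreadsheets of Figures 9.5 and 9.7 rather than reproducing them.

```python
# STIMULUS data from Table 9.1.
x = [1, 2, 3, 4, 5]   # percent of drug
y = [1, 1, 2, 2, 4]   # reaction time in seconds

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Definitional forms of SS_xy and SS_xx.
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = ss_xy / ss_xx         # estimated slope
b0 = y_bar - b1 * x_bar    # estimated y-intercept


def sse(intercept, slope):
    """Sum of squared errors of the line intercept + slope * x on the data."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


y_hat_at_2 = b0 + b1 * 2          # predicted reaction time at x = 2
sse_least_squares = sse(b0, b1)   # SSE for the least squares line
sse_visual = sse(-1, 1)           # SSE for the visual line y = -1 + x

print(f"slope = {b1:.1f}, intercept = {b0:.1f}")
print(f"predicted y at x = 2: {y_hat_at_2:.1f}")
print(f"SSE (least squares) = {sse_least_squares:.2f}")
print(f"SSE (visual line)   = {sse_visual:.2f}")
```

Up to floating-point rounding, the script reproduces the values found above: a slope of .7, a y-intercept of -.1, a predicted value of 1.3 at x = 2, and an SSE of 1.10 for the least squares line versus 2.0 for the visually fitted line.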

SAS printout for the time–drug regression

SPSS printout for the time–drug regression

Look Back

The calculations required to obtain ${\hat{β}}_{0}$, ${\hat{β}}_{1}$, and SSE in simple linear regression, although straightforward, can become rather tedious. Even with the use of a pocket calculator, the process is laborious and susceptible to error, especially when the sample size is large. Fortunately, the use of statistical computer software can significantly reduce the labor involved in regression calculations. The SAS, SPSS, and MINITAB outputs for the simple linear regression of the data in Table 9.1 are displayed in Figure 9.8a–c. The values of ${\hat{β}}_{0}$ and ${\hat{β}}_{1}$ are highlighted on the printouts. These values, ${\hat{β}}_{0} = -.1$ and ${\hat{β}}_{1} = .7$, agree exactly with our hand-calculated values. The value of $SSE = 1.10$ is also highlighted on the printouts.

MINITAB printout for the time–drug regression

Now Work Exercise 9.27

Interpreting the Estimates of $β_{0}$ and $β_{1}$ in Simple Linear Regression

y-intercept: ${\hat{β}}_{0}$ represents the predicted value of y when $x = 0$. (Caution: This value will not be meaningful if the value $x = 0$ is nonsensical or outside the range of the sample data.)

slope: ${\hat{β}}_{1}$ represents the increase (or decrease) in y for every 1-unit increase in x. (Caution: This interpretation is valid only for x-values within the range of the sample data.)

Even when the interpretations of the estimated parameters in a simple linear regression are meaningful, we need to remember that they are only estimates based on the sample. As such, their values will typically change in repeated sampling. How much confidence do we have that the estimated slope ${\hat{β}}_{1}$ accurately approximates the true slope $β_{1}$? Determining this requires statistical inference, in the form of confidence intervals and tests of hypotheses, which we address in Section 9.4.

To summarize, we defined the best-fitting straight line to be the line that minimizes the sum of squared errors around it, and we called it the least squares line. We should interpret the least squares line only within the sampled range of the independent variable. In subsequent sections, we show how to make statistical inferences about the model.

Statistics in Action Revisited

Estimating a Straight-Line Regression Model for the Dowsing Data

After conducting a series of experiments in a Munich barn, a group of German physicists concluded that dowsing (i.e., the ability to find underground water with a divining rod) “can be regarded as empirically proven.” This observation was based on the data collected on 3 (of the participating 500) dowsers who had particularly impressive results. All of these “best” dowsers (numbered 99, 18, and 108) performed the experiment multiple times, and the best test series (sequence of trials) for each of them was identified. These data, saved in the DOWSING file, are listed in Table SIA9.1.

Recall (p. 500) that for various hidden pipe locations, each dowser guessed where the pipe with running water was located. Let $x =$ dowser’s guess (in meters) and $y =$ pipe location (in meters) for each trial. One way to determine whether the “best” dowsers are effective is to fit the straight-line model $E (y) = β_{0} + β_{1} x$ to the data in Table SIA9.1.

A MINITAB scatterplot of the data is shown in Figure SIA9.1. The least squares line, obtained from the MINITAB regression printout shown in Figure SIA9.2, is also displayed on the scatterplot. Although the least squares line has a slight upward trend, the variation of the data points around the line is large. It does not appear that a dowser’s guess (x) will be a very good predictor of the actual pipe location (y). In fact, the estimated slope (obtained from Figure SIA9.2) is ${\hat{β}}_{1} = .31$. Thus, for every 1-meter increase in a dowser’s guess, we estimate that the actual pipe location will increase only .31 meter. In the Statistics in Action Revisited sections that follow, we will provide a measure of reliability for this inference and investigate the phenomenon of dowsing further.

Trial	Dowser Number	Pipe Location	Dowser’s Guess
1	99	4	4
2	99	5	87
3	99	30	95
4	99	35	74
5	99	36	78
6	99	58	65
7	99	40	39
8	99	70	75
9	99	74	32
10	99	98	100
11	18	7	10
12	18	38	40
13	18	40	30
14	18	49	47
15	18	75	9
16	18	82	95
17	108	5	52
18	108	18	16
19	108	33	37
20	108	45	40
21	108	38	66
22	108	50	58
23	108	52	74
24	108	63	65
25	108	72	60
26	108	95	49

Based on Enright, J. T. “Testing dowsing: The failure of the Munich experiments.” Skeptical Inquirer, Jan./Feb. 1999, p. 45 (Figure 6a).

Data Set: DOWSING

MINITAB simple linear regression for dowsing data
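The estimated slope can be checked directly from the 26 trials in the table. The sketch below is plain Python using the formulas for $SS_{xy}$ and $SS_{xx}$ from this section. The variable names are our own, and the script is an independent check rather than a reproduction of the MINITAB printout.

```python
# DOWSING data, trials 1-26 in table order (values in meters).
pipe_location = [4, 5, 30, 35, 36, 58, 40, 70, 74, 98,      # dowser 99
                 7, 38, 40, 49, 75, 82,                      # dowser 18
                 5, 18, 33, 45, 38, 50, 52, 63, 72, 95]      # dowser 108
guess = [4, 87, 95, 74, 78, 65, 39, 75, 32, 100,
         10, 40, 30, 47, 9, 95,
         52, 16, 37, 40, 66, 58, 74, 65, 60, 49]

x, y = guess, pipe_location   # x = dowser's guess, y = pipe location
n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
ss_xx = sum((xi - x_bar) ** 2 for xi in x)

b1 = ss_xy / ss_xx         # estimated slope
b0 = y_bar - b1 * x_bar    # estimated y-intercept

print(f"n = {n}")
print(f"estimated slope = {b1:.2f}")
print(f"estimated y-intercept = {b0:.2f}")
```

Rounded to two decimal places, the computed slope agrees with the value of .31 quoted above.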

Exercises 9.15–9.36

Understanding the Principles

9.15 In regression, what is an error of prediction?
9.16 Give two properties of the line estimated with the method of least squares.
9.17 True or False. The estimates of $β_{0}$ and $β_{1}$ should be interpreted only within the sampled range of the independent variable, x.

Learning the Mechanics

9.18 The accompanying table is used to make the preliminary computations for finding the least squares line for the given pairs of x and y values.

1. Complete the table.
2. Find ${SS}_{x y}$.
3. Find ${SS}_{x x}$.
4. Find ${\hat{β}}_{1}$.
5. Find $\overline{x}$ and $\overline{y}$.
6. Find ${\hat{β}}_{0}$.
7. Find the least squares line.

[Hint: Use the formulas in the box on p. 592.]

	$x_{i}$	$y_{i}$	$x_{i}^{2}$	$x_{i} y_{i}$
	7	2	—	—
	4	4	—	—
	6	2	—	—
	2	5	—	—
	1	7	—	—
	1	6	—	—
	3	5	—	—
Totals	$\sum x_{i} =$	$\sum y_{i} =$	$\sum x_{i}^{2} =$	$\sum x_{i} y_{i} =$

9.19 Refer to Exercise 9.18. After the least squares line has been obtained, the following table can be used (1) to compare the observed and the predicted values of y and (2) to compute SSE.

`x`	`y`	$\hat{y}$	$(y - \hat{y})$	${(y - \hat{y})}^{2}$
7	2	—	—	—
4	4	—	—	—
6	2	—	—	—
2	5	—	—	—
1	7	—	—	—
1	6	—	—	—
3	5	—	—	—
			$\sum (y - \hat{y}) =$	$SSE = \sum {(y - \hat{y})}^{2} =$

1. Complete the table.
2. Plot the least squares line on a scatterplot of the data. Plot the following line on the same graph:

   $\hat{y} = 14 - 2.5 x$
3. Show that SSE is larger for the line in part b than it is for the least squares line.

9.20 Construct a scatterplot of the following data.

x	.5	1	1.5
y	2	1	3

1. Plot the following two lines on your scatterplot:
  
  $y = 3 - x$ and $y = 1 + x$
2. Which of these lines would you choose to characterize the relationship between x and y? Explain.
3. Show that the sum of errors for both of these lines equals 0.
4. Which of these lines has the smaller SSE?
5. Find the least squares line for the data, and compare it with the two lines described in part a.
L09021 9.21 Consider the following pairs of measurements.

x	5	3	$-1$	2	7	6	4
y	4	3	0	1	8	5	3

1. Construct a scatterplot of these data.
2. What does the scatterplot suggest about the relationship between x and y?
3. Given that ${SS}_{x x} = 43.4286$, ${SS}_{x y} = 39.8571$, $\bar{y} = 3.4286$, and $\overline{x} = 3.7143$, calculate the least squares estimates of $β_{0}$ and $β_{1}$.
4. Plot the least squares line on your scatterplot. Does the line appear to fit the data well? Explain.
5. Interpret the y-intercept and slope of the least squares line. Over what range of x are these interpretations meaningful?

Applet Exercise 9.1

Use the applet entitled Regression by Eye to explore the relationship between the pattern of data in a scatterplot and the corresponding least squares model.

Run the applet several times. For each time, attempt to move the green line into a position that appears to minimize the vertical distances of the points from the line. Then click Show regression line to see the actual regression line. How close is your line to the actual line? Click New data to reset the applet.
Click the trash can to clear the graph. Use the mouse to place five points on the scatterplot that are approximately in a straight line. Then move the green line to approximate the regression line. Click Show regression line to see the actual regression line. How close were you this time?
Continue to clear the graph, and plot sets of five points with different patterns among the points. Use the green line to approximate the regression line. How close do you come to the actual regression line each time?
On the basis of your experiences with the applet, explain why we need to use more reliable methods of finding the regression line than just “eyeing” it.

Applying the Concepts—Basic

9.22 Do nice guys really finish last? In baseball, there is an old saying that “nice guys finish last.” Is this true in the competitive corporate world? Researchers at Harvard University attempted to answer this question and reported their results in Nature (Mar. 20, 2008). In the study, Boston-area college students repeatedly played a version of the game “prisoner’s dilemma,” where competitors choose cooperation, defection, or costly punishment. (Cooperation meant paying 1 unit for the opponent to receive 2 units; defection meant gaining 1 unit at a cost of 1 unit for the opponent; and punishment meant paying 1 unit for the opponent to lose 4 units.) At the conclusion of the games, the researchers recorded the average payoff and the number of times punishment was used for each player. A graph of the data is shown in the accompanying scatterplot.
1. Consider punishment use (x) as a predictor of average payoff (y). Based on the scatterplot, is there evidence of a linear trend?
2. Refer to part a. Is the slope of the line relating punishment use (x) to average payoff (y) positive or negative?
3. The researchers concluded that “winners don’t punish.” Do you agree? Explain.
MOON 9.23 Measuring the moon’s orbit. A handheld digital camera was used to photograph the moon’s orbit and the results summarized in the American Journal of Physics (Apr. 2014). The pictures were used to measure the angular size (in pixels) of the moon at various distances (heights) above the horizon (measured in degrees). The data for 13 different heights are illustrated in the graph below and saved in the MOON file.
1. Is there visual evidence of a linear trend between angular size (y) and height above horizon (x)? If so, is the trend positive or negative? Explain.
2. Draw what you believe is the best-fitting line through the data.
3. Draw vertical lines from the actual data points to the line, part b. Measure these deviations and then compute the sum of squared deviations for the visually fitted line.
4. An SAS simple linear regression printout for the data is shown in the next column. Compare the y-intercept and slope of the regression line to the visually fitted line, part b.
5. Locate SSE on the printout. Compare this value to the result in part c. Which value is smaller?
  
  SAS Output for Exercise 9.23
POLO 9.24 Game performance of water polo players. The journal Biology of Sport (Vol. 31, 2014) published a study of the physiological performance of top-level water polo players. Eight Olympic male water polo players participated in the study. Two variables were measured for each during competition: $y =$ mean heart rate over the four quarters of the game (expressed as a percentage of maximum heart rate) and $x =$ maximal oxygen uptake (VO₂max). The data (simulated, based on information provided in the article) are shown in the accompanying table. The researchers conducted a simple linear regression analysis of the data. A MINITAB printout of the analysis follows.

Player	HR%	VO₂Max
1	55	148
2	54	157
3	70	160
4	67	179
5	74	179
6	77	180
7	78	194
8	85	197

1. Give the equation of the least squares line.
2. Give a practical interpretation (if possible) of the y-intercept of the line.
3. Give a practical interpretation (if possible) of the slope of the line.
BTYPE 9.25 New method for blood typing. Refer to the Analytical Chemistry (May 2010) study in which medical researchers tested a new method of typing blood using low-cost paper, Exercise 2.163 (p. 96). The researchers applied blood drops to the paper and recorded the rate of absorption (called blood wicking). The table gives the wicking lengths (millimeters) for six blood drops, each at a different antibody concentration. Let $y =$ wicking length and $x =$ antibody concentration.

Droplet	Length (mm)	Concentration
1	22.50	0.0
2	16.00	0.2
3	13.50	0.4
4	14.00	0.6
5	13.75	0.8
6	12.50	1.0

Based on Khan, M. S., et al. “Paper diagnostic for instant blood typing.” Analytical Chemistry, Vol. 82, No. 10, May 2010 (Figure 4b).
1. Give the equation of the straight-line model relating y to x.
2. An SPSS printout of the simple linear regression analysis is shown below. Give the equation of the least squares line.
3. Give practical interpretations (if possible) of the estimated y-intercept and slope of the line.

ANTS 9.26 Mongolian desert ants. Refer to the Journal of Biogeography (Dec. 2003) study of ants in Mongolia, presented in Exercise 2.167 (p. 97). Data on annual rainfall, maximum daily temperature, and number of ant species recorded at each of 11 study sites are listed in the table.

Site	Region	Annual Rainfall (mm)	Max. Daily Temp. (°C)	Number of Ant Species
1	Dry Steppe	196	5.7	3
2	Dry Steppe	196	5.7	3
3	Dry Steppe	179	7.0	52
4	Dry Steppe	197	8.0	7
5	Dry Steppe	149	8.5	5
6	Gobi Desert	112	10.7	49
7	Gobi Desert	125	11.4	5
8	Gobi Desert	99	10.9	4
9	Gobi Desert	125	11.4	4
10	Gobi Desert	84	11.4	5
11	Gobi Desert	115	11.4	4

Based on Pfeiffer, M., et al. “Community organization and species richness of ants in Mongolia along an ecological gradient from steppe to Gobi desert.” Journal of Biogeography, Vol. 30, No. 12, Dec. 2003 (Tables 1 and 2).

1. Consider a straight-line model relating annual rainfall (y) and maximum daily temperature (x). A MINITAB printout of the simple linear regression is shown below. Give the least squares prediction equation.
2. Construct a scatterplot for the analysis you performed in part a. Include the least squares line on the plot. Does the line appear to be a good predictor of annual rainfall?
3. Now consider a straight-line model relating number of ant species (y) to annual rainfall (x). On the basis of the MINITAB printout below, repeat parts a and b.

9.27 Redshifts of quasi-stellar objects. Astronomers call a shift in the spectrum of galaxies a “redshift.” A correlation between redshift level and apparent magnitude (i.e., brightness on a logarithmic scale) of a quasi-stellar object was discovered and reported in the Journal of Astrophysics & Astronomy (Mar./Jun. 2003). Physicist D. Basu (Carleton University, Ottawa) applied simple linear regression to data collected for a sample of over 6,000 quasi-stellar objects with confirmed redshifts. The analysis yielded the following results for a specific range of magnitudes: $\hat{y} = 18.13 + 6.21 x$, where $y =$ magnitude and $x =$ redshift level.
1. Graph the least squares line. Is the slope of the line positive or negative?
2. Interpret the estimate of the y-intercept in the words of the problem.
3. Interpret the estimate of the slope in the words of the problem.

Applying the Concepts—Intermediate

H₂OPIPE 9.28 Repair and replacement costs of water pipes. Pipes used in a water distribution network are susceptible to breakage due to a variety of factors. When pipes break, engineers must decide whether to repair or replace the broken pipe. A team of civil engineers used regression analysis to estimate $y =$ the ratio of repair to replacement cost of commercial pipe in the IHS Journal of Hydraulic Engineering (Sept. 2012). The independent variable in the regression analysis was $x =$ the diameter (in millimeters) of the pipe. Data for a sample of 13 different pipe sizes are provided in the table, followed by a MINITAB simple linear regression printout.

Diameter	Ratio
80	6.58
100	6.97
125	7.39
150	7.61
200	7.78
250	7.92
300	8.20
350	8.42
400	8.60
450	8.97
500	9.31
600	9.47
700	9.72

Source: Suribabu, C. R., and Neelakantan, T. R. “Sizing of water distribution pipes based on performance measure and breakage-repair replacement economics.” IHS Journal of Hydraulic Engineering, Vol. 18, No. 3, Sept. 2012 (Table 1).
1. Find the least squares line relating ratio of repair to replacement cost (y) to pipe diameter (x) on the printout.
2. Locate the value of SSE on the printout. Is there another line with an average error of 0 that has a smaller SSE than the line, part a? Explain.
3. Interpret practically the values ${\hat{β}}_{0}$ and ${\hat{β}}_{1}$.

PGA 9.29 Ranking driving performance of professional golfers. Refer to The Sport Journal (Winter 2007) study of a new method for ranking the total driving performance of golfers on the Professional Golf Association (PGA) tour, presented in Exercise 2.66 (p. 62). Recall that the method computes a driving performance index based on a golfer’s average driving distance (yards) and driving accuracy (percent of drives that land in the fairway). Data for the top 40 PGA golfers (as ranked by the new method) are saved in the PGA file. (The first five and last five observations are listed in the next table.)

a. Write the equation of a straight-line model relating driving accuracy (y) to driving distance (x).

Based on Frederick Wiseman, Ph.D., Mohamed Habibullah, Ph.D., and Mustafa Yilmaz, Ph.D., The Sport Journal, Vol. 10, No. 1.

b. Use simple linear regression to fit the model you found in part a to the data. Give the least squares prediction equation.
c. Interpret the estimated y-intercept of the line.
d. Interpret the estimated slope of the line.
e. A professional golfer, practicing a new swing to increase his average driving distance, is concerned that his driving accuracy will be lower. Which of the two estimates, y-intercept or slope, will help you determine whether the golfer’s concern is a valid one? Explain.

Rank	Player	Driving Distance (yards)	Driving Accuracy (%)	Driving Performance Index
1	Woods	316.1	54.6	3.58
2	Perry	304.7	63.4	3.48
3	Gutschewski	310.5	57.9	3.27
4	Wetterich	311.7	56.6	3.18
5	Hearn	295.2	68.5	2.82
$⋮$	$⋮$	$⋮$	$⋮$	$⋮$
36	Senden	291	66	1.31
37	Mickelson	300	58.7	1.30
38	Watney	298.9	59.4	1.26
39	Trahan	295.8	61.8	1.23
40	Pappas	309.4	50.6	1.17

FCAT 9.30 FCAT scores and poverty. In the state of Florida, elementary school performance is based on the average score obtained by students on a standardized exam called the Florida Comprehensive Assessment Test (FCAT). An analysis of the link between FCAT scores and sociodemographic factors was published in the Journal of Educational and Behavioral Statistics (Spring 2004). Data on average math and reading FCAT scores of third graders, as well as the percentage of students below the poverty level, for a sample of 22 Florida elementary schools are listed in the table on p. 516.

a. Propose a straight-line model relating math score (y) to percentage (x) of students below the poverty level.
b. Use the method of least squares to fit the model to the data in the FCAT file.
c. Graph the least squares line on a scatterplot of the data. Is there visual evidence of a relationship between the two variables? Is the relationship positive or negative?
d. Interpret the estimates of the y-intercept and slope in the words of the problem.
e. Now consider a model relating reading score (y) to percentage (x) of students below the poverty level. Repeat parts a–d for this model.

Based on Tekwe, C. D., et al. “An empirical comparison of statistical models for value-added assessment of school performance.” Journal of Educational and Behavioral Statistics, Vol. 29, No. 1, Spring 2004 (Table 2).

Elementary School	FCAT—Math	FCAT—Reading	% Below Poverty
1	166.4	165.0	91.7
2	159.6	157.2	90.2
3	159.1	164.4	86.0
4	155.5	162.4	83.9
5	164.3	162.5	80.4
6	169.8	164.9	76.5
7	155.7	162.0	76.0
8	165.2	165.0	75.8
9	175.4	173.7	75.6
10	178.1	171.0	75.0
11	167.1	169.4	74.7
12	177.0	172.9	63.2
13	174.2	172.7	52.9
14	175.6	174.9	48.5
15	170.8	174.8	39.1
16	175.1	170.1	38.4
17	182.8	181.4	34.3
18	180.3	180.6	30.3
19	178.8	178.0	30.3
20	181.4	175.9	29.6
21	182.8	181.6	26.5
22	186.1	183.8	13.8
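As a rough sketch of the least squares fits requested in Exercise 9.30, the code below applies the usual formulas $\hat{β}_{1} = SS_{xy}/SS_{xx}$ and $\hat{β}_{0} = \bar{y} - \hat{β}_{1}\bar{x}$ to the 22 schools in the table, once for math and once for reading. The function name `least_squares` is a hypothetical helper, and the scatterplot part of the exercise is left out to keep the sketch free of plotting libraries.

```python
# Data for Exercise 9.30, transcribed from the table above (22 schools).
math_score = [166.4, 159.6, 159.1, 155.5, 164.3, 169.8, 155.7, 165.2, 175.4, 178.1, 167.1,
              177.0, 174.2, 175.6, 170.8, 175.1, 182.8, 180.3, 178.8, 181.4, 182.8, 186.1]
reading_score = [165.0, 157.2, 164.4, 162.4, 162.5, 164.9, 162.0, 165.0, 173.7, 171.0, 169.4,
                 172.9, 172.7, 174.9, 174.8, 170.1, 181.4, 180.6, 178.0, 175.9, 181.6, 183.8]
pct_poverty = [91.7, 90.2, 86.0, 83.9, 80.4, 76.5, 76.0, 75.8, 75.6, 75.0, 74.7,
               63.2, 52.9, 48.5, 39.1, 38.4, 34.3, 30.3, 30.3, 29.6, 26.5, 13.8]


def least_squares(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1*x (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    return y_bar - b1 * x_bar, b1


# Fit each response against the same predictor, percentage below poverty.
math_b0, math_b1 = least_squares(pct_poverty, math_score)
read_b0, read_b1 = least_squares(pct_poverty, reading_score)

print(f"math:    y-hat = {math_b0:.2f} + ({math_b1:.4f}) x")
print(f"reading: y-hat = {read_b0:.2f} + ({read_b1:.4f}) x")
```

The sign of each printed slope indicates whether the fitted relationship is positive or negative, which is the question posed in part c.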

BBALL 9.31 Sound waves from a basketball. Refer to the American Journal of Physics (June 2010) study of sound waves in a spherical cavity, Exercise 2.43 (p. 52). The frequencies of sound waves (estimated using a mathematical formula) resulting from the first 24 resonances (echoes) after striking a basketball with a metal rod are reproduced in the following table. Recall that the researcher expects the sound wave frequency to increase as the number of resonances increases.

Resonance	Frequency
1	979
2	1572
3	2113
4	2122
5	2659
6	2795
7	3181
8	3431
9	3638
10	3694
11	4038
12	4203
13	4334
14	4631
15	4711
16	4993
17	5130
18	5210
19	5214
20	5633
21	5779
22	5836
23	6259
24	6339

Based on Russell, D. A. “Basketballs as spherical acoustic cavities.” American Journal of Physics, Vol. 48, No. 6, June 2010 (Table I).

a. Hypothesize a model for frequency (y) as a function of number of resonances (x) that proposes a linearly increasing relationship.
b. According to the researcher’s theory, will the slope of the line be positive or negative?
c. Estimate the beta parameters of the model and (if possible) give a practical interpretation of each.
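One way to estimate the beta parameters for the basketball data is sketched below, under the assumption that the straight-line model $E(y) = β_{0} + β_{1}x$ is the one hypothesized in the first part of the exercise. The variable names and the `least_squares` helper are illustrative choices, not taken from the text.

```python
# Data for Exercise 9.31: resonance number 1..24 and the corresponding frequency.
resonance = list(range(1, 25))
frequency = [979, 1572, 2113, 2122, 2659, 2795, 3181, 3431, 3638, 3694, 4038, 4203,
             4334, 4631, 4711, 4993, 5130, 5210, 5214, 5633, 5779, 5836, 6259, 6339]


def least_squares(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1*x (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    return y_bar - b1 * x_bar, b1


b0, b1 = least_squares(resonance, frequency)

# b1 estimates the change in frequency for each additional resonance.
# b0 is the fitted value at x = 0, which lies outside the observed range 1..24.
print(f"y-hat = {b0:.1f} + {b1:.1f} x")
```

Because x = 0 resonances is outside the sampled range, the intercept printed here may have no practical interpretation, which is what the "(if possible)" in the exercise hints at.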

OJUICE 9.32 Sweetness of orange juice. The quality of the orange juice produced by a manufacturer is constantly monitored. There are numerous sensory and chemical components that combine to make the best-tasting orange juice. For example, one manufacturer has developed a quantitative index of the “sweetness” of orange juice. (The higher the index, the sweeter the juice.) Is there a relationship between the sweetness index and a chemical measure such as the amount of water-soluble pectin (parts per million) in the orange juice? Data collected on these two variables during 24 production runs at a juice-manufacturing plant are shown in the table. Suppose a manufacturer wants to use simple linear regression to predict the sweetness (y) from the amount of pectin (x).

a. Find the least squares line for the data.
b. Interpret ${\hat{β}}_{0}$ and ${\hat{β}}_{1}$ in the words of the problem.
c. Predict the sweetness index if the amount of pectin in the orange juice is 300 ppm. [Note: A measure of reliability of such a prediction is discussed in Section 9.6.]

Run	Sweetness Index	Pectin (ppm)
1	5.2	220
2	5.5	227
3	6.0	259
4	5.9	210
5	5.8	224
6	6.0	215
7	5.8	231
8	5.6	268
9	5.6	239
10	5.9	212
11	5.4	410
12	5.6	256
13	5.8	306
14	5.5	259
15	5.3	284
16	5.3	383
17	5.7	271
18	5.5	264
19	5.7	227
20	5.3	263
21	5.9	232
22	5.8	220
23	5.8	246
24	5.9	241

Note: The data in the table are authentic. For reasons of confidentiality, the name of the manufacturer cannot be disclosed.
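The fit and the prediction at 300 ppm asked for in Exercise 9.32 might be computed along the following lines. This is a sketch only: `least_squares` and `predict` are hypothetical helper names, and the prediction carries no measure of reliability (that topic belongs to Section 9.6).

```python
# Data for Exercise 9.32: 24 production runs.
sweetness = [5.2, 5.5, 6.0, 5.9, 5.8, 6.0, 5.8, 5.6, 5.6, 5.9, 5.4, 5.6,
             5.8, 5.5, 5.3, 5.3, 5.7, 5.5, 5.7, 5.3, 5.9, 5.8, 5.8, 5.9]
pectin = [220, 227, 259, 210, 224, 215, 231, 268, 239, 212, 410, 256,
          306, 259, 284, 383, 271, 264, 227, 263, 232, 220, 246, 241]


def least_squares(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1*x (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    return y_bar - b1 * x_bar, b1


def predict(b0, b1, x_new):
    """Point prediction from the fitted line at a single x value."""
    return b0 + b1 * x_new


b0, b1 = least_squares(pectin, sweetness)
pred_300 = predict(b0, b1, 300)  # 300 ppm lies inside the observed range 210..410

print(f"y-hat = {b0:.4f} + ({b1:.6f}) x")
print(f"predicted sweetness index at 300 ppm: {pred_300:.3f}")
```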

HEIGHT 9.33 Ideal height of your mate. Anthropologists theorize that humans tend to choose mates who are similar to themselves. This includes choosing mates who are similar in height. To test this theory, a study was conducted on 147 Cornell University students (Chance, Summer 2008). Each student was asked to select the height of his/her ideal spouse or life partner. The researchers fit the simple linear regression model, $E (y) = β_{0} + β_{1} x$, where $y =$ ideal partner’s height (in inches) and $x =$ student’s height (in inches). The data for the study (simulated from information provided in a scatterplot) are saved in the HEIGHT file. The accompanying table lists selected observations from the full data set.

Gender	Actual Height	Ideal Height
F	59	66
F	60	70
F	60	72
F	61	65
F	61	67
$⋮$	$⋮$	$⋮$
M	73.5	66
M	74	67
M	74	68
M	74	69
M	74	70

Based on Lee, G., Velleman, P., and Wainer, H. “Giving the finger to dating services.” Chance, Vol. 21, No. 3, Summer 2008 (adapted from Figure 3).

a. The researchers found the estimated slope of the line to be negative. Fit the model to the data using statistical software and verify this result.
b. The negative slope was interpreted as follows: “The taller the respondent was, the shorter they felt their ideal partner ought to be.” Do you agree?
c. The result in part b contradicts the theory developed by anthropologists. To gain insight into this phenomenon, use a scatterplot to graph the full data set. Use a different plotting symbol for male and female students. Now focus on just the data for the female students. What trend do you observe? Repeat for male students.
d. Fit the straight-line model to the data for the female students. Interpret the estimated slope of the line.
e. Repeat part d for the male students.
f. Based on the results in parts d and e, comment on whether the study data support the theory developed by anthropologists.

NAME2 9.34 The “name game.” Refer to the Journal of Experimental Psychology—Applied (June 2000) study in which the “name game” was used to help groups of students learn the names of other students in the group, Exercise 7.120 (p. 434). [In the “name game” student #1 states his/her full name, student #2 his/her name and the name of the first student, student #3 his/her name and the names of the first two students, etc.] After a 30-minute seminar, all students were asked to remember the full name of each of the other students in their group, and the researchers measured the proportion of names recalled for each. One goal of the study was to investigate the linear trend between $y =$ proportion of names recalled and $x =$ position (order) of the student during the game. The data (simulated on the basis of summary statistics provided in the research article) for 144 students in the first eight positions are saved in the NAME2 file. The first five and last five observations in the data set are listed in the next table. [Note: Since the student in position 1 actually must recall the names of all the other students, he or she is assigned position number 9 in the data set.] Use the method of least squares to estimate the line $E (y) = β_{0} + β_{1} x$. Interpret the $β$ estimates in the words of the problem.

Position	Recall
2	0.04
2	0.37
2	1.00
2	0.99
2	0.79
$⋮$	$⋮$
9	0.72
9	0.88
9	0.46
9	0.54
9	0.99

Based on Morris, P. E., and Fritz, C. O. “The name game: Using retrieval practice to improve the learning of names.” Journal of Experimental Psychology—Applied, Vol. 6, No. 2, June 2000 (data simulated from Figure 2).

Applying the Concepts—Advanced

Time (minutes)	Mass (pounds)
0	6.64
1	6.34
2	6.04
4	5.47
6	4.94
8	4.44
10	3.98
12	3.55
14	3.15
16	2.79
18	2.45
20	2.14
22	1.86
24	1.60
26	1.37
28	1.17
30	0.98
35	0.60
40	0.34
45	0.17
50	0.06
55	0.02
60	0.00

Based on Barry, J. “Estimating rates of spreading and evaporation of volatile liquids.” Chemical Engineering Progress, Vol. 101, No. 1, Jan. 2005.

SPILL 9.35 Spreading rate of spilled liquid. Refer to the Chemical Engineering Progress (Jan. 2005) study of the rate at which a spilled volatile liquid will spread across a surface, presented in Exercise 2.168 (p. 97). Recall that a DuPont Corp. engineer calculated the mass (in pounds) of a 50-gallon methanol spill after a period ranging from 0 to 60 minutes. Do the data shown in the accompanying table indicate that the mass of the spill tends to diminish as time increases? If so, how much will the mass diminish each minute?
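A minimal sketch of one way to approach Exercise 9.35 follows, assuming a straight-line model $E(y) = β_{0} + β_{1}x$ with $y =$ mass and $x =$ time; the exercise itself does not dictate the code, and `least_squares` is a hypothetical helper. The estimated slope is read as the average change in mass per additional minute.

```python
# Data for Exercise 9.35: elapsed time (minutes) and calculated mass (pounds) of the spill.
time_min = [0, 1, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30, 35, 40, 45, 50, 55, 60]
mass_lb = [6.64, 6.34, 6.04, 5.47, 4.94, 4.44, 3.98, 3.55, 3.15, 2.79, 2.45, 2.14,
           1.86, 1.60, 1.37, 1.17, 0.98, 0.60, 0.34, 0.17, 0.06, 0.02, 0.00]


def least_squares(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1*x (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    return y_bar - b1 * x_bar, b1


b0, b1 = least_squares(time_min, mass_lb)

# A negative b1 means the fitted mass falls as time increases;
# its magnitude is the estimated decrease in pounds per minute.
print(f"y-hat = {b0:.3f} + ({b1:.4f}) x")
print(f"estimated decrease per minute: {-b1:.4f} pounds")
```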

TWEETS 9.36 Forecasting movie revenues with Twitter. Marketers are keenly interested in how social media (e.g., Facebook, Twitter) may influence consumers who buy their products. Researchers at HP Labs (Palo Alto, CA) investigated whether the volume of chatter on Twitter.com could be used to forecast the box office revenues of movies (IEEE International Conference on Web Intelligence and Intelligent Agent Technology, 2010). Opening weekend box office revenue data (in millions of dollars) were collected for a sample of 24 recent movies. In addition, the researchers computed each movie’s tweet rate, i.e., the average number of tweets (at Twitter.com) referring to the movie per hour one week prior to the movie’s release. The data (simulated based on information provided in the study) are listed in the table. Assuming that movie revenue and tweet rate are linearly related, how much do you estimate a movie’s opening weekend revenue to change as the tweet rate for the movie increases by an average of 100 tweets per hour?

Tweet Rate	Revenue (millions)
1365.8	142
1212.8	77
581.5	61
310.1	32
455	31
290	30
250	21
680.5	18
150	18
164.5	17
113.9	16
144.5	15
418	14
98	14
100.8	12
115.4	11
74.4	10
87.5	9
127.6	9
52.2	9
144.1	8
41.3	2
2.75	0.3
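For Exercise 9.36, the quantity of interest is the estimated change in revenue for a 100-unit increase in tweet rate, that is, $100\hat{β}_{1}$. The sketch below computes it from the 23 rows shown in the table as reproduced here (the exercise text mentions 24 movies, so the result may differ from one based on the full TWEETS file). The helper name `least_squares` is hypothetical.

```python
# Data for Exercise 9.36 as listed in the table above (23 rows).
tweet_rate = [1365.8, 1212.8, 581.5, 310.1, 455, 290, 250, 680.5, 150, 164.5, 113.9, 144.5,
              418, 98, 100.8, 115.4, 74.4, 87.5, 127.6, 52.2, 144.1, 41.3, 2.75]
revenue = [142, 77, 61, 32, 31, 30, 21, 18, 18, 17, 16, 15,
           14, 14, 12, 11, 10, 9, 9, 9, 8, 2, 0.3]


def least_squares(x, y):
    """Return (b0, b1) for the fitted line y-hat = b0 + b1*x (hypothetical helper)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    ss_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = ss_xy / ss_xx
    return y_bar - b1 * x_bar, b1


b0, b1 = least_squares(tweet_rate, revenue)

# The slope is in millions of dollars per one tweet per hour, so scale by 100.
change_per_100 = 100 * b1

print(f"y-hat = {b0:.3f} + {b1:.5f} x")
print(f"estimated revenue change per 100 tweets/hour: {change_per_100:.2f} million dollars")
```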
