9.6 Using the Model for Estimation and Prediction

If we are satisfied that a useful model has been found to describe the relationship between reaction time and percent of drug in the bloodstream, we are ready for step 5 in our regression modeling procedure: using the model for estimation and prediction.

The most common uses of a probabilistic model for making inferences can be divided into two categories. The first is the use of the model for estimating the mean value of y, E(y), for a specific value of x.

For our drug reaction example, we may want to estimate the mean response time for all people whose blood contains 4% of the drug.

The second use of the model entails predicting a new individual y value for a given x.

That is, we may want to predict the reaction time for a specific person who possesses 4% of the drug in the bloodstream.

In the first case, we are attempting to estimate the mean value of y for a very large number of experiments at the given x value. In the second case, we are trying to predict the outcome of a single experiment at the given x value. Which of these uses of the model—estimating the mean value of y or predicting an individual new value of y (for the same value of x)—can be accomplished with the greater accuracy?

Before answering this question, we first consider the problem of choosing an estimator (or predictor) of the mean (or a new individual) y value. We will use the least squares prediction equation

$\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$

both to estimate the mean value of y and to predict a specific new value of y for a given value of x. For our example, we found that

$\hat{y} = -.1 + .7x$

so the estimated mean reaction time for all people when x = 4 (the drug is 4% of the blood content) is

$\hat{y} = -.1 + .7(4) = 2.7$ seconds

The same value is used to predict a new y value when x = 4. That is, both the estimated mean and the predicted value of y are $\hat{y} = 2.7$ when x = 4, as shown in Figure 9.20.

Figure 9.20

Estimated mean value and predicted individual value of reaction time y for x=4

The difference between these two uses of the model lies in the accuracies of the estimate and the prediction, best measured by the sampling errors of the least squares line when it is used as an estimator and as a predictor, respectively. These errors are reflected in the standard deviations given in the following box:

Sampling Errors for the Estimator of the Mean of y and the Predictor of an Individual New Value of y

  1. The standard deviation of the sampling distribution of the estimator $\hat{y}$ of the mean value of y at a specific value of x, say $x_p$, is

    $\sigma_{\hat{y}} = \sigma \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$

    where σ is the standard deviation of the random error ε. We refer to $\sigma_{\hat{y}}$ as the standard error of $\hat{y}$.

  2. The standard deviation of the prediction error for the predictor $\hat{y}$ of an individual new y value at a specific value of x, say $x_p$, is

    $\sigma_{(y - \hat{y})} = \sigma \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$

    where σ is the standard deviation of the random error ε. We refer to $\sigma_{(y - \hat{y})}$ as the standard error of prediction.
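As a rough illustration, the two standard deviations in the box can be written as small Python functions. The function names are hypothetical, and the inputs are the drug reaction quantities quoted in this section (n = 5, x̄ = 3, SSxx = 10), with s = .61 standing in for the unknown σ; that substitution is an assumption made only so the functions can be evaluated.

```python
import math

def se_mean(sigma, n, x_p, x_bar, ss_xx):
    """Standard error of y-hat as an estimator of E(y) at x = x_p."""
    return sigma * math.sqrt(1 / n + (x_p - x_bar) ** 2 / ss_xx)

def se_prediction(sigma, n, x_p, x_bar, ss_xx):
    """Standard error of prediction for an individual new y at x = x_p."""
    return sigma * math.sqrt(1 + 1 / n + (x_p - x_bar) ** 2 / ss_xx)

# Assumed inputs: drug reaction summary values, with s = .61 used in place of sigma.
sigma, n, x_bar, ss_xx = 0.61, 5, 3.0, 10.0
mean_se = se_mean(sigma, n, 4.0, x_bar, ss_xx)
pred_se = se_prediction(sigma, n, 4.0, x_bar, ss_xx)

print(f"standard error of y-hat:      {mean_se:.3f}")
print(f"standard error of prediction: {pred_se:.3f}")
```

The only difference between the two functions is the extra 1 under the square root, which is why the standard error of prediction is always the larger of the two.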

The true value of σ is rarely known, so we estimate σ by s and calculate the estimation and prediction intervals as shown in the next two boxes:

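A 100(1 − α)% Confidence Interval for the Mean Value of y at $x = x_p$

$\hat{y} \pm t_{\alpha/2}$ (Estimated standard error of $\hat{y}$)

or

$\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$

where $t_{\alpha/2}$ is based on (n − 2) degrees of freedom.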

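A 100(1 − α)% Prediction Interval* for an Individual New Value of y at $x = x_p$

$\hat{y} \pm t_{\alpha/2}$ (Estimated standard error of prediction)

or

$\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}}$

where $t_{\alpha/2}$ is based on (n − 2) degrees of freedom.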

STIMULUS Example 9.7 Estimating the Mean of y—Drug Reaction Regression

Problem

  1. Refer to the simple linear regression on drug reaction. Find a 95% confidence interval for the mean reaction time when the concentration of the drug in the bloodstream is 4%.

Solution

  1. For a 4% concentration, x=4 and the confidence interval for the mean value of y is

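    $\hat{y} \pm t_{\alpha/2}\, s \sqrt{\frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = \hat{y} \pm t_{.025}\, s \sqrt{\frac{1}{5} + \frac{(4 - \bar{x})^2}{SS_{xx}}}$

    where $t_{.025}$ is based on n − 2 = 5 − 2 = 3 degrees of freedom. Recall that $\hat{y} = 2.7$, s = .61, $\bar{x} = 3$, and $SS_{xx} = 10$. From Table III in Appendix A, $t_{.025} = 3.182$. Thus, we have

    $2.7 \pm (3.182)(.61)\sqrt{\frac{1}{5} + \frac{(4 - 3)^2}{10}} = 2.7 \pm (3.182)(.61)(.55) = 2.7 \pm (3.182)(.34) = 2.7 \pm 1.1$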

    Therefore, when the percentage of drug in the bloodstream is 4%, we can be 95% confident that the mean reaction time for all possible subjects will range from 1.6 to 3.8 seconds.
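The arithmetic in this solution can be checked with a few lines of Python. This is only a sketch: every number is copied from the example, the t value is typed in from the table rather than computed, and the variable names are illustrative.

```python
import math

# Values quoted in Example 9.7; t_crit is t_.025 with n - 2 = 3 degrees of freedom.
y_hat, s, n, x_bar, ss_xx, x_p = 2.7, 0.61, 5, 3.0, 10.0, 4.0
t_crit = 3.182

# Estimated standard error of y-hat, then the interval half-width.
se_fit = s * math.sqrt(1 / n + (x_p - x_bar) ** 2 / ss_xx)
half_width = t_crit * se_fit
ci = (y_hat - half_width, y_hat + half_width)

print(f"{ci[0]:.1f} to {ci[1]:.1f} seconds")
```

Carrying full precision gives a half-width of about 1.06 rather than 1.1, because the hand calculation rounds at each step. The endpoints still agree with the text to one decimal place.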

Look Back

Note that we used a small amount of data (a small sample size) for purposes of illustration in fitting the least squares line. The interval would probably be narrower if more information had been obtained from a larger sample.

Now Work Exercise 9.104a-d

STIMULUS Example 9.8 Predicting an Individual Value of y—Drug Reaction Regression

Problem

  1. Refer again to the drug reaction regression. Predict the reaction time for the next performance of the experiment for a subject with a drug concentration of 4%. Use a 95% prediction interval.

Solution

  1. To predict the response time for an individual new subject for whom x=4, we calculate the 95% prediction interval as

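    $\hat{y} \pm t_{\alpha/2}\, s \sqrt{1 + \frac{1}{n} + \frac{(x_p - \bar{x})^2}{SS_{xx}}} = 2.7 \pm (3.182)(.61)\sqrt{1 + \frac{1}{5} + \frac{(4 - 3)^2}{10}} = 2.7 \pm (3.182)(.61)(1.14) = 2.7 \pm (3.182)(.70) = 2.7 \pm 2.2$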

    Therefore, when the drug concentration for an individual is 4%, we predict with 95% confidence that the reaction time for this new individual will fall into the interval from .5 to 4.9 seconds.
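One way to see where the extra width comes from is to build the standard error of prediction from its two variance components. The sketch below does this with the values quoted in the example; the variable names are illustrative, and the t value is again typed in from the table.

```python
import math

# Values quoted in Example 9.8; t_crit is t_.025 with n - 2 = 3 degrees of freedom.
y_hat, s, n, x_bar, ss_xx, x_p = 2.7, 0.61, 5, 3.0, 10.0, 4.0
t_crit = 3.182

var_mean = s ** 2 * (1 / n + (x_p - x_bar) ** 2 / ss_xx)  # error in estimating E(y)
var_new = s ** 2                                           # random error of the new y
se_pred = math.sqrt(var_new + var_mean)

half_width = t_crit * se_pred
pi = (y_hat - half_width, y_hat + half_width)

print(f"{pi[0]:.1f} to {pi[1]:.1f} seconds")
```

Here `var_new` is several times larger than `var_mean`, so most of the interval's width comes from the random error of the single new observation rather than from uncertainty about the line.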

Look Back

Like the confidence interval for the mean value of y, the prediction interval for y is quite wide. This is because we have chosen a simple example (one with only five data points) to fit the least squares line. The width of the prediction interval could be reduced by using a larger number of data points.

Now Work Exercise 9.104e

Both the confidence interval for E(y) and the prediction interval for y can be obtained from a statistical software package. Figure 9.21 is a MINITAB printout showing the confidence interval and prediction interval, respectively, for the data in the drug example.

The 95% confidence interval for E(y) when x=4, highlighted under “95% CI” in Figure 9.21, is (1.645, 3.755). The 95% prediction interval for y when x=4, highlighted in Figure 9.21 under “95% PI,” is (.503, 4.897). These agree with the ones computed in Examples 9.7 and 9.8.

Figure 9.21

MINITAB printout giving 95% confidence interval for E(y) and 95% prediction interval for y
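For readers without MINITAB, the same two intervals can be approximated in plain Python. The sketch below rests on a loud assumption: the five (x, y) pairs are taken to be the drug reaction data, which are not listed in this section. They were chosen because they reproduce the quantities quoted here (x̄ = 3, SSxx = 10, s ≈ .61, and ŷ = 2.7 at x = 4). The t value is hard-coded rather than computed.

```python
import math

# ASSUMPTION: these five pairs stand in for the drug reaction data (not listed in this section).
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
ss_xx = sum((xi - x_bar) ** 2 for xi in x)
ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least squares estimates and the estimate s of sigma.
b1 = ss_xy / ss_xx
b0 = y_bar - b1 * x_bar
sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
s = math.sqrt(sse / (n - 2))

T_CRIT = 3.182446  # t_.025 with 3 degrees of freedom, hard-coded

x_p = 4
y_hat = b0 + b1 * x_p
distance_term = 1 / n + (x_p - x_bar) ** 2 / ss_xx
ci_half = T_CRIT * s * math.sqrt(distance_term)
pi_half = T_CRIT * s * math.sqrt(1 + distance_term)
ci = (y_hat - ci_half, y_hat + ci_half)
pi = (y_hat - pi_half, y_hat + pi_half)

print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
print(f"95% PI: ({pi[0]:.3f}, {pi[1]:.3f})")
```

Because s is not rounded to .61 here, the endpoints should land closer to the printout than to the hand calculations in Examples 9.7 and 9.8.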

Note that, in this example, the prediction interval for an individual new value of y is wider than the corresponding confidence interval for the mean value of y. Will this always be true? The answer is “Yes.” The error in estimating the mean value of y, E(y), for a given value of x, say, $x_p$, is the distance between the least squares line and the true line of means, $E(y) = \beta_0 + \beta_1 x$. This error, $[\hat{y} - E(y)]$, is shown in Figure 9.22. In contrast, the error $(y_p - \hat{y})$ in predicting some future value of y is the sum of two errors: the error in estimating the mean of y, E(y), shown in Figure 9.22, plus the random error that is a component of the value of y that is to be predicted. (See Figure 9.23.) Consequently, the error in predicting a particular value of y will be larger than the error in estimating the mean value of y for a particular value of x.

Note from their formulas that both the error of estimation and the error of prediction take their smallest values when $x_p = \bar{x}$. The farther $x_p$ lies from $\bar{x}$, the larger will be the errors of estimation and prediction. You can see why this is true by noting the deviations for different values of $x_p$ between the actual line of means $E(y) = \beta_0 + \beta_1 x$ and the predicted line of means $\hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x$ shown in Figure 9.23. The deviation is larger at the extremes of the interval, where the largest and smallest values of x in the data set occur.

Figure 9.22

Error in estimating the mean value of y for a given value of x

Both the confidence intervals for mean values and the prediction intervals for new values are depicted over the entire range of the regression line in Figure 9.24. You can see that the confidence interval is always narrower than the prediction interval and that they are both narrowest at the mean $\bar{x}$, increasing steadily as the distance $|x - \bar{x}|$ increases. In fact, when x is selected far enough away from $\bar{x}$ so that it falls outside the range of the sample data, it is dangerous to make any inferences about E(y) or y. We call this the problem of extrapolation.
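The widening of both bands can be tabulated directly from the interval formulas. The sketch below uses the drug reaction summary values quoted in this section; the helper name `half_widths` and the grid of $x_p$ values are illustrative choices, and $x_p = 7$ is a hypothetical value included only because it lies far from x̄.

```python
import math

# Summary values quoted for the drug reaction example (t_.025 with 3 df from the table).
s, n, x_bar, ss_xx, t_crit = 0.61, 5, 3.0, 10.0, 3.182

def half_widths(x_p):
    """Return (confidence, prediction) interval half-widths at x = x_p."""
    distance_term = 1 / n + (x_p - x_bar) ** 2 / ss_xx
    ci_half = t_crit * s * math.sqrt(distance_term)
    pi_half = t_crit * s * math.sqrt(1 + distance_term)
    return ci_half, pi_half

# x_p = 7 is a hypothetical value chosen only because it lies far from x_bar.
grid = [1, 2, 3, 4, 5, 7]
table = {x_p: half_widths(x_p) for x_p in grid}

for x_p, (ci_half, pi_half) in table.items():
    print(f"x_p = {x_p}: CI half-width {ci_half:.2f}, PI half-width {pi_half:.2f}")
```

The formulas only account for sampling error under the fitted straight-line model. They cannot signal that the model itself may fail far from the observed data, so the computed widths at distant x values may still understate the real uncertainty.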

Caution

Using the least squares prediction equation for extrapolation, i.e., to estimate the mean value of y or to predict a particular value of y for values of x that fall outside the range of the values of x contained in your sample data, may lead to errors of estimation or prediction that are much larger than expected. Although the least squares model may provide a very good fit to the data over the range of x values contained in the sample, it could give a poor representation of the true model for values of x outside that region.

Figure 9.23

Error in predicting a future value of y for a given value of x

The width of the confidence interval grows smaller as n is increased; thus, in theory, you can obtain as precise an estimate of the mean value of y as desired (at any given x) by selecting a large enough sample. The prediction interval for a new value of y also grows smaller as n increases, but there is a lower limit on its width. If you examine the formula for the prediction interval, you will see that the interval can get no smaller than $\hat{y} \pm z_{\alpha/2}\sigma$.* Thus, the only way to obtain more accurate predictions for new values of y is to reduce the standard deviation σ of the regression model. This can be accomplished only by improving the model, either by using a curvilinear (rather than linear) relationship with x or by adding new independent variables to the model (or both). Consult the chapter references to learn more about these methods of improving the model.
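A short numerical sketch makes the lower limit visible. It rests on simplifying assumptions that are not part of the text: σ is treated as known and set to .61 purely for illustration, the normal value z is used at every sample size in place of t, and the intervals are evaluated at $x_p = \bar{x}$ so that the distance term drops out.

```python
import math
from statistics import NormalDist

sigma = 0.61                      # assumed value of sigma, for illustration only
z = NormalDist().inv_cdf(0.975)   # z_.025, roughly 1.96

sample_sizes = [5, 50, 500, 5000]

# Evaluate at x_p = x_bar so the (x_p - x_bar)^2 / SS_xx term is zero.
ci_half = [z * sigma * math.sqrt(1 / n) for n in sample_sizes]
pi_half = [z * sigma * math.sqrt(1 + 1 / n) for n in sample_sizes]
floor = z * sigma  # the limiting half-width of the prediction interval

for n, c, p in zip(sample_sizes, ci_half, pi_half):
    print(f"n = {n:5d}: CI half-width {c:.3f}, PI half-width {p:.3f}")
print(f"lower limit z * sigma = {floor:.3f}")
```

The confidence interval half-width shrinks toward zero as n grows, while the prediction interval half-width levels off just above `floor`.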

Figure 9.24

Confidence intervals for mean values and prediction intervals for new values

Now Work Exercise 9.104f

Statistics in Action Revisited

Using the Straight-Line Model to Predict Pipe Location for the Dowsing Data

The group of German physicists who conducted the dowsing experiments stated that the data for the three “best” dowsers empirically support the dowsing theory. If so, then the straight-line model relating a dowser’s guess (x) to actual pipe location (y) should yield accurate predictions. The MINITAB printout shown in Figure SIA9.5 gives a 95% prediction interval for y when a dowser guesses x=50 meters (the middle of the 100-meter-long waterpipe). The highlighted interval is (9.3,100.23). Thus, we can be 95% confident that the actual pipe location will fall between 9.3 meters and 100.23 meters for this guess. Since the pipe is only 100 meters long, the interval in effect ranges from 0 to 100 meters—the entire length of the pipe! This result, of course, is due to the fact that the straight-line model is not a statistically useful predictor of pipe location, a fact we discovered in the previous Statistics in Action Revisited sections.

Figure SIA9.5

MINITAB prediction interval for dowsing data

Exercises 9.99–9.119

Understanding the Principles

  1. 9.99 Explain the difference between y and E(y) for a given x.

  2. 9.100 True or False. For a given x, a confidence interval for E(y) will always be wider than a prediction interval for y.

  3. 9.101 True or False. The greater the deviation between x and $\bar{x}$, the wider the prediction interval for y will be.

  4. 9.102 For each of the following, decide whether the proper inference is a prediction interval for y or a confidence interval for E(y):

    1. A jeweler wants to predict the selling price of a diamond stone on the basis of its size (number of carats).

    2. A psychologist wants to estimate the average IQ of all patients who have a certain income level.

Learning the Mechanics

  1. 9.103 In fitting a least squares line to n=10 data points, the following quantities were computed:

    $SS_{xx} = 32$, $\bar{x} = 3$, $SS_{yy} = 26$, $\bar{y} = 4$, $SS_{xy} = 28$
    1. Find the least squares line.

    2. Graph the least squares line.

    3. Calculate SSE.

    4. Calculate $s^2$.

    5. Find a 95% confidence interval for the mean value of y when $x_p = 2.5$.

    6. Find a 95% prediction interval for y when $x_p = 4$.

  2. L09104 9.104 Consider the following pairs of measurements.

    x 1 2 3 4 5 6 7
    y 3 5 4 6 7 7 10
    1. Construct a scatterplot of these data.

    2. Find the least squares line, and plot it on your scatterplot.

    3. Find $s^2$.

    4. Find a 90% confidence interval for the mean value of y when x=4. Plot the upper and lower bounds of the confidence interval on your scatterplot.

    5. Find a 90% prediction interval for a new value of y when x=4. Plot the upper and lower bounds of the prediction interval on your scatterplot.

    6. Compare the widths of the intervals you constructed in parts d and e. Which is wider and why?

  3. L09105 9.105 Consider the following pairs of measurements.

    x 4 6 0 5 2 3 2 6 2 1
    y 3 5 -1 4 3 2 0 4 1 1

    For these data, $SS_{xx} = 38.900$, $SS_{yy} = 33.600$, $SS_{xy} = 32.8$, and $\hat{y} = -.414 + .843x$.

    1. Construct a scatterplot of the data.

    2. Plot the least squares line on your scatterplot.

    3. Use a 95% confidence interval to estimate the mean value of y when $x_p = 6$. Plot the upper and lower bounds of the interval on your scatterplot.

    4. Repeat part c for $x_p = 3.2$ and $x_p = 0$.

    5. Compare the widths of the three confidence intervals you constructed in parts c and d, and explain why they differ.

  4. 9.106 Refer to Exercise 9.105.

    1. Using no information about x, estimate and calculate a 95% confidence interval for the mean value of y. [Hint: Use the one-sample t methodology of Section 7.3.]

    2. Plot the estimated mean value and the confidence interval as horizontal lines on your scatterplot.

    3. Compare the confidence intervals you calculated in parts c and d of Exercise 9.105 with the one you calculated in part a of this exercise. Does x appear to contribute information about the mean value of y?

    4. Check the answer you gave in part c with a statistical test of the null hypothesis $H_0: \beta_1 = 0$ against $H_a: \beta_1 \neq 0$. Use α = .05.

Applying the Concepts—Basic

  1. 9.107 Do nice guys finish first or last? Refer to the Nature (Mar. 20, 2008) study of the use of punishment in cooperation games, Exercise 9.22 (p. 512). Recall that simple linear regression was used to model a player’s average payoff (y) as a straight-line function of the number of times punishment was used (x) by the player.

    1. If the researchers want to predict average payoff for a single player who used punishment 10 times, how should they proceed?

    2. If the researchers want to estimate the mean of the average payoffs for all players who used punishment 10 times, how should they proceed?

  2. MOON 9.108 Measuring the moon’s orbit. Refer to the American Journal of Physics (Apr. 2014) study of the moon’s orbit, Exercise 9.23 (p. 513). Recall that the angular size (y) of the moon was modeled as a straight-line function of height above horizon (x). A MINITAB printout showing both a 95% prediction interval for y and a 95% confidence interval for E(y) when x=50 degrees is displayed below.

    SAS output for Exercise 9.109

    1. Give a practical interpretation of the 95% prediction interval.

    2. Give a practical interpretation of the 95% confidence interval.

    3. A researcher wants to predict the angular size of the moon when the height above the horizon is 80 degrees. Do you recommend that the researcher use the least squares line shown in the printout to make the prediction? Explain.

  3. 9.109 Mongolian desert ants. Refer to the Journal of Biogeography (Dec. 2003) study of ant sites in Mongolia, presented in Exercise 9.26 (p. 514). You applied the method of least squares to estimate the straight-line model relating annual rainfall (y) and maximum daily temperature (x). A SAS printout giving 95% prediction intervals for the amount of rainfall at each of the 11 sites is shown at the top of the page. Select the interval associated with site (observation) 7 and interpret it practically.

  4. PGA 9.110 Ranking driving performance of professional golfers. Refer to The Sport Journal (Winter 2007) study of a new method for ranking the total driving performance of golfers on the Professional Golf Association (PGA) tour, presented in Exercise 9.29 (p. 515). You fit a straight-line model relating driving accuracy (y) to driving distance (x). A MINITAB printout with prediction and confidence intervals for a driving distance of x=300 yards is shown below.

    1. Locate the 95% prediction interval for driving accuracy (y) on the printout, and give a practical interpretation of the result.

    2. Locate the 95% confidence interval for mean driving accuracy (y) on the printout, and give a practical interpretation of the result.

    3. If you are interested in knowing the average driving accuracy of all PGA golfers who have a driving distance of 300 yards, which of the intervals is relevant? Explain.

  5. OJUICE 9.111 Sweetness of orange juice. Refer to the simple linear regression of sweetness index y and amount of pectin, x, for n=24 orange juice samples, presented in Exercise 9.32 (p. 516). A 90% confidence interval for the mean sweetness index E(y) for each value of x is shown on the SPSS spreadsheet on the next page. Select an observation and interpret this interval.

  6. NITRO 9.112 Removing nitrogen from toxic wastewater. Highly toxic wastewater is produced during the manufacturing of dry-spun acrylic fiber. One way to lessen toxicity is to remove the nitrogen from the wastewater. A group of environmental engineers investigated a promising method—called anaerobic ammonium oxidation—for nitrogen removal and reported the results in the Chemical Engineering Journal (Apr. 2013). A sample of 120 specimens of toxic wastewater were collected and each treated with the nitrogen removal method. The amount of nitrogen removed (measured in milligrams per liter) was determined as well as the amount of ammonium (milligrams per liter) used in the removal process. These data (simulated from information provided in the journal article) are saved in the NITRO file. The data for the first 5 specimens are shown below. A simple linear regression analysis, where y=amount of nitrogen removed and x=amount of ammonium used, is also shown in the accompanying SAS printout on p. 549.

    1. Assess statistically the adequacy of the fit of the linear model. Do you recommend using the model for predicting nitrogen amount?

    2. On the SAS printout, locate a 95% prediction interval for nitrogen amount when amount of ammonium used is 100 milligrams per liter. Practically interpret the result.

    3. Will a 95% confidence interval for the mean nitrogen amount when amount of ammonium used is 100 milligrams per liter be wider or narrower than the interval, part b? Explain.

    First 5 observations of 120
    Nitrogen Ammonium
    18.87 67.40
    17.01 12.49
    23.88 61.96
    10.45 15.63
    36.03 83.66

    SPSS output for Exercise 9.111

    SAS Output for Exercise 9.112

Applying the Concepts—Intermediate

  1. POLO 9.113 Game performance of water polo players. Refer to the Biology of Sport (Vol. 31, 2014) study of the physiological performance of top-level water polo players, Exercise 9.24 (p. 513). Recall that the in-game heart rate (y, expressed as a percentage of maximum heart rate) of a player was modeled as a straight-line function of the player’s maximal oxygen uptake (x, VO2max). A researcher desires an estimate of the average in-game heart rate of all top-level water polo players who have a maximal oxygen uptake of 150 VO2max.

    1. Which interval is desired by the researcher, a 95% prediction interval for y or a 95% confidence interval for E(y)? Explain.

    2. Use statistical software to compute the desired interval.

    3. Give a practical interpretation of the interval.

  2. BBALL 9.114 Sound waves from a basketball. Refer to the American Journal of Physics (June 2010) study of sound waves in a spherical cavity, Exercise 9.31 (p. 516). You fit a straight-line model relating sound wave frequency (y) to number of resonances (x) for n=24 echoes resulting from striking a basketball with a metal rod.

    1. Use the model to predict the sound wave frequency for the 10th resonance.

    2. Form a 90% confidence interval for the prediction, part a. Interpret the result.

    3. Suppose you want to predict the sound wave frequency for the 30th resonance. What are the dangers in making this prediction with the fitted model?

  3. HEIGHT 9.115 Ideal height of your mate. Refer to the Chance (Summer 2008) study of the height of the ideal mate, Exercise 9.33 (p. 516). The data were used to fit the simple linear regression model, E(y)=β0+β1x, where y=ideal partner’s height (in inches) and x=student's height (in inches). One model was fitted for male students and one model was fitted for female students. Consider a student who is 66 inches tall.

    1. If the student is a female, use the model to predict the height of her ideal mate. Form a 95% confidence interval for the prediction and interpret the result.

    2. If the student is a male, use the model to predict the height of his ideal mate. Form a 95% confidence interval for the prediction and interpret the result.

    3. Which of the two inferences, parts a and b, may be invalid? Why?

  4. NAME2 9.116 The “name game.” Refer to the Journal of Experimental Psychology—Applied (June 2000) name-retrieval study, presented in Exercise 9.34 (p. 517).

    1. Find a 99% confidence interval for the mean recall proportion for students in the fifth position during the “name game.” Interpret the result.

    2. Find a 99% prediction interval for the recall proportion of a particular student in the fifth position during the “name game.” Interpret the result.

    3. Compare the intervals you found in parts a and b. Which interval is wider? Will this always be the case? Explain.

  5. LSPILL 9.117 Spreading rate of spilled liquid. Refer to the Chemical Engineering Progress (Jan. 2005) study of the rate at which a spilled volatile liquid will spread across a surface, presented in Exercise 9.35 (p. 517). Recall that simple linear regression was used to model y = mass of the spill as a function of x = elapsed time of the spill.

    1. Find a 99% confidence interval for the mean mass of all spills with an elapsed time of 15 minutes. Interpret the result.

    2. Find a 99% prediction interval for the mass of a single spill with an elapsed time of 15 minutes. Interpret the result.

    3. Compare the intervals you found in parts a and b. Which interval is wider? Will this always be the case? Explain.

  6. TWEETS 9.118 Forecasting movie revenues with Twitter. Refer to the IEEE International Conference on Web Intelligence and Intelligent Agent Technology (2010) study of how social media (e.g., Twitter.com) may influence the products consumers buy, Exercise 9.36 (p. 518). Recall that opening weekend box office revenue (in millions of dollars) and tweet rate (average number of tweets referring to the movie per hour) were collected for a sample of 24 recent movies. The data are reproduced in the next table. Use simple linear regression to find a 90% prediction interval for the revenue (y) of a movie with a tweet rate (x) of 150 tweets per hour. Give a practical interpretation of the interval.

Applying the Concepts—Advanced

  1. TOOL 9.119 Life tests of cutting tools. Refer to the data on life tests of cutting tools, Exercise 9.52 (p. 523).

    1. Use a 90% confidence interval to estimate the mean useful life of a brand-A cutting tool when the cutting speed is 45 meters per minute. Repeat for brand B. Compare the widths of the two intervals and comment on the reasons for any difference.

      Data for Exercise 9.118

      Tweet Rate Revenue (millions)
      1365.8 142
      1212.8 77
      581.5 61
      310.1 32
      455 31
      290 30
      250 21
      680.5 18
      150 18
      164.5 17
      113.9 16
      144.5 15
      418 14
      98 14
      100.8 12
      115.4 11
      74.4 10
      87.5 9
      127.6 9
      52.2 9
      144.1 8
      41.3 2
      2.75 0.3
    2. Use a 90% prediction interval to predict the useful life of a brand-A cutting tool when the cutting speed is 45 meters per minute. Repeat for brand B. Compare the widths of the two intervals with each other and with the two intervals you calculated in part a. Comment on the reasons for any differences.

    3. Note that the estimation and prediction you performed in parts a and b were for a value of x that was not included in the original sample. That is, the value x=45 was not part of the sample. However, the value is within the range of x values in the sample, so that the regression model spans the x value for which the estimation and prediction were made. In such situations, estimation and prediction represent interpolations.

      Suppose you were asked to predict the useful life of a brand-A cutting tool for a cutting speed of x=100 meters per minute. Since the given value of x is outside the range of the sample x values, the prediction is an example of extrapolation. Predict the useful life of a brand-A cutting tool that is operated at 100 meters per minute, and construct a 95% confidence interval for the actual useful life of the tool. What additional assumption do you have to make in order to ensure the validity of an extrapolation?
