5.5 Determining the Sample Size

Recall from Section 1.5 that one way to collect the relevant data for a study used to make inferences about a population is to implement a designed (planned) experiment. Perhaps the most important design decision faced by the analyst is to determine the size of the sample. We show in this section that the appropriate sample size for making an inference about a population mean or proportion depends on the desired reliability.

Estimating a Population Mean

Consider Example 5.1 (p. 255), in which we estimated the mean length of stay for patients in a large hospital. A sample of 100 patients’ records produced the 95% confidence interval $\bar{x} \pm 1.96\sigma_{\bar{x}} = 4.5 \pm .78$. Consequently, our estimate $\bar{x}$ was within .78 day of the true mean length of stay, μ, for all the hospital’s patients at the 95% confidence level. That is, the 95% confidence interval for μ was 2(.78)=1.56 days wide when 100 records were sampled. This is illustrated in Figure 5.16a.

Figure 5.16

Relationship between sample size and width of confidence interval: hospital-stay example

Now suppose we want to estimate μ to within .25 day with 95% confidence. That is, we want to narrow the width of the confidence interval from 1.56 days to .50 day, as shown in Figure 5.16b. How much will the sample size have to be increased to accomplish this? If we want the estimator $\bar{x}$ to be within .25 day of μ, then we must have

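\[ 1.96\sigma_{\bar{x}} = .25 \quad\text{or, equivalently,}\quad 1.96\left(\frac{\sigma}{\sqrt{n}}\right) = .25 \]

Note that we are using $\sigma_{\bar{x}} = \sigma/\sqrt{n}$ in the formula, since we are dealing with the sampling distribution of $\bar{x}$ (the estimator of μ).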

The necessary sample size is obtained by solving this equation for n. First, we need the value of σ. In Example 5.1, we assumed that the standard deviation of length of stay was σ=4 days. Thus,

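\[ 1.96\left(\frac{\sigma}{\sqrt{n}}\right) = 1.96\left(\frac{4}{\sqrt{n}}\right) = .25 \]

\[ \sqrt{n} = \frac{1.96(4)}{.25} = 31.36 \]

\[ n = (31.36)^2 = 983.45 \]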

Consequently, over 983 patients’ records will have to be sampled to estimate the mean length of stay, μ, to within .25 day with (approximately) 95% confidence. The confidence interval resulting from a sample of this size will be approximately .50 day wide. (See Figure 5.16b.)
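The arithmetic in this illustration can be checked with a few lines of Python. The following is only a sketch: the function names `half_width` and `required_n` are illustrative and not part of the text, while the values z = 1.96 and σ = 4 days are those assumed in Example 5.1.

```python
import math

Z_95 = 1.96   # z value for 95% confidence, as used in the text
SIGMA = 4.0   # assumed standard deviation of length of stay (days)

def half_width(n, sigma=SIGMA, z=Z_95):
    """Half-width of the confidence interval for the mean: z * sigma / sqrt(n)."""
    return z * sigma / math.sqrt(n)

def required_n(se, sigma=SIGMA, z=Z_95):
    """Unrounded sample size that solves z * sigma / sqrt(n) = se."""
    return (z * sigma / se) ** 2

print(round(half_width(100), 2))     # half-width with 100 records
print(round(required_n(0.25), 2))    # unrounded n for a half-width of .25 day
print(math.ceil(required_n(0.25)))   # rounded up to a whole number of records
```

Because the half-width shrinks only with the square root of n, cutting it from .78 to .25 day requires roughly a tenfold increase in the sample size.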

In general, we express the reliability associated with a confidence interval for the population mean μ by specifying the sampling error within which we want to estimate μ with $100(1-\alpha)\%$ confidence. The sampling error (denoted SE) is then equal to the half-width of the confidence interval, as shown in Figure 5.17.

Figure 5.17

Specifying the sampling error SE as the half-width of a confidence interval

The procedure for finding the sample size necessary to estimate μ with a specific sampling error is given in the following box. Note that if σ is unknown (as is usually the case, in practice), you will need to estimate the value of σ.

Determination of Sample Size for 100(1-α)% Confidence Intervals for μ

In order to estimate μ with a sampling error SE and with $100(1-\alpha)\%$ confidence, the required sample size is found as follows:

\[ z_{\alpha/2}\left(\frac{\sigma}{\sqrt{n}}\right) = SE \]

The solution for n is given by the equation

\[ n = \frac{(z_{\alpha/2})^2\sigma^2}{(SE)^2} \]

Note: The value of σ is usually unknown. It can be estimated by the standard deviation s from a previous sample. Alternatively, we may approximate the range R of observations in the population and (conservatively) estimate $\sigma \approx R/4$. In any case, you should round the value of n obtained upward to ensure that the sample size will be sufficient to achieve the specified reliability.
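The boxed procedure can be sketched as a small Python routine. This is one possible implementation under stated assumptions, not a prescribed method: the function names are hypothetical, and the z value is computed with `statistics.NormalDist` from the Python standard library rather than read from a table, so it carries more decimal places than the table values used in this section.

```python
import math
from statistics import NormalDist

def z_value(conf_level):
    """Upper alpha/2 critical value of the standard normal (about 1.96 for 0.95)."""
    alpha = 1.0 - conf_level
    return NormalDist().inv_cdf(1.0 - alpha / 2.0)

def sample_size_mean(se, sigma, conf_level):
    """n = z^2 * sigma^2 / SE^2, rounded up to the next whole number."""
    z = z_value(conf_level)
    return math.ceil((z * sigma / se) ** 2)

def sigma_from_range(low, high):
    """Rough, conservative estimate of sigma as R/4, as described in the Note."""
    return (high - low) / 4.0

# Hospital-stay illustration: SE = .25 day, sigma = 4 days, 95% confidence
print(sample_size_mean(0.25, 4.0, 0.95))
```

Because the computed z value differs slightly from a rounded table value, the unrounded n may differ in the second decimal place from a hand calculation. Rounding upward usually removes the difference, although a result that falls very close to a whole number could shift by one.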

Example 5.9 Sample Size for Estimating μ— Mean Inflation Pressure of Footballs

Problem

  1. Suppose the manufacturer of official NFL footballs uses a machine to inflate the new balls to a pressure of 13.5 pounds. When the machine is properly calibrated, the mean inflation pressure is 13.5 pounds, but uncontrollable factors cause the pressures of individual footballs to vary randomly from about 13.3 to 13.7 pounds. For quality control purposes, the manufacturer wishes to estimate the mean inflation pressure to within .025 pound of its true value with a 99% confidence interval. What sample size should be specified for the experiment?

Solution

  1. We desire a 99% confidence interval that estimates μ with a sampling error of SE = .025 pound. For a 99% confidence interval, we have $z_{\alpha/2} = z_{.005} = 2.575$. To estimate σ, we note that the range of observations is $R = 13.7 - 13.3 = .4$ and we use $\sigma \approx R/4 = .1$. Next, we employ the formula derived in the box to find the sample size n:

    \[ n = \frac{(z_{\alpha/2})^2\sigma^2}{(SE)^2} \approx \frac{(2.575)^2(.1)^2}{(.025)^2} = 106.09 \]

    We round this up to n=107. Realizing that σ was approximated by R/4, we might even advise that the sample size be specified as n=110 to be more certain of attaining the objective of a 99% confidence interval with a sampling error of .025 pound or less.

Look Back

To determine the value of the sampling error SE, look for the value that follows the key words “estimate μ to within .”

Now Work Exercise 5.84

Sometimes the formula will lead to a solution that indicates a small sample size is sufficient to achieve the confidence interval goal. Unfortunately, the procedures and assumptions for small samples differ from those for large samples, as we discovered in Section 5.3. Therefore, if the formulas yield a small sample size, one simple strategy is to select a sample size $n \geq 30$.
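This safeguard is easy to express in code. The sketch below assumes the large-sample threshold of 30 mentioned above; the function name `planned_sample_size` is illustrative only.

```python
import math

MIN_LARGE_SAMPLE = 30  # large-sample threshold used in the text

def planned_sample_size(n_from_formula):
    """Round the formula's n upward, then raise it to 30 if it falls below that."""
    return max(MIN_LARGE_SAMPLE, math.ceil(n_from_formula))

print(planned_sample_size(12.3))    # small result is raised to the threshold
print(planned_sample_size(106.09))  # large result is simply rounded up
```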

Estimating a Population Proportion

The method just outlined is easily applied to a population proportion p. To illustrate, in Example 5.6 (p. 275) a pollster used a sample of 1,000 U.S. citizens to calculate a 95% confidence interval for the proportion who trust the president, obtaining the interval .637±.03. Suppose the pollster wishes to estimate more precisely the proportion who trust the president, say, to within .015 with a 95% confidence interval.

The pollster wants a confidence interval for p with a sampling error SE=.015. The sample size required to generate such an interval is found by solving the following equation for n:

\[ z_{\alpha/2}\sigma_{\hat{p}} = SE \quad\text{or}\quad z_{\alpha/2}\sqrt{\frac{pq}{n}} = .015 \]
(see Figure 5.18)

Figure 5.18

Specifying the sampling error SE of a confidence interval for a population proportion p

Since a 95% confidence interval is desired, the appropriate z value is $z_{\alpha/2} = z_{.025} = 1.96$. We must approximate the value of the product pq before we can solve the equation for n. As shown in Table 5.6 (p. 276), the closer the values of p and q to .5, the larger is the product pq. Thus, to find a conservatively large sample size that will generate a confidence interval with the specified reliability, we generally choose an approximation of p close to .5. In the case of the proportion of U.S. citizens who trust the president, however, we have an initial sample estimate of $\hat{p} = .637$. A conservatively large estimate of pq can therefore be obtained by using, say, p=.60. We now substitute into the equation and solve for n:

\[ 1.96\sqrt{\frac{(.60)(.40)}{n}} = .015 \]

\[ n = \frac{(1.96)^2(.60)(.40)}{(.015)^2} = 4{,}097.7 \approx 4{,}098 \]

The pollster must sample about 4,098 U.S. citizens to estimate the percentage who trust the president with a confidence interval of width .03.
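To see why a value of p nearer .5 gives a conservatively large sample size, it helps to compute n for several candidate values of p. The sketch below uses the z value and sampling error from the pollster illustration; the function name `n_for_proportion` and the particular list of candidate values are illustrative assumptions.

```python
import math

Z_95 = 1.96  # z value for 95% confidence, as in the text
SE = 0.015   # desired sampling error

def n_for_proportion(p, se=SE, z=Z_95):
    """Unrounded n that solves z * sqrt(p*q/n) = se, with q = 1 - p."""
    return z ** 2 * p * (1 - p) / se ** 2

# The product pq, and therefore n, is largest at p = .5
for p in (0.5, 0.6, 0.637, 0.7):
    n = n_for_proportion(p)
    print(f"p={p:.3f}  pq={p * (1 - p):.4f}  n={math.ceil(n)}")
```

The row for p = .60 reproduces the value of 4,098 obtained above, and the required n falls as p moves away from .5.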

The procedure for finding the sample size necessary to estimate a population proportion p with a specified sampling error SE is given in the following box:

Determination of Sample Size for 100(1-α)% Confidence Interval for p

In order to estimate a binomial probability p with sampling error SE and with $100(1-\alpha)\%$ confidence, the required sample size is found by solving the following equation for n:

\[ z_{\alpha/2}\sqrt{\frac{pq}{n}} = SE \]

The solution for n can be written as follows:

\[ n = \frac{(z_{\alpha/2})^2(pq)}{(SE)^2} \]

Note: Because the value of the product pq is unknown, it can be estimated by using the sample fraction of successes, $\hat{p}$, from a previous sample (with $\hat{q} = 1 - \hat{p}$). Remember (Table 5.6) that the value of pq is at its maximum when p equals .5, so you can obtain conservatively large values of n by approximating p by .5 or values close to .5. In any case, you should round the value of n obtained upward to ensure that the sample size will be sufficient to achieve the specified reliability.
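A general version of this box might be written as follows. This is a sketch rather than a definitive implementation: the name `sample_size_proportion` is hypothetical, the default p = .5 reflects the conservative choice described in the Note, and the z value comes from `statistics.NormalDist` in the Python standard library instead of a table.

```python
import math
from statistics import NormalDist

def sample_size_proportion(se, conf_level, p=0.5):
    """n = z^2 * p * q / SE^2, rounded up; p defaults to the conservative .5."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - conf_level) / 2.0)
    return math.ceil(z ** 2 * p * (1.0 - p) / se ** 2)

# Pollster illustration: SE = .015, 95% confidence, p approximated by .60
print(sample_size_proportion(0.015, 0.95, p=0.60))
# With no prior information about p, the default gives a larger n
print(sample_size_proportion(0.015, 0.95))
```

Since the computed z value is not rounded to two or three decimal places, the result can occasionally differ by one unit from a hand calculation based on a table value.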

Example 5.10 Sample Size for Estimating p — Fraction of Defective Cell Phones

Problem

  1. A cellular telephone manufacturer that entered the post-regulation market quickly has an initial problem with excessive customer complaints and consequent returns of cell phones for repair or replacement. The manufacturer wants to estimate the magnitude of the problem in order to design a quality control program. How many cellular telephones should be sampled and checked in order to estimate the fraction defective, p, to within .01 with 90% confidence?

Solution

  1. In order to estimate p to within .01 of its true value, we set the half-width of the confidence interval equal to SE=.01, as shown in Figure 5.19.

    The equation for the sample size n requires an estimate of the product pq. We could most conservatively estimate pq=.25 (i.e., use p=.5), but this estimate may be too conservative. By contrast, a value of .1, corresponding to 10% defective, will probably be conservatively large for this application. The solution is therefore

    \[ n = \frac{(z_{\alpha/2})^2(pq)}{(SE)^2} = \frac{(1.645)^2(.1)(.9)}{(.01)^2} = 2{,}435.4 \approx 2{,}436 \]

    Thus, the manufacturer should sample 2,436 telephones in order to estimate the fraction defective, p, to within .01 with 90% confidence.

Figure 5.19

Specified reliability for estimate of fraction defective in Example 5.10

Look Back

Remember that this answer depends on our approximation of pq, for which we used .09. If the fraction defective is closer to .05 than .10, we can use a sample of 1,286 telephones (check this) to estimate p to within .01 with 90% confidence.
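The check suggested here takes only a few lines of Python. The sketch below uses the table value z = 1.645 and SE = .01 from the example; the function name `n_required` is illustrative. It also includes the most conservative choice, p = .5, for comparison.

```python
import math

Z_90 = 1.645  # z value for 90% confidence, as used in the example
SE = 0.01     # desired sampling error

def n_required(p, se=SE, z=Z_90):
    """Sample size z^2 * p * (1 - p) / se^2, rounded up."""
    return math.ceil(z ** 2 * p * (1 - p) / se ** 2)

for p in (0.05, 0.10, 0.50):
    print(p, n_required(p))
```

The values for p = .05 and p = .10 agree with the 1,286 and 2,436 telephones quoted in the text, while p = .5 would call for 6,766 telephones.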

Now Work Exercise 5.88

The cost of sampling will also play an important role in the final determination of the sample size to be selected to estimate either μ or p. Although more complex formulas can be derived to balance the reliability and cost considerations, we will solve for the necessary sample size and note that the sampling budget may be a limiting factor. (Consult the references for a more complete treatment of this problem.) Once the sample size n is determined, be sure to devise a sampling plan that will ensure that a representative sample is selected from the target population.
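One simple way to see how a budget limits reliability is to compare the required sample size with the largest sample the budget allows, and then compute the sampling error that the affordable sample would deliver. The sketch below reuses the values from Example 5.9. The cost per unit and the budget are hypothetical figures chosen only for illustration, and the function names are not from the text.

```python
import math

def affordable_n(budget, cost_per_unit):
    """Largest whole sample size the budget allows."""
    return int(budget // cost_per_unit)

def achieved_se(n, sigma, z):
    """Sampling error z * sigma / sqrt(n) obtained with a sample of size n."""
    return z * sigma / math.sqrt(n)

# z, sigma, and target SE are from Example 5.9; the cost figures are hypothetical.
z, sigma, target_se = 2.575, 0.1, 0.025
n_needed = math.ceil((z * sigma / target_se) ** 2)
n_max = affordable_n(budget=400.0, cost_per_unit=5.0)

if n_max >= n_needed:
    print(f"Budget covers the required n = {n_needed}")
else:
    print(f"Budget allows n = {n_max}; sampling error would be about "
          f"{achieved_se(n_max, sigma, z):.4f} instead of {target_se}")
```

When the budget falls short, the analyst must either accept a larger sampling error or a lower confidence level, or seek additional funds.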

Ethics in Statistics

In sampling, intentional omission of experimental units (e.g., respondents in a survey) in order to bias the results toward a particular view or outcome is considered unethical statistical practice.

Statistics in Action Revisited

Determining Sample Size

In the previous Statistics in Action applications in this chapter, we used confidence intervals (1) to estimate μ, the mean overpayment amount for claims in a Medicare fraud study, and (2) to estimate p, the coding error rate (i.e., proportion of claims that are incorrectly coded) of a Medicare provider. Both of these confidence intervals were based on selecting a random sample of 52 claims from the population of claims handled by the Medicare provider. How does the USDOJ determine how many claims to sample for auditing?

Consider the problem of estimating the coding error rate, p. As stated in a previous Statistics in Action Revisited, the USDOJ typically finds that about 50% of the claims in a Medicare fraud case are incorrectly coded. Suppose the USDOJ wants to estimate the true coding error rate of a Medicare provider to within .1 with 95% confidence. How many claims should be randomly sampled for audit in order to attain the desired estimate?

Here, the USDOJ desires a sampling error of SE=.1, a confidence level of $1-\alpha = .95$ (for which $z_{\alpha/2} = 1.96$), and uses an estimate $p \approx .50$. Substituting these values into the sample size formula (p. 283), we obtain

\[ n = \frac{(z_{\alpha/2})^2(pq)}{(SE)^2} = \frac{(1.96)^2(.5)(.5)}{(.1)^2} = 96.04 \]

Consequently, the USDOJ should audit about 97 randomly selected claims to attain a 95% confidence interval for p with a sampling error of .10.

[Note: You may wonder why the sample actually used in the fraud analysis included only 52 claims. The sampling strategy employed involved more than selecting a simple random sample; rather, it used a more sophisticated sampling scheme, called stratified random sampling. The 52 claims represented the sample for just one of the strata.]

Exercises 5.74–5.98

Understanding the Principles

  1. 5.74 How does the sampling error SE compare with the width of a confidence interval?

  2. 5.75 True or false. For a specified sampling error SE, increasing the confidence level $(1-\alpha)$ will lead to a larger n in determining the sample size.

  3. 5.76 True or false. For a fixed confidence level $(1-\alpha)$, increasing the sampling error SE will lead to a smaller n in determining the sample size.

Learning the Mechanics

  1. 5.77 If you wish to estimate a population mean to within .2 with a 95% confidence interval and you know from previous sampling that $\sigma^2$ is approximately equal to 5.4, how many observations would you have to include in your sample?

  2. 5.78 If nothing is known about p, .5 can be substituted for p in the sample-size formula for a population proportion. But when this is done, the resulting sample size may be larger than needed. Under what circumstances will using p=.5 in the sample-size formula yield a sample size larger than is needed to construct a confidence interval for p with a specified bound and a specified confidence level?

  3. 5.79 Suppose you wish to estimate a population mean correct to within .15 with a confidence level of .90. You do not know $\sigma^2$, but you know that the observations will range in value between 31 and 39.

    1. Find the approximate sample size that will produce the desired accuracy of the estimate. You wish to be conservative to ensure that the sample size will be ample for achieving the desired accuracy of the estimate. [Hint: Using your knowledge of data variation from Section 2.4, assume that the range of the observations will equal 4σ.]

    2. Calculate the approximate sample size, making the less conservative assumption that the range of the observations is equal to 6σ.

  4. 5.80 In each case, find the approximate sample size required to construct a 95% confidence interval for p that has sampling error SE=.06.

    1. Assume that p is near .3.

    2. Assume that you have no prior knowledge about p, but you wish to be certain that your sample is large enough to achieve the specified accuracy for the estimate.

  5. 5.81 The following is a 90% confidence interval for p: (.26, .54). How large was the sample used to construct this interval?

  6. 5.82 It costs you $10 to draw a sample of size n=1 and measure the attribute of interest. You have a budget of $1,200.

    1. Do you have sufficient funds to estimate the population mean for the attribute of interest with a 95% confidence interval 4 units in width? Assume that σ=12.

    2. If a 90% confidence level were used, would your answer to part a change? Explain.

  7. 5.83 Suppose you wish to estimate the mean of a normal population with a 95% confidence interval and you know from prior information that $\sigma^2 \approx 1$.

    1. To see the effect of the sample size on the width of the confidence interval, calculate the width of the confidence interval for n = 16, 25, 49, 100, and 400.

    2. Plot the width as a function of sample size n on graph paper. Connect the points by a smooth curve, and note how the width decreases as n increases.

Applying the Concepts—Basic

  1. 5.84 Giraffes have excellent vision. Refer to the African Zoology (Oct. 2013) study of a giraffe’s eyesight, Exercise 5.38 (p. 272). Recall that the researchers measured the eye mass for a sample of 27 giraffes native to Zimbabwe, Africa, and found $\bar{x} = 53.4$ grams and $s = 8.6$ grams. Suppose the objective is to sample enough giraffes in order to obtain an estimate of the mean eye mass to within 3 grams of its true value with a 99% confidence interval.

    1. Identify the confidence coefficient for this study.

    2. Identify the desired sampling error for this study.

    3. Find the sample size required to obtain the desired estimate of the true mean.

  2. SHAFTS 5.85 Shaft graves in ancient Greece. Refer to the American Journal of Archaeology (Jan. 2014) study of shaft graves in ancient Greece, Exercise 5.46 (p. 273). Recall that you estimated μ, the average number of shafts buried in ancient Greece graves, using data collected for 13 recently discovered grave sites and a 90% confidence interval. However, you would like to reduce the width of the interval for μ.

    1. Will increasing the confidence level to .95 reduce the width of the interval?

    2. Will increasing the sample size reduce the width of the interval?

    3. Determine the sample size required to estimate μ to within .5 shaft with 90% confidence.

  3. 5.86 Risk of home burglary in cul-de-sacs. Research published in the Journal of Quantitative Criminology (Mar. 2010) revealed that the risk of burglaries in homes located on cul-de-sacs is lower than for homes on major roads. Suppose you want to estimate the true percentage of cul-de-sac homes in your home city that were burglarized in the past year. Devise a sampling plan so that your estimate will be accurate to within 2% of the true value using a confidence coefficient of 95%. How many cul-de-sac homes need to be sampled and what information do you need to collect for each sampled home? 

  4. 5.87 Lobster trap placement. Refer to the Bulletin of Marine Science (Apr. 2010) study of lobster trap placement, Exercise 5.41 (p. 273). Recall that you used a 95% confidence interval to estimate the mean trap spacing (in meters) for the population of red spiny lobster fishermen fishing in Baja California Sur, Mexico. How many teams of fishermen would need to be sampled in order to reduce the width of the confidence interval to 5 meters? Use the sample standard deviation from Exercise 5.41 in your calculation. 

  5. 5.88 Aluminum cans contaminated by fire. A gigantic warehouse located in Tampa, Florida, stores approximately 60 million empty aluminum beer and soda cans. Recently, a fire occurred at the warehouse. The smoke from the fire contaminated many of the cans with blackspot, rendering them unusable. A University of South Florida statistician was hired by the insurance company to estimate p, the true proportion of cans in the warehouse that were contaminated by the fire. How many aluminum cans should be randomly sampled to estimate the true proportion to within .02 with 90% confidence? 

  6. 5.89 Do social robots walk or roll? Refer to the International Conference on Social Robotics (Vol. 6414, 2010) study of the trend in the design of social robots, Exercise 5.67 (p. 281). Recall that you used a 99% confidence interval to estimate the proportion of all social robots designed with legs but no wheels. How many social robots would need to be sampled in order to estimate the proportion to within .075 of its true value?

Applying the Concepts—Intermediate

  1. 5.90 Duration of daylight in western Pennsylvania. Refer to the Naval Oceanography Portal data on number of minutes of daylight per day in Sharon, PA, Exercise 5.43 (p. 273). An estimate of the mean number of minutes of daylight per day was obtained using data collected for 12 randomly selected days (one each month) in a recent year.

    1. Determine the number of days that need to be sampled in order to estimate the desired mean to within 45 minutes of its true value with 95% confidence.

    2. Based on your answer, part a, develop a sampling plan that will likely result in a random sample that is representative of the population.

    3. Go to the Web site, http://aa.usno.navy.mil/USNO/astronomical-applications/data-services, and collect the data for Sharon, PA using your sampling plan.

    4. Use the data, part c, to construct a 95% confidence interval for the desired mean. Does your interval have the desired width?

  2. 5.91 Pitch memory of amusiacs. Refer to the Advances in Cognitive Psychology (Vol. 6, 2010) study of pitch memory of amusiacs, Exercise 5.45 (p. 273). Recall that diagnosed amusiacs listened to a series of tone pairs and were asked to determine if the tones were the same or different. In the first trial, the tones were separated by 1 second; in the second trial, the tones were separated by 5 seconds. The variable of interest was the difference between scores on the two trials. How many amusiacs would need to participate in the study in order to estimate the true mean score difference for all amusiacs to within .05 with 90% confidence? 

  3. 5.92 Shopping on Black Friday. Refer to the International Journal of Retail and Distribution Management (Vol. 39, 2011) survey of Black Friday shoppers, Exercise 5.22 (p. 263). One question was “How many hours do you usually spend shopping on Black Friday?”

    1. How many Black Friday shoppers should be included in a sample designed to estimate (with 95% confidence) the average number of hours spent shopping on Black Friday if you want the estimate to deviate no more than .5 hour from the true mean?

    2. Devise a sampling plan for collecting the data that will likely result in a representative sample.

  4. 5.93 Study of aircraft bird strikes. Refer to the International Journal for Traffic and Transport Engineering (Vol. 3, 2013) study of aircraft bird strikes at a Nigerian airport, Exercise 5.68 (p. 281). Recall that an air traffic controller wants to estimate the true percentage of aircraft bird strikes that occur above 100 feet. Determine how many aircraft bird strikes need to be analyzed in order to estimate the true percentage to within 5% if you use a 95% confidence interval.

  5. 5.94 Bacteria in bottled water. Is the bottled water you drink safe? The Natural Resources Defense Council warns that the bottled water you are drinking may contain more bacteria and other potentially carcinogenic chemicals than are allowed by state and federal regulations. Of the more than 1,000 bottles studied, nearly one-third exceeded government levels. Suppose that the Natural Resources Defense Council wants an updated estimate of the population proportion of bottled water that violates at least one government standard. Determine the sample size (number of bottles) needed to estimate this proportion to within ±0.01 with 99% confidence. 

  6. 5.95 Do you think you smell? Refer to the Depression and Anxiety (June 2010) study of patients who suffer from olfactory reference syndrome (ORS), Exercise 5.72 (p. 282). Recall that psychiatrists disagree over how prevalent ORS is in the human population. Suppose you want to estimate the true proportion of U.S. adults who suffer from ORS using a 99% confidence interval. Determine the size of the sample necessary to attain a sampling error no larger than .04. 

  7. 5.96 Caffeine content of coffee. According to a Food and Drug Administration (FDA) study, a cup of coffee contains an average of 115 milligrams (mg) of caffeine, with the amount per cup ranging from 60 to 180 mg. Suppose you want to repeat the FDA experiment in order to obtain an estimate of the mean caffeine content in a cup of coffee correct to within 5 mg with 95% confidence. How many cups of coffee would have to be included in your sample? 

  8. 5.97 Eye shadow, mascara, and nickel allergies. Refer to the Journal of the European Academy of Dermatology and Venereology (June 2010) study of the link between nickel allergies and use of mascara or eye shadow, Exercise 5.73 (p. 282). Recall that two groups of women were sampled: one group with cosmetic dermatitis from using eye shadow and another group with cosmetic dermatitis from using mascara. In either group, how many women would need to be sampled in order to yield an estimate (with 95% confidence) of the population percentage with a nickel allergy that falls no more than 3% from the true value? 

Applying the Concepts—Advanced

  1. 5.98 Preventing production of defective items. It costs more to produce defective items—since they must be scrapped or reworked—than it does to produce nondefective items. This simple fact suggests that manufacturers should ensure the quality of their products by perfecting their production processes instead of depending on inspection of finished products (Deming, 1986). In order to better understand a particular metal stamping process, a manufacturer wishes to estimate the mean length of items produced by the process during the past 24 hours.

    1. How many parts should be sampled in order to estimate the population mean to within .1 millimeter (mm) with 90% confidence? Previous studies of this machine have indicated that the standard deviation of lengths produced by the stamping operation is about 2 mm.

    2. Time permits the use of a sample size no larger than 100. If a 90% confidence interval for μ is constructed with n=100, will it be wider or narrower than would have been obtained using the sample size determined in part a? Explain.

    3. If management requires that μ be estimated to within .1 mm and that a sample size of no more than 100 be used, what is (approximately) the maximum confidence level that could be attained for a confidence interval that meets management’s specifications?
