Chapter 5. Using Data to Reason About the World

In Chapter 4, Probability, we mentioned that the mean height of US females is 65 inches. Now pretend we didn't know this fact—how could we find out what the average height is?

We could measure every US female, but that's untenable; we would run out of money, resources, and time before we had even finished with a small city!

Inferential statistics gives us the power to answer this question using a very small sample of all US women. We can use the sample to tell us something about the population we drew it from. We can use observed data to make inferences about unobserved data. By the end of this chapter, you too will be able to go out and collect a small amount of data and use it to reason about the world!

Estimating means

In the example that spans this entire chapter, we will examine how to estimate the mean height of all US women using only samples. Specifically, we will estimate a population parameter, the mean, using sample means as the estimator.

I am going to use the vector all.us.women to represent the population. For simplicity's sake, let's say there are only 10,000 US women.

  > # setting seed will make random number generation reproducible
  > set.seed(1)
  > all.us.women <- rnorm(10000, mean=65, sd=3.5)

We have just used the rnorm function to create a vector of 10,000 random draws from a normal distribution with the same parameters as our population of interest. Of course, at this point we could just call mean on this vector and call it a day, but that's cheating! We are going to see that we can get very close to the population mean without actually using the entire population.

Now, let's take a random sample of ten from this population using the sample function and compute the mean:

  > our.sample <- sample(all.us.women, 10)
  > mean(our.sample)
  [1] 64.51365

Hey, not a bad start!

Our sample will, in all likelihood, contain some short people, some people of average height, and some tall people. There's a chance that we happen to choose a sample that contains predominantly short people, or a disproportionate number of tall people. Because of this, our estimate will not be exactly accurate. However, as we include more and more people in our sample, those chance imbalances between the short and the tall tend to cancel each other out.
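To get a feel for how much a sample mean of ten can move from one draw to the next, here is a minimal sketch that repeats the sampling step five times. It regenerates the stand-in population so that it runs on its own, and the name sample.means.of.ten is introduced here purely for illustration.

```r
# regenerate the stand-in population so this block runs on its own
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)

# draw five independent samples of ten and record each sample's mean
# (sample.means.of.ten is a name introduced for this sketch)
sample.means.of.ten <- replicate(5, mean(sample(all.us.women, 10)))
print(sample.means.of.ten)

# gap between the smallest and the largest of the five estimates
print(diff(range(sample.means.of.ten)))
```

Each of the five values is an estimate of the same population mean; the gap between them is the chance variation described above.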

Note that as we increase our sample size, the sample mean isn't always closer to the population mean, but it will be closer on average.

We can test that assertion ourselves! Study the following code carefully and try running it yourself.

  > population.mean <- mean(all.us.women)
  >
  > for(sample.size in seq(5, 30, by=5)){
  +   # create empty vector with 1000 elements
  +   sample.means <- numeric(1000)
  +   for(i in 1:1000){
  +     sample.means[i] <- mean(sample(all.us.women, sample.size))
  +   }
  +   distances.from.true.mean <- abs(sample.means - population.mean)
  +   mean.distance.from.true.mean <- mean(distances.from.true.mean)
  +   print(mean.distance.from.true.mean)
  + }
  [1] 1.245492
  [1] 0.8653313
  [1] 0.7386099
  [1] 0.6355692
  [1] 0.5458136
  [1] 0.5090788

For each sample size from 5 to 30 (going up by 5), we take 1,000 different samples from the population, calculate each sample's mean, take the absolute difference between each of those means and the population mean, and average those differences.
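The loop reports averages, so it does not directly show the "not always closer" half of the claim. One way to check it, sketched here under the same simulated population, is to pit a sample of 5 against a sample of 30 many times and count how often the larger one wins. The names error.small, error.large, and proportion.large.closer are introduced for this sketch only.

```r
# regenerate the stand-in population so this block runs on its own
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)
population.mean <- mean(all.us.women)

# in each of 1000 trials, draw one sample of 5 and one sample of 30
# and record how far each sample mean lands from the population mean
error.small <- replicate(1000, abs(mean(sample(all.us.women, 5)) - population.mean))
error.large <- replicate(1000, abs(mean(sample(all.us.women, 30)) - population.mean))

# proportion of trials in which the larger sample was the closer one
proportion.large.closer <- mean(error.large < error.small)
print(proportion.large.closer)
```

The proportion should come out well above one half but clearly below one: the larger sample usually wins, yet in a fair share of trials the sample of 5 happens to land closer.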

Figure 5.1: Accuracy of sample means as a function of sample size

As you can see, increasing the sample size brings the sample mean closer to the population mean on average. Increasing the sample size also reduces the standard deviation of the sample means.

Figure 5.2: The variability of sample means as a function of sample size
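The shrinking spread of the sample means can also be checked numerically. The following sketch reuses the sample sizes from the earlier loop and records the standard deviation of 1,000 sample means at each size. For comparison, it also prints the population standard deviation divided by the square root of the sample size, the standard theoretical value for the standard deviation of a sample mean. That comparison column is an addition made here, not something taken from the listing above, and it ignores the small correction for sampling without replacement from a finite population. The names sample.sizes, sd.of.sample.means, and expected.sd are introduced for illustration.

```r
# regenerate the stand-in population so this block runs on its own
set.seed(1)
all.us.women <- rnorm(10000, mean=65, sd=3.5)

sample.sizes <- seq(5, 30, by=5)

# for each sample size, take 1000 samples and compute the
# standard deviation of their means
sd.of.sample.means <- sapply(sample.sizes, function(sample.size){
  sample.means <- replicate(1000, mean(sample(all.us.women, sample.size)))
  sd(sample.means)
})

# theoretical comparison: population sd divided by sqrt(sample size)
expected.sd <- sd(all.us.women) / sqrt(sample.sizes)

print(data.frame(sample.size=sample.sizes,
                 observed=sd.of.sample.means,
                 expected=expected.sd))
```

The observed column should fall as the sample size grows and track the expected column closely, though not exactly, since each observed value is itself estimated from only 1,000 samples.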

Knowing that, all other things being equal, larger samples are preferable to smaller ones, let's work with a sample size of 40 for now. We'll take our sample and estimate our population mean as follows:

  > our.new.sample <- sample(all.us.women, 40)
  > mean(our.new.sample)
  [1] 65.19704
