Confidence intervals

While point estimates are reasonable estimates of a population parameter, and sampling distributions are even better, there are two main issues with these approaches:

  • Single point estimates are very prone to error (due to sampling bias among other things)
  • Taking multiple samples of a certain size for sampling distributions might not be feasible, and may sometimes be even more infeasible than actually finding the population parameter

For these reasons and more, we may turn to a concept known as the confidence interval to estimate population parameters.

A confidence interval is a range of values based on a point estimate that contains the true population parameter at some confidence level.

Confidence is an important concept in advanced statistics, and its meaning is sometimes misconstrued. Informally, a confidence level does not represent the probability that one particular interval is correct; instead, it represents how often intervals built this way would contain the true value if we repeated the sampling many times. For example, if you want the procedure to capture the true population parameter 95% of the time using only a single sample and its point estimate, you have to set your confidence level to 95%.

Note

Higher confidence levels result in wider (larger) confidence intervals in order to be more sure.

Calculating a confidence interval involves finding a point estimate and then incorporating a margin of error to create a range. The margin of error is a value that quantifies how uncertain we are about our point estimate, and it is based on our desired confidence level, the variance of the data, and the size of our sample. There are many ways to calculate confidence intervals; for the purpose of brevity and simplicity, we will look at a single way of taking the confidence interval of a population mean. For this confidence interval, we need the following:

  • A point estimate. For this, we will take our sample mean of break lengths from our previous example.
  • An estimate of the standard deviation of the sample mean (its standard error), which reflects the variability in the data. This is calculated by taking the sample standard deviation (the standard deviation of the sample data) and dividing that number by the square root of the sample size.
  • The degrees of freedom (which is the sample size minus 1).
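
To see how the margin of error depends on these ingredients, here is a minimal sketch. The helper name margin_of_error and the standard deviation of 20 are assumptions made up for illustration; they are not taken from the breaks data. It multiplies the standard error by the critical value of the t distribution for the chosen confidence level:

```python
import math
from scipy import stats


def margin_of_error(sample_stdev, n, confidence=0.95):
    """Half-width of a two-sided t interval for a mean (hypothetical helper)."""
    standard_error = sample_stdev / math.sqrt(n)   # spread of the sample mean
    tail = (1 - confidence) / 2                    # probability left in each tail
    t_critical = stats.t.ppf(1 - tail, df=n - 1)   # critical value for n - 1 degrees of freedom
    return t_critical * standard_error


# Same (assumed) sample standard deviation, increasing sample sizes
for n in (25, 100, 400):
    print(n, round(margin_of_error(20, n), 2))
```

With the standard deviation held fixed, the margin shrinks as the sample grows, and it widens if a higher confidence level is requested.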

Obtaining these numbers might seem arbitrary but, trust me, there is a reason for all of them. However, again for simplicity, I will use prebuilt Python modules, as shown, to calculate our confidence interval and then demonstrate its value:

import math
import numpy as np
from scipy import stats

sample_size = 100
# the size of the sample we wish to take

sample = np.random.choice(a=breaks, size=sample_size)
# a sample of sample_size taken from the 9,000 breaks population from before

sample_mean = sample.mean()
# the sample mean of the break lengths sample

sample_stdev = sample.std(ddof=1)
# sample standard deviation (ddof=1 divides by n - 1)

sigma = sample_stdev / math.sqrt(sample_size)
# estimated standard deviation of the sample mean (standard error)

# the confidence level is passed positionally because its keyword name
# differs between SciPy versions
stats.t.interval(0.95,                  # Confidence level 95%
                 df=sample_size - 1,    # Degrees of freedom
                 loc=sample_mean,       # Sample mean
                 scale=sigma)           # Standard error of the mean
# e.g. (36.36, 45.44); the exact values depend on the random sample

To reiterate, this range of values (from 36.36 to 45.44) represents a confidence interval for the average break time with 95% confidence.

We already know that our population mean is 39.99, and the interval does include it.
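
To connect the library call back to the three ingredients, the following sketch assembles the same kind of interval by hand as mean ± critical value × standard error and checks it against stats.t.interval. The data here are simulated stand-ins (an assumption, not the breaks population), so the numbers will not match the interval above:

```python
import math
import numpy as np
from scipy import stats

# Hypothetical stand-in data: 100 simulated break lengths in minutes
rng = np.random.default_rng(0)
sample = rng.normal(loc=40, scale=20, size=100)

n = sample.size
sample_mean = sample.mean()                          # point estimate
standard_error = sample.std(ddof=1) / math.sqrt(n)   # sample std / sqrt(sample size)
df = n - 1                                           # degrees of freedom

# Two-sided 95% interval: leave 2.5% in each tail of the t distribution
t_critical = stats.t.ppf(0.975, df)
manual_interval = (sample_mean - t_critical * standard_error,
                   sample_mean + t_critical * standard_error)

# The library call should give the same endpoints
library_interval = stats.t.interval(0.95, df, loc=sample_mean, scale=standard_error)

print(manual_interval)
print(library_interval)
```

Both printed intervals should agree to floating-point precision, which shows that the library call is doing nothing more than this arithmetic.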

I mentioned earlier that the confidence level is not a measure of how accurate our interval is, but the chance that an interval built this way will contain the population parameter at all.

To better understand the confidence level, let's take 10,000 confidence intervals and see how often our population mean falls in the interval. First, let's make a function, as illustrated, that makes a single confidence interval from our breaks data:

# function to make confidence interval
def makeConfidenceInterval():
    sample_size = 100
    sample = np.random.choice(a=breaks, size=sample_size)

    sample_mean = sample.mean()
    # sample mean

    sample_stdev = sample.std(ddof=1)
    # sample standard deviation

    sigma = sample_stdev / math.sqrt(sample_size)
    # estimated standard error of the mean

    # confidence level passed positionally (its keyword name differs between SciPy versions)
    return stats.t.interval(0.95, df=sample_size - 1, loc=sample_mean, scale=sigma)

Now that we have a function that will create a single confidence interval, let's create a procedure that will test the probability that a single confidence interval will contain the true population parameter, 39.99:

  1. Take 10,000 confidence intervals of the sample mean.
  2. Count the number of times that the population parameter falls into our confidence intervals.
  3. Output the ratio of the number of times the parameter fell into the interval to 10,000:
    times_in_interval = 0.0

    for i in range(10000):
        interval = makeConfidenceInterval()
        # if 39.99 falls in the interval
        if interval[0] <= 39.99 <= interval[1]:
            times_in_interval += 1

    print(times_in_interval / 10000)
    # 0.9455

Success! We see that about 95% of our confidence intervals contained our actual population mean. Estimating population parameters through point estimates and confidence intervals is a relatively simple and powerful form of statistical inference.
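
The experiment above relies on the breaks population from earlier. For readers who want to rerun it on its own, here is a self-contained sketch under stated assumptions: the population is a simulated stand-in (9,000 normally distributed values), not the original breaks data, and all 10,000 intervals are built at once with NumPy arrays instead of a loop:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical stand-in population (not the original breaks data)
population = rng.normal(loc=40, scale=20, size=9000)
true_mean = population.mean()

n_intervals = 10000
sample_size = 100

# One row per sample, drawn with replacement like np.random.choice
samples = rng.choice(population, size=(n_intervals, sample_size))

means = samples.mean(axis=1)
standard_errors = samples.std(axis=1, ddof=1) / np.sqrt(sample_size)

# Critical value for a two-sided 95% interval
t_critical = stats.t.ppf(0.975, df=sample_size - 1)
lower = means - t_critical * standard_errors
upper = means + t_critical * standard_errors

# Fraction of intervals that contain the true population mean
coverage = np.mean((lower <= true_mean) & (true_mean <= upper))
print(coverage)
```

The printed fraction should land close to 0.95, although the exact value depends on the seed and on the simulated population.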

Let's also take a quick look at how the size of confidence intervals changes as we change our confidence level. Let's calculate confidence intervals for multiple confidence levels and look at how large the intervals are by looking at the difference between the two numbers. Our hypothesis will be that as we make our confidence level larger, we will likely see larger confidence intervals to be surer that we catch the true population parameter:

for confidence in (.5, .8, .85, .9, .95, .99):
    confidence_interval = stats.t.interval(confidence, df=sample_size - 1, loc=sample_mean, scale=sigma)

    length_of_interval = round(confidence_interval[1] - confidence_interval[0], 2)
    # the length of the confidence interval

    print("confidence {0} has an interval of size {1}".format(confidence, length_of_interval))

confidence 0.5 has an interval of size 2.56
confidence 0.8 has an interval of size 4.88
confidence 0.85 has an interval of size 5.49
confidence 0.9 has an interval of size 6.29
confidence 0.95 has an interval of size 7.51
confidence 0.99 has an interval of size 9.94

We can see that as we wish to be more confident in our interval, our interval expands in order to compensate.
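
The reason for the growth can be isolated in a short sketch. Assuming the same 99 degrees of freedom as our sample of 100, the only quantity that changes between confidence levels is the critical value of the t distribution, and the interval length is twice that value times the standard error:

```python
from scipy import stats

df = 99  # sample size of 100 minus 1

critical_values = []
for confidence in (0.5, 0.8, 0.85, 0.9, 0.95, 0.99):
    tail = (1 - confidence) / 2              # probability left in each tail
    t_critical = stats.t.ppf(1 - tail, df)   # cutoff that leaves `tail` above it
    critical_values.append(t_critical)
    print("confidence {0} uses a critical value of {1}".format(confidence, round(t_critical, 3)))
```

Because the sample mean and standard error stay fixed, the interval sizes printed earlier are proportional to these critical values.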

Next, we will take our concept of confidence levels and look at statistical hypothesis testing in order to both expand on these topics and also create (usually) even more powerful statistical inferences.
