Chapter 7. Bayesian Methods

Suppose I claim that I have a pair of magic rainbow socks. I allege that whenever I wear these special socks, I gain the ability to predict the outcome of coin tosses, using fair coins, better than chance would dictate. Putting my claim to the test, you toss a coin 30 times, and I correctly predict the outcome 20 times. Using a directional hypothesis with the binomial test, the null hypothesis would be rejected at alpha-level 0.05. Would you invest in my special socks?

Why not? If it's because you require a larger burden of proof for absurd claims, I don't blame you. As Pierre-Simon Laplace, a grandparent of Bayesian analysis (who independently discovered the theorem that bears Thomas Bayes' name), once said: "The weight of evidence for an extraordinary claim must be proportioned to its strangeness." Our prior belief in my absurd hypothesis is so small that it would take much stronger evidence to convince the skeptical investor, let alone the scientific community.

Unfortunately, if you'd like to easily incorporate your prior beliefs into NHST, you're out of luck. If you need to assess the probability of the null hypothesis, you're out of luck there, too: NHST assumes that the null hypothesis is true and can't make claims about the probability that a particular hypothesis is true. In cases like these (and in general), you may want to use Bayesian methods instead of frequentist methods. This chapter will tell you how. Join me!
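Before moving on, the directional binomial test from the start of this chapter can be checked with base R's binom.test function. This is only a quick sketch of that check; the variable names socks_test and p_value are my own.

```r
# One-sided exact binomial test: 20 correct predictions in 30 tosses,
# against the null hypothesis that the true proportion is 0.5
socks_test <- binom.test(20, 30, p = 0.5, alternative = "greater")
p_value <- socks_test$p.value
print(p_value)    # just under 0.05, so the null is rejected at alpha = 0.05
```

Note that the p-value only tells us how surprising the data would be if the null hypothesis were true. It says nothing about how believable the socks claim was to begin with.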

The big idea behind Bayesian analysis

As you may recall from Chapter 4, Probability, the Bayesian interpretation of probability views probability as our degree of belief in a claim or hypothesis, and Bayesian inference tells us how to update that belief in the light of new evidence. In that chapter, we used Bayesian inference to determine the probability that employees of Daisy Girl, Inc. were using an illegal drug. We saw how the incorporation of prior beliefs saved two employees from being falsely accused and helped another employee get the help she needed even though her drug screen was falsely negative.

In a general sense, Bayesian methods tell us how to dole out credibility to different hypotheses, given prior belief in those hypotheses and new evidence. In the drug example, the hypothesis suite was discrete: drug user or not drug user. More commonly, though, when we perform Bayesian analysis, our hypothesis concerns a continuous parameter, or many parameters. Our posterior (or updated beliefs) was also discrete in the drug example, but Bayesian analysis usually yields a continuous posterior called a posterior distribution.

We are going to use Bayesian analysis to put my magical rainbow socks claim to the test. Our parameter of interest is the proportion of coin tosses that I can correctly predict wearing the socks; we'll call this parameter θ, or theta. Our goal is to determine what the most likely values of theta are and whether they constitute proof of my claim.

Refer back to the section on Bayes' theorem in Chapter 4, Probability. Recall that the posterior was the prior times the likelihood divided by a normalizing constant. This normalizing constant is often difficult to compute. Luckily, since it doesn't change the shape of the posterior distribution, and we are comparing relative likelihoods and probability densities, Bayesian methods often ignore this constant. So, all we need is a probability density function to describe our prior belief and a likelihood function that describes the likelihood that we would get the evidence we received given different parameter values.
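In symbols, the relationship just described can be written as follows. The notation is my own choice for illustration: D stands for the observed data (the evidence), and p(D) is the normalizing constant.

```latex
p(\theta \mid D) \;=\; \frac{p(D \mid \theta)\, p(\theta)}{p(D)}
\;\propto\; p(D \mid \theta)\, p(\theta)
```

Because p(D) does not depend on theta, dropping it rescales the posterior curve without changing its shape.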

The likelihood function is a binomial function, as it describes the behavior of Bernoulli trials; the binomial likelihood function for this evidence is shown in Figure 7.1:

Figure 7.1: The likelihood function of theta for 20 out of 30 successful Bernoulli trials.

For different values of theta, there are varying relative likelihoods. Note that the value of theta that corresponds to the maximum of the likelihood function is 0.667, which is the proportion of successful Bernoulli trials. This means that in the absence of any other information, the most likely proportion of coin flips that my magic socks allow me to predict is 67%. This is called the Maximum Likelihood Estimate (MLE).
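The likelihood function and its maximum can be approximated in a few lines of R. This is a minimal sketch that evaluates the binomial likelihood on a grid of candidate theta values; the grid spacing and the variable names are arbitrary choices of mine.

```r
# Candidate values of theta on a fine grid
theta <- seq(0, 1, by = 0.001)
# Likelihood of observing 20 successes in 30 trials at each theta
likelihood <- dbinom(20, size = 30, prob = theta)
# The grid value with the highest likelihood approximates the MLE
mle <- theta[which.max(likelihood)]
print(mle)    # close to 20/30
```

Plotting likelihood against theta should give a curve with the same shape as the one in Figure 7.1.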

So, we have the likelihood function; now we just need to choose a prior. We will be crafting a representation of our prior beliefs using a type of distribution called a beta distribution, for reasons that we'll see very soon.

Since our posterior is a blend of the prior and likelihood function, it is common for analysts to use a prior that doesn't much influence the results and allows the likelihood function to speak for itself. To this end, one may choose to use a non-informative prior that assigns equal credibility to all values of theta. This type of non-informative prior is called a flat or uniform prior.

The beta distribution has two hyper-parameters, α (or alpha) and β (or beta). A beta distribution with hyper-parameters α = β = 1 describes such a flat prior. We will call this prior #1.

Note

These are usually referred to as the beta distribution's parameters. We call them hyper-parameters here to distinguish them from our parameter of interest, theta.

Figure 7.2: A flat prior on the value of theta. This beta distribution, with alpha = beta = 1, confers an equal level of credibility to all possible values of theta, our parameter of interest.

This prior isn't really indicative of our beliefs, is it? Do we really assign as much probability to my socks giving me perfect coin-flip prediction powers as we do to the hypothesis that I'm full of baloney?

The prior that a skeptic might choose in this situation is one that looks more like the one depicted in Figure 7.3, a beta distribution with hyper-parameters alpha = beta = 50. This, rather appropriately, assigns far more credibility to values of theta that are concordant with a universe without magical rainbow socks. As good scientists, though, we have to be open-minded to new possibilities, so this doesn't rule out the possibility that the socks give me special powers—the probability is low, but not zero, for extreme values of theta. We will call this prior #2.

Figure 7.3: A skeptic's prior
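One way to get a feel for the difference between the two priors is to ask how much credibility each one places on values of theta near chance. The sketch below uses the range 0.4 to 0.6, which is an arbitrary cutoff I picked for illustration, and a helper function near_chance that is not part of any package.

```r
# Prior mass placed on "near chance" values of theta, between 0.4 and 0.6
near_chance <- function(alpha, beta) {
  pbeta(0.6, alpha, beta) - pbeta(0.4, alpha, beta)
}
flat_mass    <- near_chance(1, 1)     # prior #1: 0.2, the width of the range
skeptic_mass <- near_chance(50, 50)   # prior #2: the large majority of the mass
print(c(flat = flat_mass, skeptic = skeptic_mass))

# Prior #2 still leaves a tiny but nonzero density at extreme values
print(dbeta(0.9, 50, 50) > 0)
```

The flat prior spreads its belief evenly, while the skeptic's prior concentrates it around 0.5 without ever ruling out the extremes.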

Before we perform the Bayesian update, I need to explain why I chose to use the beta distribution to describe my priors.

The Bayesian update (getting to the posterior) is performed by multiplying the prior by the likelihood. In the vast majority of applications of Bayesian analysis, we don't know what that posterior looks like, so we have to sample from it many times to get a sense of its shape. We will be doing this later in this chapter.

For cases like this, though, where the likelihood is a binomial function, using a beta distribution for our prior guarantees that our posterior will also be in the beta distribution family. This is because the beta distribution is a conjugate prior with respect to a binomial likelihood function. There are many other cases of distributions being conjugate priors with respect to certain likelihood functions, but it doesn't often happen in practice that we find ourselves in a position to use them as easily as we can for this problem. The beta distribution also has the nice property that it is naturally confined from 0 to 1, just like the proportion of coin flips I can correctly predict.
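To see why the beta family is preserved, it helps to multiply the two functions and ignore every factor that does not involve theta. In the sketch below, k is the number of successes and n is the number of trials; these symbols are my own notation.

```latex
\underbrace{\theta^{\alpha - 1}(1 - \theta)^{\beta - 1}}_{\text{beta prior}}
\times
\underbrace{\theta^{k}(1 - \theta)^{n - k}}_{\text{binomial likelihood}}
\;=\;
\theta^{(\alpha + k) - 1}(1 - \theta)^{(\beta + n - k) - 1}
```

The right-hand side has the same form as a beta density, with hyper-parameters alpha + k and beta + (n - k).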

The fact that we know how to compute the posterior from the prior and likelihood by just changing the beta distribution's hyper-parameters makes things really easy in this case. The hyper-parameters of the posterior distribution are:

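\[
\alpha_{\text{posterior}} = \alpha_{\text{prior}} + \text{number of successes}, \qquad
\beta_{\text{posterior}} = \beta_{\text{prior}} + \text{number of failures}
\]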

That means the posterior distribution using prior #1 will have hyper-parameters alpha=1+20 and beta=1+10. This is shown in Figure 7.4.

Figure 7.4: The result of the Bayesian update of the evidence and prior #1. The interval depicts the 95% credible interval (the densest 95% of the area under the posterior distribution). This interval overlaps slightly with theta = 0.5.
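The update for prior #1 can be sketched in R and checked against a brute-force calculation that multiplies the prior by the likelihood on a grid and then normalizes. The helper update_beta and the grid size are assumptions of mine for illustration, not functions from a package.

```r
# Conjugate update: add successes to alpha and failures to beta
update_beta <- function(alpha, beta, successes, failures) {
  c(alpha = alpha + successes, beta = beta + failures)
}
post1 <- update_beta(1, 1, successes = 20, failures = 10)
print(post1)    # alpha = 21, beta = 11

# Brute-force check: multiply prior by likelihood on a grid and normalize
theta  <- seq(0, 1, length.out = 1001)
unnorm <- dbeta(theta, 1, 1) * dbinom(20, size = 30, prob = theta)
grid_posterior <- unnorm / (sum(unnorm) * (theta[2] - theta[1]))

# The two versions should agree up to a small numerical error
exact_posterior <- dbeta(theta, post1[["alpha"]], post1[["beta"]])
max_gap <- max(abs(grid_posterior - exact_posterior))
print(max_gap)
```

The grid version is what we would fall back on if no conjugate shortcut were available; here it simply confirms the shortcut.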

A common way of summarizing the posterior distribution is with a credible interval. The credible interval on the plot in Figure 7.4 is the 95% credible interval: the densest region that contains 95% of the area under the curve of the posterior distribution.

Do not confuse this with a confidence interval. Though it may look like one, a credible interval is very different from a confidence interval. Since the posterior directly contains information about the probability of our parameter of interest at different values, it is admissible to claim that there is a 95% chance that the correct parameter value is in the credible interval. We could make no such claim with confidence intervals. Please do not mix up the two meanings, or people will laugh you out of town.

Observe that the 95% most likely values for theta contain the theta value 0.5, if only barely. Due to this, one may wish to say that the evidence does not rule out the possibility that I'm full of baloney regarding my magical rainbow socks, but that the evidence is suggestive.
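The text does not say how the interval in Figure 7.4 was computed, so what follows is just one possible way to find a densest 95% interval using base R. The function beta_hdi is my own helper: for a single-peaked beta distribution, it searches for the narrowest interval that holds the requested probability mass.

```r
# Highest-density interval for a unimodal Beta(a, b) distribution:
# among all intervals holding `mass` probability, find the narrowest one
beta_hdi <- function(a, b, mass = 0.95) {
  width <- function(p_low) qbeta(p_low + mass, a, b) - qbeta(p_low, a, b)
  p_low <- optimize(width, interval = c(0, 1 - mass))$minimum
  c(lower = qbeta(p_low, a, b), upper = qbeta(p_low + mass, a, b))
}
hdi1 <- beta_hdi(21, 11)    # posterior from prior #1
print(hdi1)
print(hdi1[["lower"]] < 0.5 && 0.5 < hdi1[["upper"]])   # is 0.5 inside?
```

A simpler alternative is the equal-tailed interval, qbeta(c(0.025, 0.975), 21, 11), which cuts 2.5% from each tail. For a skewed posterior like this one, the two intervals differ slightly.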

To be clear, the end result of our Bayesian analysis is the posterior distribution depicting the credibility of different values of our parameter. The decision to interpret this as sufficient or insufficient evidence for my outlandish claim is a decision that is separate from the Bayesian analysis proper. In contrast to NHST, the information we glean from Bayesian methods—the entire posterior distribution—is much richer. Another thing that makes Bayesian methods great is that you can make intuitive claims about the probability of hypotheses and parameter values in a way that frequentist NHST does not allow you to do.

What does the posterior using prior #2 look like? It's a beta distribution with alpha = 50 + 20 and beta = 50 + 10:

  > curve(dbeta(x, 70, 60),         # plot a beta distribution
  +       xlab="θ",                 # name x-axis
  +       ylab="posterior belief",  # name y-axis
  +       type="l",                 # make smooth line
  +       yaxt='n')                 # remove y axis labels
  > abline(v=.5, lty=2)             # make line at theta = 0.5

Figure 7.5: Posterior distribution of theta using prior #2
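Since each posterior is a full probability distribution, we can ask it direct questions. As a small sketch, the following compares the posterior probability that theta is greater than 0.5 under the two priors; the variable names are mine.

```r
# Posterior probability that theta exceeds chance (0.5) under each prior
p_above_flat    <- 1 - pbeta(0.5, 1 + 20, 1 + 10)     # prior #1: Beta(21, 11)
p_above_skeptic <- 1 - pbeta(0.5, 50 + 20, 50 + 10)   # prior #2: Beta(70, 60)
print(c(flat = p_above_flat, skeptic = p_above_skeptic))

# Posterior means show how far the skeptical prior pulls theta toward 0.5
post_means <- c(flat = 21 / (21 + 11), skeptic = 70 / (70 + 60))
print(post_means)
```

With the flat prior, roughly 96% of the posterior lies above 0.5. With the skeptic's prior, the same evidence leaves noticeably less of the posterior above 0.5, because 30 coin tosses count for little against such a strong prior belief.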
