It's time for us to put descriptive statistics down for the time being. It was fun for a while, but we're no longer content just determining the properties of observed data; now we want to start making inferences about data we haven't observed. This leads us to the realm of inferential statistics.
In data analysis, probability is used to quantify the uncertainty of our inferences about unobserved data. In the land of inferential statistics, probability reigns queen. Many regard her as a harsh mistress, but that's just a rumor.
Probability measures the likeliness that a particular event will occur. When mathematicians (us, for now!) speak of an event, we are referring to a set of potential outcomes of an experiment, or trial, to which we can assign a probability of occurrence.
Probabilities are expressed as a number between 0 and 1 (or as a percentage out of 100). An event with a probability of 0 denotes an impossible outcome, and a probability of 1 describes an event that is certain to occur.
The canonical example of probability at work is a coin flip. In the coin flip event, there are two outcomes: the coin lands on heads, or the coin lands on tails. Pretending that coins never land on their edge (they almost never do), those two outcomes are the only ones possible. The sample space (the set of all possible outcomes), therefore, is {heads, tails}. Since the entire sample space is covered by these two outcomes, they are said to be collectively exhaustive.
The sum of the probabilities of collectively exhaustive events is always 1. In this example, the probability that the coin flip will yield heads or yield tails is 1; it is certain that the coin will land on one of those. In a fair and correctly balanced coin, each of those two outcomes is equally likely. Therefore, we split the probability equally among the outcomes: in the event of a coin flip, the probability of obtaining heads is 0.5, and the probability of tails is 0.5 as well. This is usually denoted as follows:
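P(heads) = 0.5
P(tails) = 0.5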
The probability of a coin flip yielding either heads or tails looks like this:
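P(heads ∪ tails) = P(heads) + P(tails) = 1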
And the probability of a coin flip yielding both heads and tails is denoted as follows:
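P(heads ∩ tails) = 0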
The two outcomes, in addition to being collectively exhaustive, are also mutually exclusive. This means that they can never co-occur. This is why the probability of heads and tails is 0; it just can't happen.
The next obligatory application of beginner probability theory is the case of rolling a standard six-sided die. In the event of a die roll, the sample space is {1, 2, 3, 4, 5, 6}. With every roll of the die, we are sampling from this space. In this event, too, each outcome is equally likely, except now we have to divide the probability across six outcomes. In the following equation, we denote the probability of rolling a 1 as P(1):
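P(1) = 1/6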
Rolling a 1 or rolling a 2 is not collectively exhaustive (we can still roll a 3, 4, 5, or 6), but they are mutually exclusive; we can't roll a 1 and 2. If we want to calculate the probability of either one of two mutually exclusive events occurring, we add the probabilities:
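P(1 ∪ 2) = P(1) + P(2) = 1/6 + 1/6 = 1/3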
While rolling a 1 and rolling a 2 aren't collectively exhaustive, rolling a 1 and not rolling a 1 are. This is usually denoted in this manner:
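P(1 ∪ ¬1) = P(1) + P(¬1) = 1/6 + 5/6 = 1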
These two events—and all events that are both collectively exhaustive and mutually exclusive—are called complementary events.
Our last pedagogical example in basic probability theory uses a deck of cards. Our deck has 52 cards: 4 for each number from 2 to 10, and 4 each of Jack, Queen, King, and Ace (no Jokers!). Each card belongs to exactly one of four suits: Hearts, Clubs, Spades, or Diamonds. There are, therefore, 13 cards in each suit. Further, every Heart and Diamond is colored red, and every Spade and Club is black. From this, we can deduce the following probabilities for the outcome of randomly choosing a card:
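P(Heart) = 13/52 = .25
P(red) = 26/52 = .5
P(black) = 26/52 = .5
P(Ace) = 4/52 ≈ .077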
What, then, is the probability of getting a black card and an Ace? Well, these events are independent, meaning that the probability of either outcome does not affect the probability of the other. In cases like these, the probability of event A and event B is the product of the probability of A and the probability of B. Therefore:
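P(black ∩ Ace) = P(black) × P(Ace) = 26/52 × 4/52 = 2/52 ≈ .038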
Intuitively, this makes sense, because there are two black Aces out of a possible 52.
What about the probability that we choose a red card and a Heart? These two outcomes are not independent, because knowing that the card is red has a bearing on the likelihood that the card is also a Heart. In cases like these, the probability of events A and B is denoted as follows:
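P(A ∩ B) = P(B) P(A|B)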
where P(A|B) means the probability of A given B. For example, if we represent A as drawing a Heart and B as drawing a red card, P(A|B) means: what's the probability of drawing a Heart if we know that the card we drew was red? Since a red card is equally likely to be a Heart or a Diamond, P(A|B) is 0.5. Therefore:
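P(red ∩ Heart) = P(red) P(Heart|red) = .5 × .5 = .25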
In the preceding equation, we used the form P(B) P(A|B). Had we used the form P(A) P(B|A), we would have got the same answer:
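P(Heart) P(red|Heart) = .25 × 1 = .25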
So, these two forms are equivalent:
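P(A) P(B|A) = P(B) P(A|B)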
For kicks, let's divide both sides of the equation by P(B). That yields the following equivalence:
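P(A|B) = P(A) P(B|A) / P(B)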
This equation is known as Bayes' Theorem. It is very easy to derive, but its meaning and influence are profound. In fact, it is one of the most famous equations in all of mathematics.
Bayes' Theorem has been applied to, and proven useful in, an enormous number of different disciplines and contexts. It was used to help crack the German Enigma code during World War II, saving the lives of millions. It was also used recently, and famously, by Nate Silver to help correctly predict the voting patterns of 49 states in the 2008 US presidential election.
At its core, Bayes' Theorem tells us how to update the probability of a hypothesis in light of new evidence. Due to this, the following formulation of Bayes' Theorem is often more intuitive:
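P(H|E) = P(H) P(E|H) / P(E)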
where H is the hypothesis and E is the evidence.
Let's see an example of Bayes' Theorem in action!
There's a hot new recreational drug on the scene called Allighate (or Ally for short). It's named as such because it makes its users go wild and act like an alligator. Since the effect of the drug is so deleterious, very few people actually take the drug. In fact, only about 1 in every thousand people (0.1%) take it.
Frightened by fear-mongering late-night news, Daisy Girl, Inc., a technology consulting firm, ordered an Allighate testing kit for all of its 200 employees so that it could offer treatment to any employee who has been using it. Not sparing any expense, they bought the best kit on the market; it had 99% sensitivity and 99% specificity. This means that it correctly identified drug users 99 out of 100 times, and only falsely identified a non-user as a user once in every 100 times.
When the results finally came back, two employees tested positive. Though the two denied using the drug, their supervisor, Ronald, was ready to send them off to get help. Just as Ronald was about to send them off, Shanice, a clever employee from the statistics department, came to their defense.
Ronald incorrectly assumed that each of the employees who tested positive was using the drug with 99% certainty and, therefore, that the chance that both were using it was 98%. Shanice explained that it was actually far more likely that neither employee was using Allighate.
How so? Let's find out by applying Bayes' theorem!
Let's focus on just one employee right now; let H be the hypothesis that one of the employees is using Ally, and E represent the evidence that the employee tested positive.
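P(Ally user | Positive Test) = P(Ally user) P(Positive Test | Ally user) / P(Positive Test)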
We want to solve the left side of the equation, so let's plug in values. The first part of the right side of the equation, P(Positive Test | Ally user), is called the likelihood. The probability of testing positive if you use the drug is 99%; this is what tripped up Ronald, and most other people when they first hear of the problem. The second part, P(Ally user), is called the prior. This is our belief that any one person has used the drug before we receive any evidence. Since we know that only 0.1% of people use Ally, this is a reasonable choice for a prior. The denominator of the equation, P(Positive Test), is a normalizing constant, which ensures that the probabilities of all possible hypotheses sum to 1. Finally, the value we are trying to solve for, P(Ally user | Positive Test), is called the posterior. It is the probability of our hypothesis updated to reflect the new evidence.
In many practical settings, computing the normalizing constant is very difficult. In this case, because there are only two possible hypotheses, user or non-user, the probability of finding the evidence of a positive test is given as follows:
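P(Positive Test) = P(Positive Test | user) P(user) + P(Positive Test | non-user) P(non-user)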
which is (.99 × .001) + (.01 × .999) = 0.01098.
Plugging that into the denominator, our final answer is calculated as follows:
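P(Ally user | Positive Test) = (.99 × .001) / .01098 ≈ .09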
Note that the new evidence, which favored the hypothesis that the employee was using Ally, shifted our prior belief from .001 to a posterior of .09. Even so, our prior belief about whether an employee was using Ally was so extraordinarily low that it would take some very, very strong evidence indeed to convince us that an employee was an Ally user.
Ignoring the prior probability in cases like these is known as the base-rate fallacy. Shanice assuaged Ronald's embarrassment by assuring him that it was a very common mistake.
Now to extend this to two employees: the probability of any two employees both using the drug is, as we now know, .001 squared, or one in a million. Squaring our new posterior, we get .0081. The probability that both employees use Ally, even given their positive results, is less than 1%. So, they are exonerated.
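If you'd like to check the arithmetic yourself, here is a minimal sketch in Python that reproduces the calculation above (the variable names are our own, chosen for illustration; the numbers are the ones given in the problem):

# Sketch: reproducing the Allighate calculation above
prior = 0.001        # P(Ally user): 1 in 1,000 people use the drug
sensitivity = 0.99   # P(positive test | user)
specificity = 0.99   # P(negative test | non-user)

# Normalizing constant, by the law of total probability
p_positive = sensitivity * prior + (1 - specificity) * (1 - prior)

# Bayes' Theorem: P(user | positive test)
posterior = sensitivity * prior / p_positive

print(round(p_positive, 5))      # 0.01098
print(round(posterior, 4))       # 0.0902
print(round(posterior ** 2, 4))  # 0.0081, the two-employee case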
Sally is a different story, though. Her friends noticed her behavior had dramatically changed as of late—she snaps at co-workers and has taken to eating pencils. Her concerned cubicle-mate even followed her after work and saw her crawl into a sewer, not to emerge until the next day to go back to work.
Even though Sally passed the drug test, we know that it's likely (almost certain) that she uses Ally. Bayes' theorem gives us a way to quantify that probability!
Our prior is the same, but now our likelihood is pretty much as close to 1 as you can get. After all, how many non-Ally users do you think eat pencils and live in sewers?