Up until this point, when we spoke of distributions, we were referring to frequency distributions. However, when we talk about distributions later in the book—or when other data analysts refer to them—we will be talking about probability distributions, which are much more general.
It's easy to turn a categorical, discrete, or discretized frequency distribution into a probability distribution. As an example, refer to the frequency distribution of carburetors in the first image in this chapter. Instead of asking What number of cars have n number of carburetors?, we can ask, What is the probability that, if I choose a car at random, I will get a car with n carburetors?
We will talk more about probability (and different interpretations of probability) in Chapter 4, but for now, probability is a value between 0 and 1 (or 0 percent and 100 percent) that measures how likely an event is to occur. To answer the question What's the probability that I will pick a car with 4 carburetors?, the equation is:

probability of 4 carburetors = (number of cars with 4 carburetors) / (total number of cars) = 10/32 ≈ 0.31
You can find the probability of picking a car of any one particular number of carburetors as follows:
> table(mtcars$carb) / length(mtcars$carb)

      1       2       3       4       6       8
0.21875 0.31250 0.09375 0.31250 0.03125 0.03125
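If you only want that single probability, one quick way to compute it directly from the mtcars dataset (which ships with R) is:

> sum(mtcars$carb == 4) / nrow(mtcars)
[1] 0.3125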
Instead of making a bar chart of the frequencies, we can make a bar chart of the probabilities.
This is called a probability mass function (PMF). It looks the same, but now it maps from carburetors to probabilities, not frequencies. Figure 2.6a represents this.
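If you want to reproduce this kind of chart yourself, one way in base R is to hand the table of probabilities straight to barplot; the axis label here is my own choice:

> barplot(table(mtcars$carb) / length(mtcars$carb), ylab="probability")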
And, just as with the bar chart, we can easily tell that 2 and 4 are the numbers of carburetors most likely to be chosen at random.
We can do the same with discretized numeric variables as well. The next image represents the temperature histogram as a probability mass function.
Note that this PMF only describes the temperatures of NYC in the data we have.
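If you'd like to build a PMF like this yourself, here is a rough sketch using R's cut function to discretize the temperatures; the choice of 10 bins is arbitrary and only for illustration:

> temp.bins <- cut(airquality$Temp, breaks=10)
> temp.pmf <- table(temp.bins) / length(airquality$Temp)
> barplot(temp.pmf, ylab="probability")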
There's a problem here, though: this PMF is completely dependent on the size of the bins (our method of discretizing the temperatures). Imagine that we constructed the bins so that each one held only a single degree of temperature. In that case, we wouldn't be able to tell very much from the PMF at all, since each specific degree occurs only a few times, if at all, in the dataset. The same problem, only worse, happens when we try to describe continuous variables with probabilities without discretizing them at all. Imagine trying to visualize the probability (or the frequency) of the temperatures if they were measured to the thousandth place (for example, {90.167, 67.361, ...}). There would be no visible bars at all!
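To see the problem concretely, here is a sketch that treats every recorded degree in the airquality data as its own bin; each resulting bar is tiny, and the plot tells us very little:

> one.degree.pmf <- table(airquality$Temp) / length(airquality$Temp)
> barplot(one.degree.pmf, ylab="probability")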
What we need here is a probability density function (PDF). A probability density function will tell us the relative likelihood that we will experience a certain temperature. The next image shows a PDF that fits the temperature data that we've been playing with; it is analogous to, but better than, the histogram we saw in the beginning of the chapter and the PMF in the preceding figure.
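One way to draw a curve like this is with R's built-in kernel density estimate (we'll say more about how that works shortly); the plot labels are my own choices:

> plot(density(airquality$Temp), main="", xlab="Temperature")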
The first thing you'll notice about this new plot is that it is smooth, not jagged or boxy like the histogram and PMFs. This should make intuitive sense, because temperature is a continuous variable, and there are likely no sharp cutoffs in the probability of experiencing temperatures from one degree to the next.
The second thing you should notice is that the units and the values on the y axis have changed. The y axis no longer represents probabilities; it now represents probability densities. Though it may be tempting, you can't look at this function and answer the question What is the probability that it will be exactly 80 degrees? Technically, the probability of it being exactly 80.0000 degrees is vanishingly small, almost zero. But that's okay! Remember, we don't care what the probability of experiencing a temperature of exactly 80.0000 is; we just care about the probability of a temperature around there.
We can answer the question What's the probability that the temperature will fall within a particular range? The probability of experiencing a temperature between, say, 80 and 90 degrees is the area under the curve from 80 to 90. Those of you unfortunate readers who know calculus will recognize this as the integral, or anti-derivative, of the PDF evaluated over that range:

P(80 ≤ temperature ≤ 90) = ∫ f(x) dx, taken from x = 80 to x = 90

where f(x) is the probability density function.
The next image shows the area under the curve for this range in pink. You can immediately see that the region covers a lot of area—perhaps one third. According to R, it's about 34 percent.
> temp.density <- density(airquality$Temp)
> pdf <- approxfun(temp.density$x, temp.density$y, rule=2)
> integrate(pdf, 80, 90)
0.3422287 with absolute error < 7.5e-06
We don't get a probability density function from the sample for free. The PDF has to be estimated. The PDF isn't so much trying to convey the information about the sample we have as attempting to model the underlying distribution that gave rise to that sample.
To do this, we use a method called kernel density estimation. The specifics of kernel density estimation are beyond the scope of this book, but you should know that the density estimation is heavily governed by a parameter that controls the smoothness of the estimation. This is called the bandwidth.
How do we choose the bandwidth? Well, it's just like choosing the size to make the bins in a histogram: there's no right answer. It's a balancing act between reducing chance or noise in the model and not losing important information by smoothing over pertinent characteristics of the data. This is a tradeoff we will see time and time again throughout this text.
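One way to see this tradeoff for yourself is to re-run the density estimate with deliberately extreme bandwidths; the particular values below are arbitrary choices for illustration:

> plot(density(airquality$Temp, bw=0.5))  # undersmoothed: noisy and jagged
> plot(density(airquality$Temp, bw=10))   # oversmoothed: pertinent features get washed out
> plot(density(airquality$Temp))          # R's default bandwidth rule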
Anyway, the great thing about PDFs is that you don't have to know calculus to interpret them. Not only are PDFs a useful analytical tool, but they also make for a top-notch visualization of the shape of data.
By the way…
Remember when we were talking about modes, and I said that finding the mode of non-discretized continuously distributed data is a more complicated procedure than for discretized or categorical data? The mode for these types of univariate data is the peak of the PDF. So, in the temperature example, the mode is around 80 degrees.
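If you want to find that peak numerically rather than by eye, one way is to reuse the kernel density estimate from before and ask where its y values are largest:

> temp.density <- density(airquality$Temp)
> temp.density$x[which.max(temp.density$y)]  # roughly 80, matching the peak of the PDF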