In an earlier image, we saw three very different distributions, all with the same mean and median. I said then that we need to quantify variance to tell them apart. In the following image, there are three very different distributions, all with the same mean, median, and variance.
If you just rely on basic summary statistics to understand univariate data, you'll never get the full picture. It's only when we visualize it that we can clearly see, at a glance, whether there are any clusters or areas with a high density of data points, the number of clusters there are, whether there are outliers, whether there is a pattern to the outliers, and so on. When dealing with univariate data, the shape is the most important part (that's why this chapter is called Shape of Data!).
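To make this concrete, here is a minimal sketch in base R (not the code behind the figure) that builds three samples sharing the same mean, median, and variance while having very different shapes. The symmetrize helper and the particular generators are assumptions chosen for illustration: mirroring a sample around zero forces the mean and median to zero, and dividing by the standard deviation forces the variance to one.

```r
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical helper: mirror a sample around 0 (mean and median become 0),
# then rescale so the standard deviation (and thus the variance) is 1.
symmetrize <- function(v) {
  s <- c(v, -v)
  s / sd(s)
}

bell    <- symmetrize(rnorm(500))                  # one central hump
bimodal <- symmetrize(rnorm(500, mean=3, sd=0.5))  # two separate humps
flat    <- symmetrize(runif(500))                  # two flat blocks

samples <- list(bell=bell, bimodal=bimodal, flat=flat)

# The summary statistics agree (up to floating-point rounding)
stats <- sapply(samples, function(s) {
  c(mean=mean(s), median=median(s), variance=var(s))
})
print(round(stats, 10))

# The shapes do not; uncomment to compare the three histograms
# par(mfrow=c(1, 3)); for (n in names(samples)) hist(samples[[n]], main=n)
```

The table of statistics cannot tell the three samples apart, while the histograms do so immediately.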
We will be using ggplot2's qplot function to investigate these shapes and visualize these data. qplot (for quick plot) is the simpler cousin of the more expressive ggplot function. qplot makes it easy to produce handsome and compelling graphics using a consistent grammar. Additionally, many of the skills, lessons, and know-how from qplot are transferable to ggplot (for when we have to get more advanced).
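As a rough illustration of how qplot habits carry over, the sketch below builds a bar chart with the full ggplot interface. It assumes ggplot2 is installed and uses the built-in mtcars data set purely as an example. Note that recent ggplot2 releases (3.4.0 and later) mark qplot as deprecated, so the ggplot form is worth knowing either way.

```r
library(ggplot2)

# Quick form, as used in this chapter:
#   qplot(factor(carb), data=mtcars, geom="bar")
# The same chart spelled out with ggplot:
# data first, then the aesthetic mapping, then a geom layer
p <- ggplot(mtcars, aes(x=factor(carb))) +
  geom_bar()

# Typing p at the console (or calling print(p)) draws the plot
```

The pieces map one to one: the first qplot argument becomes the x aesthetic, data stays data, and geom="bar" becomes the geom_bar layer.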
What's ggplot2? Why are we using it?
There are a few plotting mechanisms for R, including the default one that comes with R (called base R). However, ggplot2 seems to be a lot of people's favorite. This is not unwarranted, given its wide use, excellent documentation, and consistent grammar.
Since the base R graphics subsystem is what I learned to wield first, I've become adept at using it. There are certain types of plots that I produce faster using base R, so I still use it on a regular basis (Figure 2.8 to Figure 2.10 were made using base R!).
Though we will be using ggplot2 for this book, feel free to go your own way when making your very own plots.
Most of the graphics in this section are going to take the following form:
> qplot(column, data=dataframe, geom=...)
where column is a particular column of the data frame dataframe, and the geom keyword argument specifies a geometric object; it controls the type of plot that we want. For visualizing univariate data, we don't have many options for geom. The three types that we will be using are bar, histogram, and density. Making a bar graph of the frequency distribution of the number of carburetors couldn't be easier:
> library(ggplot2)
> qplot(factor(carb), data=mtcars, geom="bar")
Using the factor function on the carb column makes the plot look better in this case.
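The heights of the bars are just the frequency counts of each distinct carb value. A small base R sketch (not from the original text) shows those counts and hints at why factor helps: carb is stored as a number, and converting it to a factor tells the plotting code to treat each observed value as its own category rather than as a position on a continuous axis.

```r
# Frequency of each distinct number of carburetors in the built-in mtcars data
carb_counts <- table(mtcars$carb)
print(carb_counts)

# carb is numeric; factor() turns it into discrete categories
class(mtcars$carb)            # "numeric"
levels(factor(mtcars$carb))   # "1" "2" "3" "4" "6" "8"
```

Because no car in the data has 5 or 7 carburetors, the factor version leaves no empty slots for those values on the x axis.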
We could, if we wanted to, make an unattractive and distracting plot by coloring all the bars a different color, as follows:
> qplot(factor(carb),
+       data=mtcars,
+       geom="bar",
+       fill=factor(carb),
+       xlab="number of carburetors")
We also relabeled the x axis (which is automatically set by qplot) to more informative text.
It's just as easy to make a histogram of the temperature data; the main difference is that we switch geom from bar to histogram:
> qplot(Temp, data=airquality, geom="histogram")
Why doesn't it look like the first histogram in the beginning of the chapter, you ask? Well, there are two reasons, both visible in the code I used for the first histogram: it sets the bin width explicitly (binwidth=5) and gives the bars a white outline (color=I("white")):
> qplot(Temp, data=airquality, geom="histogram",
+       binwidth=5, color=I("white"))
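To get a feel for what binwidth changes, the following base R sketch counts the temperature readings in 5-degree bins. The break points (55 to 100) are my own assumption for illustration; ggplot2 may align its bins differently, so the counts need not match the plot bar for bar. When no bin width is given, ggplot2 falls back to 30 bins, which is why the default histogram looks so much busier.

```r
temps <- airquality$Temp  # daily temperature readings, built-in data set

# Assumed break points: 5-degree-wide bins covering the observed range
breaks <- seq(55, 100, by=5)
stopifnot(min(temps) > min(breaks), max(temps) <= max(breaks))

# Count the readings per bin without drawing anything
binned <- hist(temps, breaks=breaks, plot=FALSE)

# One row per bin: left edge, right edge, and number of readings
bin_table <- data.frame(from=head(breaks, -1),
                        to=tail(breaks, -1),
                        count=binned$counts)
print(bin_table)
```

Wider bins smooth over small fluctuations at the cost of detail, so it is worth trying a few widths before settling on one.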
Making a plot of the approximation of the PDF is similarly simple:
> qplot(Temp, data=airquality, geom="density")
By itself, I think the preceding plot is rather unattractive. We can give it a little more flair as follows:
> qplot(Temp, data=airquality, geom="density",
+       adjust=.5,        # changes bandwidth
+       fill=I("pink"),
+       alpha=I(.5),      # adds transparency
+       main="density plot of temperature data")
Now that's a handsome plot!
Notice that we also made the bandwidth smaller than the default of 1 (which made the PDF more squiggly) and added a title to the plot with the main argument.
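The adjust argument scales the kernel bandwidth relative to the default. The sketch below uses base R's density function (assuming, for illustration, that it mirrors what the plot does closely enough) to show that adjust=.5 halves the bandwidth, which is what makes the curve more squiggly.

```r
temps <- airquality$Temp

d_default <- density(temps)             # adjust defaults to 1
d_half    <- density(temps, adjust=.5)  # bandwidth multiplied by 0.5

# Ratio of the bandwidths actually used by the two estimates
bw_ratio <- d_half$bw / d_default$bw
print(bw_ratio)
```

A smaller bandwidth lets each observation influence a narrower region, so the estimate follows local bumps more closely; a larger one smooths them away.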