Visualization methods

In an earlier image, we saw three very different distributions, all with the same mean and median. I said then that we need to quantify variance to tell them apart. In the following image, there are three very different distributions, all with the same mean, median, and variance.

Figure 2.10: Three PDFs with the same mean, median, and standard deviation

If you just rely on basic summary statistics to understand univariate data, you'll never get the full picture. It's only when we visualize it that we can clearly see, at a glance, whether there are any clusters or areas with a high density of data points, the number of clusters there are, whether there are outliers, whether there is a pattern to the outliers, and so on. When dealing with univariate data, the shape is the most important part (that's why this chapter is called Shape of Data!).
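
To make this concrete, here is a minimal sketch in base R that builds three samples with the same mean and standard deviation but very different shapes. It is not the code behind Figure 2.10. The sample size, the seed, the three distributions, and the helper name standardize are all assumptions chosen for illustration, and the medians here agree only approximately.

```r
# Sketch: three samples forced to share a mean and a standard deviation.
set.seed(42)  # arbitrary seed, for reproducibility only

# Hypothetical helper: rescale a sample to mean 0 and standard deviation 1.
standardize <- function(x) (x - mean(x)) / sd(x)

n <- 10000
samples <- list(
  normal  = standardize(rnorm(n)),
  uniform = standardize(runif(n)),
  bimodal = standardize(c(rnorm(n / 2, mean = -2, sd = 0.5),
                          rnorm(n / 2, mean =  2, sd = 0.5)))
)

# The mean and standard deviation agree across all three samples...
round(sapply(samples, function(x) c(mean = mean(x), sd = sd(x))), 3)

# ...but the estimated densities look nothing alike.
# The bimodal curve is drawn first because it has the tallest peaks.
plot(density(samples$bimodal), main = "Same mean and sd, different shapes")
lines(density(samples$normal), lty = 2)
lines(density(samples$uniform), lty = 3)
```

The summary table cannot tell the three samples apart, but the plot separates them immediately.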

We will be using ggplot2's qplot function to investigate these shapes and visualize these data. qplot (for quick plot) is the simpler cousin of the more expressive ggplot function. qplot makes it easy to produce handsome and compelling graphics using a consistent grammar. Additionally, many of the skills, lessons, and know-how from qplot are transferable to ggplot (for when we have to get more advanced).

Note

What's ggplot2? Why are we using it?

There are a few plotting mechanisms for R, including the default one that comes with R (called base R). However, ggplot2 seems to be a lot of people's favorite. This is not unwarranted, given its wide use, excellent documentation, and consistent grammar.

Since the base R graphics subsystem is what I learned to wield first, I've become adept at using it. There are certain types of plots that I produce faster using base R, so I still use it on a regular basis (Figure 2.8 to Figure 2.10 were made using base R!).

Though we will be using ggplot2 for this book, feel free to go your own way when making your very own plots.

Most of the graphics in this section are going to take the following form:

  > qplot(column, data=dataframe, geom=...)

where column is a particular column of the data frame dataframe, and the geom keyword argument specifies a geometric object, which controls the type of plot that we want. For visualizing univariate data, we don't have many options for geom. The three types that we will be using are bar, histogram, and density.

Making a bar graph of the frequency distribution of the number of carburetors couldn't be easier:

  > library(ggplot2)
  > qplot(factor(carb), data=mtcars, geom="bar")

Figure 2.11: Frequency distribution of the number of carburetors

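Using the factor function on the carb column makes the plot look better in this case. It tells qplot to treat the number of carburetors as a categorical variable rather than a continuous one, so the x axis shows only the values that actually occur (1, 2, 3, 4, 6, and 8) instead of a numeric scale with gaps at 5 and 7.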
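
If you want to check the heights of the bars, the same frequency distribution can be computed directly. This is a small sketch using base R's table function. The variable name carb_counts is just an illustrative choice.

```r
# Count how many cars in mtcars have each number of carburetors.
carb_counts <- table(mtcars$carb)
carb_counts
#  1  2  3  4  6  8
#  7 10  3 10  1  1
```

Each count corresponds to the height of one bar in Figure 2.11.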

We could, if we wanted to, make an unattractive and distracting plot by coloring all the bars a different color, as follows:

  > qplot(factor(carb),
  +       data=mtcars,
  +       geom="bar",
  +       fill=factor(carb),
  +       xlab="number of carburetors")

Figure 2.12: With color and label modification

We also relabeled the x axis (which is automatically set by qplot) to more informative text.
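
For comparison, here is a rough sketch of how the same plot might be written with the more expressive ggplot function. It assumes a reasonably recent ggplot2 is installed. Note that qplot has been marked as deprecated since ggplot2 3.4.0, so newer versions may print a warning when you call it. The variable name p is an illustrative choice.

```r
library(ggplot2)

# Map carb (as a categorical variable) to both the x position and the fill,
# let geom_bar count the rows in each category, and relabel the x axis.
p <- ggplot(mtcars, aes(x = factor(carb), fill = factor(carb))) +
  geom_bar() +
  xlab("number of carburetors")

print(p)
```

The pieces line up one to one with the qplot call: the data frame, the column, the geom, the fill, and the axis label.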

It's just as easy to make a histogram of the temperature data—the main difference is that we switch geom from bar to histogram:

  > qplot(Temp, data=airquality, geom="histogram")

Figure 2.13: Histogram of temperature data

Why doesn't it look like the first histogram at the beginning of the chapter, you ask? Well, there are two reasons:

  • I adjusted the bin width (size of the bins)
  • I added color to the outline of the bars

The code I used for the first histogram looked as follows:

  > qplot(Temp, data=airquality, geom="histogram",
  +       binwidth=5, color=I("white"))
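
To see what a bin width of 5 does to the data, we can count the readings in each bin ourselves. The following is a sketch using base R's hist function with plotting turned off. The bin edges (55, 60, ..., 100) are an assumption on my part, and ggplot2 may place its bin edges differently, so the counts need not match the plotted bars exactly. When no binwidth is given, ggplot2 falls back to 30 bins and prints a message suggesting that you pick a better value.

```r
# Assumed bin edges: 5-degree-wide bins covering the observed range (56 to 97).
breaks <- seq(55, 100, by = 5)

# plot = FALSE returns the binning without drawing anything.
bins <- hist(airquality$Temp, breaks = breaks, plot = FALSE)

# One row per bin: its lower edge, upper edge, and number of readings.
data.frame(lower = head(bins$breaks, -1),
           upper = bins$breaks[-1],
           count = bins$counts)
```

Wider bins smooth over detail and narrower bins expose more of it, which is why the choice of bin width changes the look of the histogram so much.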

Making a plot of the approximation of the PDF is similarly simple:

  > qplot(Temp, data=airquality, geom="density")

Figure 2.14: PDF of temperature data

By itself, I think the preceding plot is rather unattractive. We can give it a little more flair by:

  • Filling the curve pink
  • Adding a little transparency to the fill

  > qplot(Temp, data=airquality, geom="density",
  +       adjust=.5,       # changes bandwidth
  +       fill=I("pink"),
  +       alpha=I(.5),     # adds transparency
  +       main="density plot of temperature data")

Figure 2.15: Figure 2.14 with modifications

Now that's a handsome plot!

Notice that we also set adjust to .5, which made the bandwidth smaller than the default (adjust=1) and the PDF more squiggly, and that we added a title to the plot with the main argument.
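
The effect of adjust on the bandwidth can be checked outside of a plot. This sketch uses base R's density function, where adjust is a multiplier applied to the automatically chosen bandwidth. ggplot2 documents its adjust parameter as a multiplicative bandwidth adjustment as well, so I am assuming the two behave the same way here. The variable names are illustrative.

```r
# Kernel density estimates with the default bandwidth and with half of it.
d_default <- density(airquality$Temp)
d_half    <- density(airquality$Temp, adjust = 0.5)

# The bw component holds the bandwidth that was actually used.
c(default = d_default$bw, halved = d_half$bw)
```

A smaller bandwidth means that each data point influences a narrower stretch of the curve, so the estimate follows local bumps more closely.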
