We have now successfully imported the data and looked at some important high-level statistics that provided us with a basic understanding of what values are in the dataset and how frequently some features appear. With this recipe, we continue the exploration by looking at some of the fuel efficiency metrics over time and in relation to other data points.
The following steps will use both plyr and the graphing library, ggplot2, to explore the dataset:
We use the ddply function from the plyr package to take the vehicles data frame, aggregate rows by year, and then, for each group, compute the mean highway, city, and combined fuel efficiency. The result is then assigned to a new data frame, mpgByYr. Note that this is our first example of split-apply-combine. We split the data frame into groups by year, we apply the mean function to specific variables, and then we combine the results into a new data frame:
mpgByYr <- ddply(vehicles, ~year, summarise,
                 avgMPG = mean(comb08),
                 avgHghy = mean(highway08),
                 avgCity = mean(city08))
Next, we pass mpgByYr to the ggplot function, telling it to plot the avgMPG variable against the year variable, using points. In addition, we specify that we want axis labels, a title, and even a smoothed conditional mean (geom_smooth()) with its confidence band represented as a shaded region of the plot:
ggplot(mpgByYr, aes(year, avgMPG)) +
  geom_point() +
  geom_smooth() +
  xlab("Year") + ylab("Average MPG") + ggtitle("All cars")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
The preceding commands will give you the following plot:
table(vehicles$fuelType1)
##            Diesel       Electricity Midgrade Gasoline       Natural Gas
##              1025                56                41                57
##  Premium Gasoline  Regular Gasoline
##              8521             24587
We create a new data frame, gasCars, which only contains the rows of vehicles in which the fuelType1 variable is one among a subset of gasoline values, there is no secondary fuel type, and the vehicle is not a hybrid:
gasCars <- subset(vehicles,
                  fuelType1 %in% c("Regular Gasoline", "Premium Gasoline", "Midgrade Gasoline") &
                    fuelType2 == "" &
                    atvType != "Hybrid")
mpgByYr_Gas <- ddply(gasCars, ~year, summarise, avgMPG = mean(comb08))
ggplot(mpgByYr_Gas, aes(year, avgMPG)) +
  geom_point() +
  geom_smooth() +
  xlab("Year") + ylab("Average MPG") + ggtitle("Gasoline cars")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
The preceding commands will give you the following plot:
The displ variable, which represents the displacement of the engine in liters, is currently a string variable that we need to convert to a numeric variable:
typeof(gasCars$displ)
## "character"
gasCars$displ <- as.numeric(gasCars$displ)
ggplot(gasCars, aes(displ, comb08)) +
  geom_point() +
  geom_smooth()
## geom_smooth: method="auto" and size of largest group is >=1000, so using
## gam with formula: y ~ s(x, bs = "cs"). Use 'method = x' to change the
## smoothing method.
## Warning: Removed 2 rows containing missing values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).
The preceding commands will give you the following plot:
This scatter plot offers convincing evidence that there is a negative, or even inverse, relationship between engine displacement and fuel efficiency; thus, cars with smaller engines tend to be more fuel-efficient.
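One way to put a number on the relationship seen in the scatter plot is the Pearson correlation coefficient. The sketch below is only an illustration: the `toy` data frame and its values are invented and stand in for gasCars. It also mimics the character-to-numeric conversion from the step above, where a blank displacement string becomes NA, which is presumably the kind of row that ggplot2 reported as removed.

```r
# Toy stand-in for gasCars; the values are invented for illustration only.
toy <- data.frame(
  displ  = c("1.6", "2.0", "2.5", "", "3.5", "5.0"),  # stored as strings
  comb08 = c(32, 28, 25, 24, 20, 15),
  stringsAsFactors = FALSE
)

# A blank string cannot be parsed as a number, so it becomes NA.
toy$displ <- as.numeric(toy$displ)
sum(is.na(toy$displ))

# use = "complete.obs" drops rows containing NA before computing the
# correlation; a value below zero indicates a negative relationship.
r <- cor(toy$displ, toy$comb08, use = "complete.obs")
r
```

On the real data, the same call would be `cor(gasCars$displ, gasCars$comb08, use = "complete.obs")`. Keep in mind that a correlation coefficient only measures the linear part of the relationship.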
avgCarSize <- ddply(gasCars, ~year, summarise, avgDispl = mean(displ))
ggplot(avgCarSize, aes(year, avgDispl)) +
  geom_point() +
  geom_smooth() +
  xlab("Year") + ylab("Average engine displacement (l)")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 1 rows containing missing values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
The preceding commands will give you the following plot:
Using ddply, we create a new data frame, byYear, which contains both the average fuel efficiency and the average engine displacement by year:
byYear <- ddply(gasCars, ~year, summarise,
                avgMPG = mean(comb08),
                avgDispl = mean(displ))
head(byYear)
##   year   avgMPG avgDispl
## 1 1984 19.12162 3.068449
## 2 1985 19.39469       NA
## 3 1986 19.32046 3.126514
## 4 1987 19.16457 3.096474
## 5 1988 19.36761 3.113558
## 6 1989 19.14196 3.133393
The head function shows us that the resulting data frame has three columns: year, avgMPG, and avgDispl. To use the faceting capability of ggplot2 to display Average MPG and Avg engine displacement by year on separate but aligned plots, we must melt the data frame, converting it from what is known as a wide format to a long format:
library(reshape2)  # melt() comes from the reshape2 package
byYear2 <- melt(byYear, id = "year")
levels(byYear2$variable) <- c("Average MPG", "Avg engine displacement")
head(byYear2)
##   year    variable    value
## 1 1984 Average MPG 19.12162
## 2 1985 Average MPG 19.39469
## 3 1986 Average MPG 19.32046
## 4 1987 Average MPG 19.16457
## 5 1988 Average MPG 19.36761
## 6 1989 Average MPG 19.14196
If we use the nrow function, we can see that the byYear2 data frame has 62 rows and the byYear data frame has only 31. The two separate columns from byYear (avgMPG and avgDispl) have now been melted into one new column (value) in the byYear2 data frame. Note that the variable column in the byYear2 data frame serves to identify the column that the value represents.
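To make the wide-to-long conversion concrete, here is a minimal sketch that performs the same reshaping by hand in base R. The `wide` data frame and its numbers are invented for illustration, and the hand-rolled approach is an assumption made so that the example runs without extra packages; it is not how melt is implemented.

```r
# Toy wide-format data frame; the numbers are invented for illustration.
wide <- data.frame(
  year     = c(1984, 1985, 1986),
  avgMPG   = c(19.1, 19.4, 19.3),
  avgDispl = c(3.07, 3.10, 3.13)
)

# Wide to long by hand: stack each measure column under a shared "value"
# column and record which column it came from in "variable".
measures <- c("avgMPG", "avgDispl")
long <- do.call(rbind, lapply(measures, function(m) {
  data.frame(year = wide$year, variable = m, value = wide[[m]])
}))
long$variable <- factor(long$variable, levels = measures)

nrow(wide)  # 3
nrow(long)  # 6: one row per year per measure
long
```

The row count doubles because there are two measure columns, which mirrors the jump from 31 to 62 rows described above.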
ggplot(byYear2, aes(year, value)) +
  geom_point() +
  geom_smooth() +
  facet_wrap(~variable, ncol = 1, scales = "free_y") +
  xlab("Year") + ylab("")
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
## geom_smooth: method="auto" and size of largest group is <1000, so using
## loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 1 rows containing missing values (stat_smooth).
## Warning: Removed 1 rows containing missing values (geom_point).
The preceding commands will give you the following plot:
From this plot, we can see the following:
gasCars4 <- subset(gasCars, cylinders == "4")
ggplot(gasCars4, aes(factor(year), comb08)) +
  geom_boxplot() +
  facet_wrap(~trany2, ncol = 1) +
  theme(axis.text.x = element_text(angle = 45)) +
  labs(x = "Year", y = "MPG")
The preceding command will give you the following plot:
This time, ggplot2 was used to create box plots that help visualize the distribution of values (and not just a single value, such as a mean) for each year.
ggplot(gasCars4, aes(factor(year), fill = factor(trany2))) +
  geom_bar(position = "fill") +
  labs(x = "Year", y = "Proportion of cars", fill = "Transmission") +
  theme(axis.text.x = element_text(angle = 45)) +
  geom_hline(yintercept = 0.5, linetype = 2)
The preceding command will give you the following plot:
In step 9, it appears that manual transmissions are more efficient than automatic transmissions, and both exhibit the same increase, on average, since 2008. However, there is something odd here. There appear to be many very efficient cars (more than 40 MPG) with automatic transmissions in later years, and almost no manual transmission cars with similar efficiencies in the same time frame. The pattern is reversed in earlier years. Is there a change in the proportion of manual cars available each year? Yes. What are these very efficient cars? In the next section, we look at the makes and models of the cars in the database.
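The question about the changing proportion of manual cars can also be answered numerically rather than read off the stacked bar chart. The following is a minimal sketch under stated assumptions: the `toy` data frame is invented, and its trany2 labels ("Manual" and "Auto") are assumed for illustration rather than taken from the real gasCars4 data.

```r
# Toy stand-in for gasCars4; the rows are invented for illustration.
toy <- data.frame(
  year   = c(1990, 1990, 1990, 1990, 2010, 2010, 2010, 2010),
  trany2 = c("Manual", "Manual", "Manual", "Auto",
             "Manual", "Auto", "Auto", "Auto")
)

# Count cars per year and transmission type, then convert each row
# (margin = 1, that is, each year) to proportions that sum to 1.
counts <- table(toy$year, toy$trany2)
props  <- prop.table(counts, margin = 1)
props
```

This is the same quantity that geom_bar(position = "fill") displays: within each year, the share of cars with each transmission type.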
With this recipe, we threw you into the deep end of data analysis with R, using two very important R packages, plyr and ggplot2. Just as traditional software development has design patterns for common constructs, a few such patterns are emerging in the field of data science. One of the most notable is the split-apply-combine pattern highlighted by Dr. Hadley Wickham. In this strategy, you break up the problem into smaller, more manageable pieces by some variable. You then perform an operation on each of the resulting groups and combine the results into a new data structure. As you can see in this recipe, we used this strategy repeatedly, which let us examine the data from many different perspectives.
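The three stages of the pattern can be spelled out one at a time in base R. This is a sketch of the idea only, not of how ddply works internally; the `toy` data frame and the name `byYearToy` are invented for illustration.

```r
# Toy stand-in for the vehicles data; the values are invented.
toy <- data.frame(
  year   = c(1984, 1984, 1985, 1985, 1985),
  comb08 = c(18, 20, 19, 21, 20),
  displ  = c(3.0, 3.2, 2.8, NA, 3.0)
)

# Split: one data frame per year.
pieces <- split(toy, toy$year)

# Apply: summarise each piece. na.rm = TRUE skips missing displacements;
# without it, a single NA makes the whole group's mean NA.
summaries <- lapply(pieces, function(d) {
  data.frame(
    year     = d$year[1],
    avgMPG   = mean(d$comb08),
    avgDispl = mean(d$displ, na.rm = TRUE)
  )
})

# Combine: bind the per-year summaries back into one data frame.
byYearToy <- do.call(rbind, summaries)
rownames(byYearToy) <- NULL
byYearToy
```

The ddply calls in this recipe express the same three stages in a single call. The NA that appears for 1985 in the byYear data frame is the kind of result that na.rm = TRUE would avoid, provided that dropping missing values is acceptable for the analysis at hand.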
Beyond plyr, this recipe relied heavily on the ggplot2 library, which deserves additional exposition. We will refrain from providing an extensive ggplot2 tutorial, as there are a number of excellent tutorials available online. What matters is that you understand how ggplot2 allows you to construct complex statistical visualizations in such a terse fashion.
The ggplot2 library is an open source implementation for R of the foundational grammar of graphics by Wilkinson, Anand, and Grossman. The Grammar of Graphics attempts to decompose statistical data visualizations into component parts to better understand how such graphics are created. With ggplot2, Hadley Wickham takes these ideas and implements a layered approach, allowing the user to assemble complex visualizations from individual pieces very quickly. Take, for example, the first graph for this recipe, which shows the average fuel efficiency of all models of cars in a particular year over time:
ggplot(mpgByYr, aes(year, avgMPG)) + geom_point() + geom_smooth() + xlab("Year") + ylab("Average MPG") + ggtitle("All cars")
To construct this plot, we first tell ggplot the data frame that will serve as the data for the plot (mpgByYr), and then the aesthetic mappings that tell ggplot2 which variables will be mapped to visual characteristics of the plot. In this case, aes(year, avgMPG) implicitly specifies that year will be mapped to the x axis and avgMPG will be mapped to the y axis. geom_point() tells the library to plot the specified data as points, and a second geom, geom_smooth(), adds a smoothed mean with a shaded region showing its confidence interval (set to 0.95, by default) for the same data. Finally, the xlab(), ylab(), and ggtitle() functions are used to add labels to the plot. Thus, we can generate a complex, publication-quality graph in a single line of code; ggplot2 is capable of doing far more complex plots.
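The layered approach becomes easier to see when the plot is built up one piece at a time instead of in a single expression. The sketch below makes a few assumptions: the `toy` data frame is invented and stands in for mpgByYr, the smoothing method and level are written out explicitly rather than left to the defaults, and the code is guarded so that it is skipped when ggplot2 is not installed.

```r
# Toy stand-in for mpgByYr; the values are invented for illustration.
toy <- data.frame(
  year   = 1984:1993,
  avgMPG = c(19.1, 19.4, 19.3, 19.2, 19.4, 19.1, 19.0, 18.9, 18.9, 19.1)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)

  # Start with the data and the aesthetic mappings only; no geoms yet.
  p <- ggplot(toy, aes(x = year, y = avgMPG))

  # Add one layer at a time; each `+` returns a new plot object.
  p <- p + geom_point()
  p <- p + geom_smooth(method = "loess", level = 0.95)

  # Labels are added the same way as layers.
  p <- p + xlab("Year") + ylab("Average MPG") + ggtitle("All cars")
}
```

Nothing is drawn until the plot object is printed, for example with print(p). Because every intermediate value of p is itself a complete plot, you can inspect or reuse any stage.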
Also, it is important to note that ggplot2, and the grammar of graphics in general, does not tell you how best to visualize your data, but gives you the tools to do so rapidly. If you want more advice on this topic, we strongly recommend looking into the works of Edward Tufte, who has numerous books on the matter, including the classic The Visual Display of Quantitative Information, Graphics Press USA. Further, ggplot2 does not allow for dynamic data visualizations.