With the first set of questions asked and answered about this dataset, let's move on to additional analyses.
This recipe will investigate the makes and models of automobiles and how they have changed over time:
carsMake <- ddply(gasCars4, ~year, summarise,
                  numberOfMakes = length(unique(make)))
ggplot(carsMake, aes(year, numberOfMakes)) +
  geom_point() +
  labs(x = "Year", y = "Number of available makes") +
  ggtitle("Four cylinder cars")
We see in the following graph that there has been a decline in the number of makes available over this period, though there has been a small uptick in recent times:
uniqMakes <- dlply(gasCars4, ~year, function(x) unique(x$make))
commonMakes <- Reduce(intersect, uniqMakes)
commonMakes
## [1] "Ford"       "Honda"      "Toyota"     "Volkswagen" "Chevrolet"
## [6] "Chrysler"   "Nissan"     "Dodge"      "Mazda"      "Mitsubishi"
## [11] "Subaru"     "Jeep"
carsCommonMakes4 <- subset(gasCars4, make %in% commonMakes)
avgMPG_commonMakes <- ddply(carsCommonMakes4, ~year + make,
                            summarise, avgMPG = mean(comb08))
ggplot(avgMPG_commonMakes, aes(year, avgMPG)) +
  geom_line() +
  facet_wrap(~make, nrow = 3)
The preceding commands will give you the following plot:
In step 2, there is definitely some interesting magic at work, with a lot being done in only a few lines of code. This is both a beautiful and a problematic aspect of R. It is beautiful because it allows the concise expression of programmatically complex ideas, but it is problematic because R code can be quite inscrutable if you are not familiar with the particular library.
In the first line, we use dlply (not ddply) to take the gasCars4 data frame, split it by year, and then apply the unique function to the make variable. For each year, the list of unique available automobile makes is computed, and then dlply returns a list of these lists (one element per year). Note that it is dlply, and not ddply, because it takes a data frame (d) as input and returns a list (l) as output, whereas ddply takes a data frame (d) as input and outputs a data frame (d):
uniqMakes <- dlply(gasCars4, ~year, function(x) unique(x$make))
commonMakes <- Reduce(intersect, uniqMakes)
commonMakes
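To make the return-type difference concrete, here is a minimal sketch using a small made-up data frame (not the fuel economy data) to contrast dlply and ddply:

```r
library(plyr)

# Hypothetical toy data: one grouping variable and one value column
df <- data.frame(g = c("a", "a", "b"), v = 1:3)

# dlply: data frame in, list out (one element per group)
listOut <- dlply(df, ~g, function(x) x$v)

# ddply: data frame in, data frame out (one row per group)
frameOut <- ddply(df, ~g, summarise, s = sum(v))

is.list(listOut)        # TRUE
is.data.frame(frameOut) # TRUE
```

The naming convention generalizes across plyr: the first letter is the input type and the second is the output type.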
The next line is even more interesting. It uses the Reduce higher-order function; this is the same Reduce function and idea as in the MapReduce programming paradigm introduced by Google that underlies Hadoop. R is, in some ways, a functional programming language and offers several higher-order functions as part of its core. A higher-order function accepts another function as input. In this line, we pass the intersect function to Reduce, which applies intersect pairwise to each element in the list of unique makes per year that was created previously. Ultimately, this results in a single list of the automobile makes present in every year.
These two lines of code express a very simple concept (determining the automobile makes present in every year) that took two paragraphs to describe.
The final graph in this recipe is an excellent example of the faceted graphics capabilities of ggplot2. Adding + facet_wrap(~make, nrow = 3) tells ggplot2 that we want a separate set of axes for each make of automobile, with the subplots distributed across three rows. This is an incredibly powerful data visualization technique, as it allows us to clearly see patterns that might manifest only for a particular value of a variable.
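The same faceting idea can be tried on any data frame. Here is a minimal, self-contained sketch using the built-in mtcars data (not the fuel economy dataset), with one panel per cylinder count:

```r
library(ggplot2)

# One scatterplot panel per value of cyl, laid out in a single row.
# facet_wrap() builds the panels from the faceting formula; nrow controls
# how the panels are arranged.
p <- ggplot(mtcars, aes(wt, mpg)) +
  geom_point() +
  facet_wrap(~cyl, nrow = 1)
p
```

When the panels have very different ranges, facet_wrap() also accepts scales = "free_y" to give each panel its own y-axis.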
We kept things simple in this first data science project. The dataset itself was small: only 12 megabytes uncompressed, easily stored and handled on a basic laptop. We used R to import the dataset, check the integrity of some (but not all) of the data fields, and summarize the data. We then moved on to exploring the data by asking a number of questions and using two key libraries, plyr and ggplot2, to manipulate the data and visualize the results. In this data science pipeline, our final stage was simply the text that we wrote to summarize our conclusions and the visualizations produced by ggplot2.