Reshaping data in a flexible way

Hadley Wickham has written several R packages to tweak data structures, for example, a major part of his thesis concentrated on how to reshape data frames with his reshape package. Since then, this general aggregation and restructuring package has been renewed to be more efficient with the most commonly used tasks, and it was released with a new version number attached to the name: reshape2 package.

This was a total rewrite of the reshape package, which improves speed at the cost of functionality. Currently, the most important feature of reshape2 is the possibility to convert between the so-called long (narrow) and wide tabular data format. This basically pertains to the columns being stacked below each other, or arranged beside each other.

These features were presented in Hadley's works with the following image on data restructuring, with the related reshape functions and simple use cases:

Reshaping data in a flexible way

As the reshape package is not under active development anymore, and its parts were outsourced to reshape2, plyr, and most recently to dplyr, we will only focus on the commonly used features of reshape2 in the following pages. This will basically consist of the melt and cast functions, which provides a smart way of melting the data into a standardized form of measured and identifier variables (long table format), which can later be casted to a new shape for further analysis.

Converting wide tables to the long table format

Melting a data frame means that we transform the tabular data to key-value pairs, based on the given identifier variables. The original column names become the categories of the newly created variable column, while all numeric values of those (measured variables) are included in the new value column. Here's a quick example:

> library(reshape2)
> head(melt(hflights))
Using UniqueCarrier, TailNum, Origin, Dest, CancellationCode as id variables
  UniqueCarrier TailNum Origin Dest CancellationCode variable value
1            AA  N576AA    IAH  DFW                      Year  2011
2            AA  N557AA    IAH  DFW                      Year  2011
3            AA  N541AA    IAH  DFW                      Year  2011
4            AA  N403AA    IAH  DFW                      Year  2011
5            AA  N492AA    IAH  DFW                      Year  2011
6            AA  N262AA    IAH  DFW                      Year  2011

So, we have just restructured the original data.frame, which had 21 variables and a quarter of a million records, into only 7 columns and more than 3.5 million records. Six out of the seven columns are factor type identifier variables, and the last column stores all the values. But why is it useful? Why should we transform the traditional wide tabular format to the much longer type of data?

For example, we might be interested in comparing the distribution of flight time with the actual elapsed time of the flight, which might not be straightforward to plot with the original data format. Although plotting a scatter plot of the above variables with the ggplot2 package is extremely easy, how would you create two separate boxplots comparing the distributions?

The problem here is that we have two separate variables for the time measurements, while ggplot requires one numeric and one factor variable, from which the latter will be used to provide the labels on the x-axis. For this end, let's restructure our dataset with melt by specifying the two numeric variables to treat as measurement variables and dropping all other columns— or in other words, not having any identifier variables:

> hflights_melted <- melt(hflights, id.vars = 0,
+   measure.vars = c('ActualElapsedTime', 'AirTime'))
> str(hflights_melted)
'data.frame':  454992 obs. of  2 variables:
 $ variable: Factor w/ 2 levels "ActualElapsedTime",..: 1 1 1 1 1 ...
 $ value   : int  60 60 70 70 62 64 70 59 71 70 ...

Note

In general, it's not a good idea to melt a dataset without identifier variables, as casting it later becomes cumbersome, if not impossible.

Please note that now we have exactly twice as many rows than we had before, and the variable column is a factor with only two levels, which represent the two measurement variables. And this resulting data.frame is now easy to plot with the two newly created columns:

> library(ggplot2)
> ggplot(hflights_melted, aes(x = variable, y = value)) +
+   geom_boxplot()
Converting wide tables to the long table format

Well, the previous example might not seem mission critical, and to be honest, I first used the reshape package when I needed some similar transformation to be able to produce some nifty ggplot2 charts—as the previous problem simply does not exist if someone is using base graphics. For example, you can simply pass the two separate variables of the original dataset to the boxplot function.

So, this is kind of entering the world of Hadley Wickham's R packages, and the journey indeed offers some great data analysis practices. Thus, I warmly suggest reading further, for example, on how using ggplot2 is not easy, if not impossible, without knowing how to reshape datasets efficiently.

Converting long tables to the wide table format

Casting a dataset is the opposite of melting, like turning key-value pairs into a tabular data format. But bear in mind that the key-value pairs can always be combined together in a variety of ways, so this process can result in extremely diversified outputs. Thus, you need a table and a formula to cast, for example:

> hflights_melted <- melt(hflights, id.vars = 'Month',
+   measure.vars = c('ActualElapsedTime', 'AirTime'))
> (df <- dcast(hflights_melted, Month ~ variable,
+   fun.aggregate = mean, na.rm = TRUE))
   Month ActualElapsedTime  AirTime
1      1          125.1054 104.1106
2      2          126.5748 105.0597
3      3          129.3440 108.2009
4      4          130.7759 109.2508
5      5          131.6785 110.3382
6      6          130.9182 110.2511
7      7          130.4126 109.2059
8      8          128.6197 108.3067
9      9          128.6702 107.8786
10    10          128.8137 107.9135
11    11          129.7714 107.5924
12    12          130.6788 108.9317

This example shows how to aggregate the measured flight times for each month in 2011 with the help of melting and casting the hflights dataset:

  1. First, we melted the data.frame with the IDs being the Month, where we only kept two numeric variables for the flight times.
  2. Then, we casted the resulting data.frame with a simple formula to show the mean of each month for all measurement variables.

I am pretty sure that now you can quickly restructure this data to be able to plot two separate lines for this basic time-series:

> ggplot(melt(df, id.vars = 'Month')) +
+   geom_line(aes(x = Month, y = value, color = variable)) +
+   scale_x_continuous(breaks = 1:12) +
+   theme_bw() + 
+   theme(legend.position = 'top')
Converting long tables to the wide table format

But of course, melting and casting can be used for a variety of things besides aggregating. For example, we can restructure our original database to have a special Month, which includes all the records of the data. This, of course, doubles the number of rows in the dataset, but also lets us easily generate a table on the data with margins. Here's a quick example:

> hflights_melted <- melt(add_margins(hflights, 'Month'),
+    id.vars = 'Month',
+    measure.vars = c('ActualElapsedTime', 'AirTime'))
> (df <- dcast(hflights_melted, Month ~ variable,
+    fun.aggregate = mean, na.rm = TRUE))
   Month ActualElapsedTime  AirTime
1      1          125.1054 104.1106
2      2          126.5748 105.0597
3      3          129.3440 108.2009
4      4          130.7759 109.2508
5      5          131.6785 110.3382
6      6          130.9182 110.2511
7      7          130.4126 109.2059
8      8          128.6197 108.3067
9      9          128.6702 107.8786
10    10          128.8137 107.9135
11    11          129.7714 107.5924
12    12          130.6788 108.9317
13 (all)          129.3237 108.1423

This is very similar to what we have seen previously, but as an intermediate step, we have converted the Month variable to be factor with a special level, which resulted in the last line of this table. This row represents the overall arithmetic average of the related measure variables.

Tweaking performance

Some further good news on reshape2 is that data.table has decent support for melting and casting, with highly improved performance. Matt Dowle has published some benchmarks with a 5-10 percent improvement in the processing times of using cast and melt on data.table objects instead of the traditional data frames, which is highly impressive.

To verify these results on your own dataset, simply transform the data.frame objects to data.table before calling the reshape2 functions, as the data.table package already ships the appropriate S3 methods to extend reshape2.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.191.237.79