Rearranging data

Sometimes, we do not want to filter any part of the data (neither the rows, nor the columns), but the data is simply not in the most useful order due to convenience or performance issues, as we have seen, for instance, in Chapter 3, Filtering and Summarizing Data.

Besides the base sort and order functions, or providing the order of variables passed to the [ operator, we can also use some SQL-like solutions with the sqldf package, or query the data in the right format directly from the database. And the previously mentioned dplyr package also provides an effective method for ordering data. Let's sort the hflights data, based on the actual elapsed time for each of the quarter million flights:

> str(arrange(hflights, ActualElapsedTime))
'data.frame':  227496 obs. of  21 variables:
 $ Year             : int  2011 2011 2011 2011 2011 2011 ...
 $ Month            : int  7 7 8 9 1 4 5 6 7 8 ...
 $ DayofMonth       : int  24 25 13 21 3 29 9 21 8 2 ...
 $ DayOfWeek        : int  7 1 6 3 1 5 1 2 5 2 ...
 $ DepTime          : int  2005 2302 1607 1546 1951 2035 ...
 $ ArrTime          : int  2039 2336 1641 1620 2026 2110 ...
 $ UniqueCarrier    : chr  "WN" "XE" "WN" "WN" ...
 $ FlightNum        : int  1493 2408 912 2363 2814 2418 ...
 $ TailNum          : chr  "N385SW" "N12540" "N370SW" "N524SW" ...
 $ ActualElapsedTime: int  34 34 34 34 35 35 35 35 35 35 ...
 $ AirTime          : int  26 26 26 26 23 23 27 26 25 25 ...
 $ ArrDelay         : int  9 -8 -4 15 -19 20 35 -15 86 -9 ...
 $ DepDelay         : int  20 2 7 26 -4 35 45 -8 96 1 ...
 $ Origin           : chr  "HOU" "IAH" "HOU" "HOU" ...
 $ Dest             : chr  "AUS" "AUS" "AUS" "AUS" ...
 $ Distance         : int  148 140 148 148 127 127 148 ...
 $ TaxiIn           : int  3 3 4 3 4 4 5 3 5 4 ...
 $ TaxiOut          : int  5 5 4 5 8 8 3 6 5 6 ...
 $ Cancelled        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ CancellationCode : chr  "" "" "" "" ...
 $ Diverted         : int  0 0 0 0 0 0 0 0 0 0 ...

Well, it's pretty straightforward that flights departing to Austin are among the first few records shown. For improved readability, the above three R expressions can be called in a much nicer way with the pipe operator from the automatically imported magrittr package, which provides a simple way to pass an R object as the first argument of the subsequent R expression:

> hflights %>% arrange(ActualElapsedTime) %>% str

So, instead of nesting R functions, we can now start our R command with the core object and pass the results of each evaluated R expression to the next one in the chain. In most cases, this makes the code more convenient to read. Although most hardcore R programmers have already gotten used to reading the nested function calls from inside-out, believe me, it's pretty easy to get used to this nifty feature! Do not let me confuse you with the inspiring painting of René Magritte, which became the slogan, "This is not a pipe," and a symbol of the magrittr package:

Rearranging data

There is no limit to the number of chainable R expressions and objects one can have. For example, let's also filter a few cases and variables to see how easy it is to follow the data restructuring steps with dplyr:

> hflights %>%
+     arrange(ActualElapsedTime) %>%
+     select(ActualElapsedTime, Dest) %>%
+     subset(Dest != 'AUS') %>%
+     head %>%
+     str
'data.frame':  6 obs. of  2 variables:
 $ ActualElapsedTime: int  35 35 36 36 37 37
 $ Dest             : chr  "LCH" "LCH" "LCH" "LCH" ...

So, now we have filtered the original dataset a few times to see the closest airport after Austin, and the code is indeed easy to read and understand. This is a nice and efficient way to filter data, although some prefer to use nifty one-liners with the data.table package:

> str(head(data.table(hflights, key = 'ActualElapsedTime')[Dest !=
+   'AUS', c('ActualElapsedTime', 'Dest'), with = FALSE]))
Classes 'data.table' and 'data.frame':  6 obs. of  2 variables:
 $ ActualElapsedTime: int  NA NA NA NA NA NA
 $ Dest             : chr  "MIA" "DFW" "MIA" "SEA" ...
 - attr(*, "sorted")= chr "ActualElapsedTime"
 - attr(*, ".internal.selfref")=<externalptr>

Almost perfect! The only problem is that we got different results due to the missing values, which were ordered at the beginning of the dataset while we defined the data.table object to be indexed by ActualElapsedTime. To overcome this issue, let's drop the NA values, and instead of specifying the column names as strings along with forcing the with parameter to be FALSE, let's pass a list of column names:

> str(head(na.omit(
+   data.table(hflights, key = 'ActualElapsedTime'))[Dest != 'AUS',
+     list(ActualElapsedTime, Dest)]))
Classes 'data.table' and 'data.frame':  6 obs. of  2 variables:
 $ ActualElapsedTime: int  35 35 36 36 37 37
 $ Dest             : chr  "LCH" "LCH" "LCH" "LCH" ...
 - attr(*, "sorted")= chr "ActualElapsedTime"
 - attr(*, ".internal.selfref")=<externalptr>

This is exactly the same results as we have seen before. Please note that in this example, we have omitted the NA values after transforming data.frame to data.table, indexed by the ActualElapsedTime variable, which is a lot faster compared to calling na.omit on hflights first and then evaluating all the other R expressions:

> system.time(str(head(data.table(na.omit(hflights),
+   key = 'ActualElapsedTime')[Dest != 'AUS',
+     c('ActualElapsedTime', 'Dest'), with = FALSE])))
   user  system elapsed 
  0.374   0.017   0.390 
> system.time(str(head(na.omit(data.table(hflights,
+   key = 'ActualElapsedTime'))[Dest != 'AUS',
+     c('ActualElapsedTime', 'Dest'), with = FALSE])))
   user  system elapsed 
   0.22    0.00    0.22
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.118.189.251