Computing new variables

One of the most trivial actions we usually perform while restructuring a dataset is to create a new variable. For a traditional data.frame, it's as simple as assigning a vector to a new variable of the R object.

This method also works with data.table, but it is discouraged because there is a much more efficient way of creating one, or even multiple, columns in the dataset:

> hflights_dt <- data.table(hflights)
> hflights_dt[, DistanceKMs := Distance / 0.62137]

We have just computed the distances, in kilometers, between the origin and destination airports with a simple division. Users who need more rigorous unit handling can turn to the udunits2 package, which includes a bunch of conversion tools based on Unidata's udunits library.
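As a quick sanity check on the divisor, here is a minimal base R sketch comparing the division by 0.62137 with a multiplication by 1.609344, the number of kilometers in one international mile. The miles vector is made up for illustration and is not taken from hflights.

```r
# Hypothetical distances in miles (not from the hflights dataset)
miles <- c(100, 224, 944)

# Conversion used in the text: divide by the rounded miles-per-kilometer factor
kms_by_division <- miles / 0.62137

# Alternative: multiply by the number of kilometers in one international mile
kms_by_multiplication <- miles * 1.609344

# The two results agree up to the rounding of the 0.62137 factor
all.equal(kms_by_division, kms_by_multiplication, tolerance = 1e-5)
```

The small remaining difference comes only from 0.62137 being a rounded value.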

As seen above, data.table uses the special := assignment operator inside the square brackets, which might seem strange at first glance, but you will love it!

Note

According to the official data.table documentation, the := operator can be more than 500 times faster than the traditional <- assignment. This speedup comes from not copying the whole dataset in memory, as R used to do before version 3.1. Since then, R has used shallow copies, which greatly improved the performance of column updates, but it is still beaten by data.table's powerful in-place updates.

Compare how fast the preceding computation runs with the traditional <- operator and with data.table's := operator:

> system.time(hflights_dt$DistanceKMs <-
+   hflights_dt$Distance / 0.62137)
   user  system elapsed 
  0.017   0.000   0.016 
> system.time(hflights_dt[, DistanceKMs := Distance / 0.62137])
   user  system elapsed 
  0.003   0.000   0.002

This is impressive, right? But it's worth double checking what we've just done. The first, traditional call, of course, creates or updates the DistanceKMs variable, but what happens in the second call? The data.table syntax did not return anything (visibly), but in the background, the hflights_dt R object was updated in-place due to the := operator.
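To see the silent in-place update on something small, here is a minimal sketch with a toy table. The flights object and its values are made up for illustration and only stand in for hflights_dt.

```r
library(data.table)

# Toy stand-in for hflights_dt; the values are made up for illustration
flights <- data.table(Distance = c(224, 944, 1379))

# := adds the column by reference and returns the table invisibly,
# so nothing is printed and no reassignment with <- is needed
flights[, DistanceKMs := Distance / 0.62137]

# The new column is there even though we never reassigned flights
names(flights)
```

No value was assigned back to flights, yet the object now has two columns.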

Note

Please note that the := operator can produce unexpected results when used inside knitr, such as printing the data.table visibly after the creation of a new variable, or rendering the command strangely when the chunk is run with echo = TRUE. As a workaround, Matt Dowle suggests increasing the depthtrigger option of data.table, or one can simply reassign the data.table object to the same name. Another solution might be to use my pander package over knitr. :)

But once again, how was it so fast?

Memory profiling

The magic of the data.table package, besides having more than 50 percent of C code in the sources, is copying objects in memory only if it's truly necessary. R often copies objects in memory while updating them, whereas data.table tries to keep these resource-hungry actions to a minimum. Let's verify this by analyzing the previous example with the help of the pryr package, which provides convenient access to some helper functions for memory profiling and understanding R internals.

First, let's recreate the data.table object and take note of the pointer value (the address of the object in memory), so that we can verify later whether adding the new variable simply updated the same R object, or whether the object was copied in memory while the operation took place:

> library(pryr)
> hflights_dt <- data.table(hflights)
> address(hflights_dt)
[1] "0x62c88c0"

Okay, so 0x62c88c0 refers to the location where hflights_dt is stored at the moment. Now, let's check if it changes due to the traditional assignment operator:

> hflights_dt$DistanceKMs <- hflights_dt$Distance / 0.62137
> address(hflights_dt)
[1] "0x2c7b3a0"

This is definitely a different location, which means that adding a new column to the R object also requires R to copy the whole object in memory. Just imagine: we have now moved 21 columns in memory just to add one more.

Now, let's see how := behaves in data.table:

> hflights_dt <- data.table(hflights)
> address(hflights_dt)
[1] "0x8ca2340"
> hflights_dt[, DistanceKMs := Distance / 0.62137]
> address(hflights_dt)
[1] "0x8ca2340"

The location of the R object in memory did not change! Copying objects in memory can cost you a lot of resources, and thus a lot of time. Take a look at the following example, which is a slightly updated version of the traditional variable assignment call above, with the added convenience layer of within:

> system.time(within(hflights_dt, DistanceKMs <- Distance / 0.62137))
   user  system elapsed 
  0.027   0.000   0.027

Here, using the within function probably copies the R object once more in memory, which brings about the relatively serious performance overhead. The absolute time difference between the preceding examples might not seem very significant (not in the statistical sense), but just imagine how needless memory updates can affect the processing time of your data analysis with some larger datasets!
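The difference between the two approaches can also be checked on a toy table without pryr. The following is only a sketch: it assumes the address helper exported by data.table itself instead of the pryr one, and the flights object with its values is made up for illustration.

```r
library(data.table)

# Toy stand-in for hflights_dt; the values are made up for illustration
flights <- data.table(Distance = c(224, 944, 1379))
before <- address(flights)  # data.table's own address(), not pryr's

# within() returns a modified copy and leaves its input untouched
copied <- within(flights, DistanceKMs <- Distance / 0.62137)
"DistanceKMs" %in% names(flights)    # FALSE: the original has no new column
"DistanceKMs" %in% names(copied)     # TRUE: only the copy carries it

# := adds the column to the very same object
flights[, DistanceKMs := Distance / 0.62137]
identical(address(flights), before)  # TRUE: same location in memory
```

Note that the timed within call in the text discards its result, so hflights_dt itself is left as it was.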

Creating multiple variables at a time

One nice feature of data.table is the creation of multiple columns with a single command, which can be extremely useful in some cases. For example, we might be interested in the distance between airports in feet as well:

> hflights_dt[, c('DistanceKMs', 'DistanceFeet') :=
+   list(Distance / 0.62137, Distance * 5280)]

So, it's as simple as providing a character vector of the desired variable names on the left-hand side and the list of appropriate values on the right-hand side of the := operator. This feature can easily be used for some more complex tasks. For example, let's create the dummy variables of the airline carriers:

> carriers <- unique(hflights_dt$UniqueCarrier)
> hflights_dt[, paste('carrier', carriers, sep = '_') :=
+   lapply(carriers, function(x) as.numeric(UniqueCarrier == x))]
> str(hflights_dt[, grep('^carrier', names(hflights_dt)),
+   with = FALSE])
Classes 'data.table' and 'data.frame': 227496 obs. of  15 variables:
 $ carrier_AA: num  1 1 1 1 1 1 1 1 1 1 ...
 $ carrier_AS: num  0 0 0 0 0 0 0 0 0 0 ...
 $ carrier_B6: num  0 0 0 0 0 0 0 0 0 0 ...
 $ carrier_CO: num  0 0 0 0 0 0 0 0 0 0 ...
 $ carrier_DL: num  0 0 0 0 0 0 0 0 0 0 ...
 $ carrier_OO: num  0 0 0 0 0 0 0 0 0 0 ...
 $ carrier_UA: num  0 0 0 0 0 0 0 0 0 0 ...
 $ carrier_US: num  0 0 0 0 0 0 0 0 0 0 ...
 $ carrier_WN: num  0 0 0 0 0 0 0 0 0 0 ...
 $ carrier_EV: num  0 0 0 0 0 0 0 0 0 0 ...
 $ carrier_F9: num  0 0 0 0 0 0 0 0 0 0 ...
 $ carrier_FL: num  0 0 0 0 0 0 0 0 0 0 ...
 $ carrier_MQ: num  0 0 0 0 0 0 0 0 0 0 ...
 $ carrier_XE: num  0 0 0 0 0 0 0 0 0 0 ...
 $ carrier_YV: num  0 0 0 0 0 0 0 0 0 0 ...
 - attr(*, ".internal.selfref")=<externalptr>

Although it's not a one-liner, and it also introduces a helper variable, it's not that complex to see what we did:

  1. First, we saved the unique carrier names in a character vector.
  2. Then, we defined the names of the new variables with the help of that vector.
  3. We also iterated our anonymous function over this character vector, returning TRUE or FALSE depending on whether the UniqueCarrier value of each row matched the given carrier name.
  4. The resulting logical vector was converted to 0 or 1 through as.numeric.
  5. Finally, we checked the structure of all columns whose names start with carrier.

This is not perfect, as we usually leave out one label from the dummy variables to reduce redundancy. In the current situation, the last new column is simply a linear combination of the other newly created columns, thus information is duplicated. To this end, it's usually good practice to leave out, for example, the last category by passing -1 to the n argument of the head function.
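A minimal sketch of that suggestion follows, using a toy table with three made-up carrier codes instead of the full hflights_dt. Apart from the toy data, it reuses the same pattern as the preceding example.

```r
library(data.table)

# Toy stand-in for hflights_dt with three carriers (made-up rows)
flights <- data.table(UniqueCarrier = c("AA", "AS", "B6", "AA"))

# head(x, -1) drops the last element, so the last category gets no dummy
carriers <- head(unique(flights$UniqueCarrier), -1)

# Character vector of names on the left, list of columns on the right
flights[, paste('carrier', carriers, sep = '_') :=
  lapply(carriers, function(x) as.numeric(UniqueCarrier == x))]

names(flights)
# "UniqueCarrier" "carrier_AA" "carrier_AS"
```

Rows belonging to the omitted category (B6 here) are identified by having 0 in every dummy column.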

Computing new variables with dplyr

The usage of mutate from the dplyr package is very similar to that of the base within function, although mutate is a lot quicker than within:

> hflights <- hflights %>%
+     mutate(DistanceKMs = Distance / 0.62137)

If the analogy between mutate and within was not made obvious by the previous example, it's probably also useful to show the same example without using pipes:

> hflights <- mutate(hflights, DistanceKMs = Distance / 0.62137)
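To make the analogy concrete, here is a small sketch that runs both functions on the same toy data.frame and compares the results. The flights object and its values are made up for illustration and only stand in for hflights.

```r
library(dplyr)

# Toy data.frame standing in for hflights; the values are made up
flights <- data.frame(Distance = c(224, 944, 1379))

# Both calls return a new object; the input is left as it was
with_within <- within(flights, DistanceKMs <- Distance / 0.62137)
with_mutate <- mutate(flights, DistanceKMs = Distance / 0.62137)

# The two results hold the same columns and values
all.equal(with_within, with_mutate)

# flights itself is unchanged until the result is reassigned
names(flights)
```

This is why the examples in the text assign the result of mutate back to hflights.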