Using data.table for data manipulation

One of the most efficient packages for data mining and manipulation in R is data.table. Developed by Matt Dowle and Arun Srinivasan, data.table has consistently outperformed other contemporary R packages in general day-to-day data analysis operations.

The only caveat to using data.table is that its syntax for subsetting and other operations differs slightly from that of data.frame. That said, the benefits of using data.table greatly outweigh the small extra effort required to learn the package.

data.table can be installed and loaded using the following code:

install.packages("data.table")
library(data.table)

The general form of data.table operations is as follows:

dt[i, j, by]

And the following applies:

  • dt is the name of the data.table
  • i is the condition or row indices by which the data.table is subset
  • j represents the columns to be returned or the calculations to be performed
  • by specifies the columns by which the calculations in j are grouped
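As a minimal sketch of how the three arguments fit together, the following uses a small made-up table. The name sales and its columns are illustrative assumptions and are not part of the example used in the rest of this section:

```r
library(data.table)

# Hypothetical toy data, for illustration only
sales <- data.table(
  region = c("East", "East", "West", "West", "West"),
  amount = c(10, 20, 5, 15, 25)
)

# i:  keep only the rows where amount > 5
# j:  compute the row count and the total amount
# by: perform the calculation separately for each region
result <- sales[amount > 5, .(n = .N, total = sum(amount)), by = region]
print(result)
```

Reading the call from left to right gives "take these rows, compute this, grouped by that", which is the order in which the three arguments appear.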

We can create a new data.table using the data.table function as follows:

dstate <- data.table(state.x77,State=row.names(state.x77)) 

Note that the column State has been added because a data.table does not have row names (it uses row indices instead).

An interesting aspect of data.table is that operations can occur by reference without the need for copying the data. In base R, operations such as renaming columns may require copying the entire data.frame. By avoiding such copies, along with several other optimizations, data.table provides a substantial improvement in performance over most of the other data manipulation solutions in R today.
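The following sketch illustrates what "by reference" means in practice. The object names (dt, alias, independent) are made up for this illustration; the functions used (the := operator, setnames(), and copy()) are part of data.table:

```r
library(data.table)

dt <- data.table(a = 1:3, b = c("x", "y", "z"))

# Plain assignment does not copy a data.table:
# both names refer to the same underlying object
alias <- dt

# := adds a column by reference, so the change is visible through both names
alias[, a2 := a * 2L]
print(names(dt))

# setnames() also renames in place, without copying the table
setnames(dt, "b", "label")
print(names(alias))

# copy() creates an independent data.table when one is needed
independent <- copy(dt)
independent[, a2 := NULL]
print(names(dt))
```

Because changes made through one name are visible through the other, copy() is the usual safeguard when the original table must stay untouched.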

To select the first three rows, use the following:

dstate[1:3]

To select the rows where Income > 5000, use the following:

dstate[Income > 5000] 

To select the rows where Income > 5000 & `HS Grad` > 60, use the following:

dstate[Income > 5000 & `HS Grad` > 60] 

To select the columns Population, Income, Frost, and State where Income > 5000, use the following:

dstate[Income > 5000, list(Population, Income, Frost, State)] 

Note that we can also use the .() notation, which is shorthand for list(). In both cases, the result is returned as a data.table, as shown:

dstate[Income > 5000, .(Population, Income, Frost, State)] 
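As a quick check, the sketch below runs both forms and compares the results. The variable names with_list and with_dot are made up for this illustration:

```r
library(data.table)

dstate <- data.table(state.x77, State = row.names(state.x77))

# The same selection written with list() and with the .() shorthand
with_list <- dstate[Income > 5000, list(Population, Income, Frost, State)]
with_dot  <- dstate[Income > 5000, .(Population, Income, Frost, State)]

# Both results are data.tables, and their contents can be compared directly
print(is.data.table(with_list))
print(is.data.table(with_dot))
print(all.equal(with_list, with_dot))
```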

As stated before, the j value can also be used in order to perform calculations. For example, if instead of just returning the individual values for Population, Income, and Frost we wanted to get the mean of each, we can instead use the following:

dstate[Income > 5000, .(Mean_Pop=mean(Population), Mean_Inc=mean(Income), Mean_Frost=mean(Frost))] 

The .N notation can be used in order to get counts, as follows:

dstate[Income > 5000, .(Count=.N, Mean_Pop=mean(Population), Mean_Inc=mean(Income), Mean_Frost=mean(Frost))]  

Note that this way of selecting columns differs from data.frame, where column names are passed as quoted strings. In data.table, we can use the column names as they are. We can still select columns in the data.frame [row, column] style, using a character vector of names, by setting with=FALSE.

For example, to select the columns Population, Income, and State where Income > 5000 using the data.frame method, we can use the following:

dstate[Income > 5000, c("Population","Income","State"), with=FALSE]
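When the column names are stored in a variable, recent versions of data.table also accept a .. prefix in j as an alternative to the with argument. The sketch below assumes such a version is installed; the variable name cols is made up for this illustration:

```r
library(data.table)

dstate <- data.table(state.x77, State = row.names(state.x77))

# Column names held in a character vector
cols <- c("Population", "Income", "State")

# data.frame-style selection using the with argument
sel_with <- dstate[Income > 5000, cols, with = FALSE]

# The .. prefix tells data.table to look up cols in the calling scope
# rather than treating it as a column name
sel_dots <- dstate[Income > 5000, ..cols]

print(all.equal(sel_with, sel_dots))
```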

We can also use : (colon) to select a range of columns. For instance, to select the first three rows of all of the columns from HS Grad to State, we can use the following:

dstate[1:3, `HS Grad`:State]