Using rlist to work with nested data structures

In the previous chapter, you learned about both relational databases, which store data in tables, and non-relational databases, which support nested data structures. In R, the most commonly used nested data structure is the list. The previous sections all focused on manipulating tabular data. In this section, let's play with the rlist package I developed, which is designed for manipulating non-tabular data.

The design of rlist is very similar to that of dplyr. It provides mapping, filtering, selecting, sorting, grouping, and aggregating functionality for list objects. Run the following code to install the rlist package from CRAN:

install.packages("rlist") 

We have the non-tabular version of the product data stored in data/products.json. In this file, each product has a JSON representation as follows:

{ 
    "id": "T01", 
    "name": "SupCar", 
    "type": "toy", 
    "class": "vehicle", 
    "released": true, 
    "stats": { 
      "material": "Metal", 
      "size": 120, 
      "weight": 10 
    }, 
    "tests": { 
      "quality": null, 
      "durability": 10, 
      "waterproof": false 
    }, 
    "scores": [8, 9, 10, 10, 6, 5] 
  } 

All products are stored in a JSON array like [ {...}, {...} ]. Instead of storing data in different tables, we put everything relating to a product in one object. To work with data in this format, we can use rlist functions. First, let's load the rlist package:

library(rlist) 

To load the data into R as a list, we can use jsonlite::fromJSON() or simply list.load() provided by rlist:

products <- list.load("data/products.json") 
str(products[[1]]) 
## List of 8 
##  $ id      : chr "T01" 
##  $ name    : chr "SupCar" 
##  $ type    : chr "toy" 
##  $ class   : chr "vehicle" 
##  $ released: logi TRUE 
##  $ stats   :List of 3 
##   ..$ material: chr "Metal" 
##   ..$ size    : int 120 
##   ..$ weight  : int 10 
##  $ tests   :List of 3 
##   ..$ quality   : NULL 
##   ..$ durability: int 10 
##   ..$ waterproof: logi FALSE 
##  $ scores  : int [1:6] 8 9 10 10 6 5 

Now, products contains the information of all products. Each element of products represents a product with all related information.
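
As a minimal sketch of the jsonlite::fromJSON() route mentioned above, the following parses an inline JSON array instead of the data file. The two products and the variable names are made up for illustration, and simplifyDataFrame = FALSE is an assumption about the settings needed to keep the array of objects as a list of lists rather than a data frame:

```r
library(jsonlite)

# A hypothetical two-product JSON array, inlined so that no file is needed
json <- '[
  {"id": "T01", "name": "SupCar",   "released": true,  "scores": [8, 9, 10]},
  {"id": "T02", "name": "SupPlane", "released": false, "scores": [9, 10]}
]'

# Keep the array of objects as a list of lists instead of a data frame
items <- fromJSON(json, simplifyDataFrame = FALSE)

length(items)       # number of products
items[[1]]$name     # a field of the first product
items[[2]]$scores   # the scores array becomes an atomic vector
```

Each element of items can then be inspected with str(), just like products[[1]].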

To evaluate an expression within the context of each element, we can call list.map():

str(list.map(products, id)) 
## List of 6 
##  $ : chr "T01" 
##  $ : chr "T02" 
##  $ : chr "M01" 
##  $ : chr "M02" 
##  $ : chr "M03" 
##  $ : chr "M04" 

It iteratively evaluates id on each element of products and returns a new list containing all the corresponding results. The list.mapv() function works in the same way but simplifies the result to a vector:

list.mapv(products, name) 
## [1] "SupCar"    "SupPlane"  "JeepX"     "AircraftX" 
## [5] "Runner"    "Dancer" 
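
To make "evaluating an expression within the context of each element" more concrete, here is a rough base R counterpart. The items list is a hypothetical miniature of products, and the comparison is only a sketch of the idea, not a description of how rlist is implemented:

```r
library(rlist)

# Hypothetical miniature of the product list
items <- list(
  list(id = "T01", name = "SupCar", scores = c(8, 9, 10)),
  list(id = "M01", name = "JeepX",  scores = c(6, 8, 7))
)

# rlist: the fields of each element (here, scores) are directly in scope
means_rlist <- list.mapv(items, mean(scores))

# Base R: each element is passed to a function and accessed explicitly
means_base <- vapply(items, function(p) mean(p$scores), numeric(1))

means_rlist
means_base
```

Both calls compute one mean per product; the rlist version simply saves us from writing the anonymous function and the p$ prefix.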

To filter products, we can call list.filter() with logical conditions. All elements of products for which the conditions yield TRUE will be picked out:

released_products <- list.filter(products, released) 
list.mapv(released_products, name) 
## [1] "SupCar"    "JeepX"     "AircraftX" "Runner" 
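
The condition does not have to be a single field. As a sketch with a made-up items list, the following filters on a compound expression and compares the result with base R's Filter():

```r
library(rlist)

# Hypothetical miniature of the product list
items <- list(
  list(name = "SupCar",   released = TRUE,  scores = c(8, 9, 10)),
  list(name = "SupPlane", released = FALSE, scores = c(9, 10, 10)),
  list(name = "JeepX",    released = TRUE,  scores = c(6, 8, 7))
)

# Keep released products whose mean score is at least 8
kept <- list.filter(items, released && mean(scores) >= 8)
kept_names <- list.mapv(kept, name)

# Base R counterpart using an explicit predicate function
kept_base <- Filter(function(p) p$released && mean(p$scores) >= 8, items)
kept_base_names <- vapply(kept_base, function(p) p$name, character(1))

kept_names
kept_base_names
```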

Note that rlist functions have a design similar to dplyr functions, that is, the input data is always the first argument. We can, thus, use the %>% pipeline operator (provided by the magrittr package and also available once dplyr is loaded) to pipe the results forward:

products %>% 
  list.filter(released) %>% 
  list.mapv(name) 
## [1] "SupCar"    "JeepX"     "AircraftX" "Runner" 
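
The pipeline is only a different way of writing nested calls. The following sketch, which assumes the magrittr package is installed and uses a made-up items list, checks that both forms give the same result:

```r
library(rlist)
library(magrittr)  # assumed here as the source of %>%

# Hypothetical miniature of the product list
items <- list(
  list(name = "SupCar",   released = TRUE),
  list(name = "SupPlane", released = FALSE),
  list(name = "JeepX",    released = TRUE)
)

# Piped form: read from top to bottom
piped <- items %>%
  list.filter(released) %>%
  list.mapv(name)

# Nested form: read from the inside out
nested <- list.mapv(list.filter(items, released), name)

identical(piped, nested)
```

The piped form tends to stay readable as more steps are added, which is why the remaining examples use it.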

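We can use list.select() to select the given fields of each element of the input list. In the following example, list.filter() is also given two conditions, and only the elements that satisfy both are kept: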

products %>% 
  list.filter(released, tests$waterproof) %>% 
  list.select(id, name, scores) %>% 
  str() 
## List of 3 
##  $ :List of 3 
##   ..$ id    : chr "M01" 
##   ..$ name  : chr "JeepX" 
##   ..$ scores: int [1:6] 6 8 7 9 8 6 
##  $ :List of 3 
##   ..$ id    : chr "M02" 
##   ..$ name  : chr "AircraftX" 
##   ..$ scores: int [1:7] 9 9 10 8 10 7 9 
##  $ :List of 3 
##   ..$ id    : chr "M03" 
##   ..$ name  : chr "Runner" 
##   ..$ scores: int [1:10] 6 7 5 6 5 8 10 9 8 9 

We can also create new fields in list.select() based on the existing fields:

products %>% 
  list.filter(mean(scores) >= 8) %>% 
  list.select(name, scores, mean_score = mean(scores)) %>% 
  str() 
## List of 3 
##  $ :List of 3 
##   ..$ name      : chr "SupCar" 
##   ..$ scores    : int [1:6] 8 9 10 10 6 5 
##   ..$ mean_score: num 8 
##  $ :List of 3 
##   ..$ name      : chr "SupPlane" 
##   ..$ scores    : int [1:5] 9 9 10 10 10 
##   ..$ mean_score: num 9.6 
##  $ :List of 3 
##   ..$ name      : chr "AircraftX" 
##   ..$ scores    : int [1:7] 9 9 10 8 10 7 9 
##   ..$ mean_score: num 8.86 

We can also sort the list elements by certain fields or values using list.sort() and stack all elements into a data frame using list.stack():

products %>% 
  list.select(name, mean_score = mean(scores)) %>% 
  list.sort(-mean_score) %>% 
  list.stack() 
##        name mean_score 
## 1  SupPlane   9.600000 
## 2 AircraftX   8.857143 
## 3    SupCar   8.000000 
## 4    Dancer   7.833333 
## 5     JeepX   7.333333 
## 6    Runner   7.300000 
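
When a pipeline gives an unexpected result, it can help to run it one step at a time. The sketch below does so on a made-up three-product list and cross-checks the ordering against a base R computation. The intermediate variable names are only illustrative:

```r
library(rlist)

# Hypothetical miniature of the product list
items <- list(
  list(name = "SupCar",   scores = c(8, 9, 10, 10, 6, 5)),
  list(name = "SupPlane", scores = c(9, 9, 10, 10, 10)),
  list(name = "JeepX",    scores = c(6, 8, 7, 9, 8, 6))
)

# Step by step, so that each intermediate list can be inspected with str()
selected <- list.select(items, name, mean_score = mean(scores))
sorted   <- list.sort(selected, -mean_score)   # descending mean score
result   <- list.stack(sorted)                 # one row per element

# Base R cross-check of the same ordering
means      <- vapply(items, function(p) mean(p$scores), numeric(1))
base_names <- vapply(items, function(p) p$name, character(1))[order(-means)]

result
base_names
```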

To group a list, we can call list.group(), which makes a nested list in which the elements are divided by the values of the given field:

products %>% 
  list.select(name, type, released) %>% 
  list.group(type) %>% 
  str() 
## List of 2 
##  $ model:List of 4 
##   ..$ :List of 3 
##   .. ..$ name    : chr "JeepX" 
##   .. ..$ type    : chr "model" 
##   .. ..$ released: logi TRUE 
##   ..$ :List of 3 
##   .. ..$ name    : chr "AircraftX" 
##   .. ..$ type    : chr "model" 
##   .. ..$ released: logi TRUE 
##   ..$ :List of 3 
##   .. ..$ name    : chr "Runner" 
##   .. ..$ type    : chr "model" 
##   .. ..$ released: logi TRUE 
##   ..$ :List of 3 
##   .. ..$ name    : chr "Dancer" 
##   .. ..$ type    : chr "model" 
##   .. ..$ released: logi FALSE 
##  $ toy  :List of 2 
##   ..$ :List of 3 
##   .. ..$ name    : chr "SupCar" 
##   .. ..$ type    : chr "toy" 
##   .. ..$ released: logi TRUE 
##   ..$ :List of 3 
##   .. ..$ name    : chr "SupPlane" 
##   .. ..$ type    : chr "toy" 
##   .. ..$ released: logi FALSE 
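
Since the grouped result is an ordinary named list, each group can be processed further with the same rlist functions or with base R. The following sketch uses a made-up items list:

```r
library(rlist)

# Hypothetical miniature of the product list
items <- list(
  list(name = "SupCar", type = "toy"),
  list(name = "JeepX",  type = "model"),
  list(name = "Runner", type = "model")
)

groups <- list.group(items, type)

# Each group is itself a list of products, accessible by the group's name
model_names <- list.mapv(groups$model, name)
toy_names   <- list.mapv(groups$toy, name)

# Number of products in each group
group_sizes <- lengths(groups)

model_names
toy_names
group_sizes
```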

The rlist package also provides many other functions that try to make non-tabular data manipulation easier. For example, list.table() enhances table() to directly work with a list of elements:

products %>% 
  list.table(type, class) 
##        class 
## type    people vehicle 
##   model      2       2 
##   toy        0       2 

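It also supports multi-dimensional tables. Each argument is evaluated in the context of each element of the input list, so a dimension can be a named expression such as tests$waterproof rather than a plain field: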

products %>% 
  list.filter(released) %>% 
  list.table(type, waterproof = tests$waterproof) 
##        waterproof 
## type    FALSE TRUE 
##   model     0    3 
##   toy       1    0 
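
To see what list.table() saves us from writing, the following sketch builds the same kind of contingency table with base table() by extracting each field first. The items list is made up for illustration:

```r
library(rlist)

# Hypothetical miniature of the product list
items <- list(
  list(type = "toy",   class = "vehicle"),
  list(type = "model", class = "vehicle"),
  list(type = "model", class = "people")
)

# rlist: fields are named directly
tab_rlist <- list.table(items, type, class)

# Base table(): each field is extracted into a vector first
tab_base <- table(
  type  = list.mapv(items, type),
  class = list.mapv(items, class)
)

tab_rlist
tab_base
```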

Although the data is stored in a non-tabular form, we can easily perform comprehensive data manipulation and present the results in tabular form. For example, suppose we need to find the two products with the highest mean scores among those with at least five scores, and compute the mean score and the number of scores for each of them.

We can decompose such a task into smaller data manipulation subtasks, each of which can be easily done by rlist functions. Due to the number of steps involved in the data operations, we will use a pipeline to organize the workflow:

products %>% 
  list.filter(length(scores) >= 5) %>% 
  list.sort(-mean(scores)) %>% 
  list.take(2) %>% 
  list.select(name,  
    mean_score = mean(scores), 
    n_score = length(scores)) %>% 
  list.stack() 
##        name mean_score n_score 
## 1  SupPlane   9.600000       5 
## 2 AircraftX   8.857143       7 

The code looks straightforward, and it is easy to predict or analyze what happens in each step. If the final result can be represented in tabular form, we can call list.stack() to bind all list elements together into a data frame.

To learn more about rlist functions, read the rlist tutorial (https://renkun.me/rlist-tutorial/). There are other packages that deal with nested data structures but may have a different philosophy, such as purrr (https://github.com/hadley/purrr). If you are interested, visit their websites to learn more.
