Performing MapReduce operations

MapReduce is a programming model for performing operations (mainly aggregation) on data sets distributed across multiple servers in a cluster. The model was popularized by Google, which ran it on top of the Google File System, and it was later adopted by the open source Hadoop project.

MapReduce works by processing the data on each server and then combining the partial results to form a result set. It divides the work into two operations, namely Map and Reduce.

  • Map: This performs the transformation of the elements in the group or individual sequence
  • Reduce: This performs the aggregation and combines the results from Map into a meaningful result set
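
To make the two operations concrete, here is a minimal in-memory sketch in plain JavaScript. It is not RethinkDB code, and the users array is sample data assumed purely for illustration.

```javascript
// Sample data, assumed for illustration only.
const users = [
  { name: "John", age: 24 },
  { name: "Mary", age: 32 },
  { name: "Michael", age: 28 }
];

// Map: transform each element (here, pick the age of each user).
const ages = users.map(function (user) {
  return user.age;
});

// Reduce: combine the mapped values into a single result (here, their sum).
const totalAge = ages.reduce(function (left, right) {
  return left + right;
});

console.log(totalAge); // 84
```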

In RethinkDB, MapReduce queries operate in three steps as follows:

  • Group operation: To process the data into groups. This step is optional
  • Map operation: To transform the data or group of data into a sequence
  • Reduce operation: To aggregate the sequence data to form a result set

So this is essentially a Group-MapReduce (GMR) operation. RethinkDB spreads the MapReduce query across the servers of the cluster in order to improve efficiency. There are specific commands to perform each GMR step; however, RethinkDB has already integrated them internally into some aggregate functions in order to simplify the process.
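
The three steps can be sketched in memory with plain JavaScript. This is only an illustration of the idea under stated assumptions: groupMapReduce is a hypothetical helper, not a RethinkDB API, and sampleUsers is made-up data modeled on the users table.

```javascript
// Sample data modeled on the users table; assumed for illustration only.
const sampleUsers = [
  { name: "John", age: 24 },
  { name: "John", age: 24 },
  { name: "Mary", age: 32 },
  { name: "Michael", age: 28 }
];

// groupMapReduce is a hypothetical helper, not part of RethinkDB.
function groupMapReduce(docs, keyFn, mapFn, reduceFn) {
  const groups = new Map();
  docs.forEach(function (doc) {
    const key = keyFn(doc);   // Group: pick the bucket for this document
    const value = mapFn(doc); // Map: transform the document
    // Reduce: fold the mapped value into the running result of its group
    groups.set(key, groups.has(key) ? reduceFn(groups.get(key), value) : value);
  });
  // Shape the output like a grouped result: { group, reduction }
  return Array.from(groups, function (entry) {
    return { group: entry[0], reduction: entry[1] };
  });
}

const ageByName = groupMapReduce(
  sampleUsers,
  function (user) { return user.name; },
  function (user) { return user.age; },
  function (left, right) { return left + right; }
);

console.log(JSON.stringify(ageByName));
```

For the sample data this prints [{"group":"John","reduction":48},{"group":"Mary","reduction":32},{"group":"Michael","reduction":28}].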

Let us perform some aggregation operations in RethinkDB.

Grouping the data

To group the data on the basis of a field, we can use the group() ReQL function. Here is a sample query on our users table to group the data on the basis of name:

rethinkdb.table("users").group("name").run(connection, function(err, cursor) {
  if (err) {
    throw new Error(err);
  }
  cursor.toArray(function(err, data) {
    console.log(JSON.stringify(data));
  });
});

Here is the output for the same:

[ 
   { 
      "group":"John", 
      "reduction":[ 
         { 
            "age":24, 
            "id":"664fced5-c7d3-4f75-8086-7d6b6171dedb", 
            "name":"John" 
         }, 
         { 
            "address":{ 
               "address1":"suite 300", 
               "address2":"Broadway", 
               "map":{ 
                  "latitude":"116.4194W", 
                  "longitude":"38.8026N" 
               }, 
               "state":"Navada", 
               "street":"51/A" 
            }, 
            "age":24, 
            "id":"f6f1f0ce-32dd-4bc6-885d-97fe07310845", 
            "name":"John" 
         } 
      ] 
   }, 
   { 
      "group":"Mary", 
      "reduction":[ 
         { 
            "age":32, 
            "id":"c8e12a8c-a717-4d3a-a057-dc90caa7cfcb", 
            "name":"Mary" 
         } 
      ] 
   }, 
   { 
      "group":"Michael", 
      "reduction":[ 
         { 
            "age":28, 
            "id":"4228f95d-8ee4-4cbd-a4a7-a503648d2170", 
            "name":"Michael" 
         } 
      ] 
   } 
] 

If you observe the query response, the data is grouped by name, and each group is represented by one document. Every document matching the group resides under its reduction array. To work on the grouped result as a whole, you can use the ungroup() ReQL function, which takes a grouped stream of data and converts it into an array of objects, each with a group and a reduction field. This is useful for performing operations such as sorting on the grouped values.
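
As a sketch of how ungroup() can be combined with sorting, the following function counts the users per name and lists the most common names first. The function name countUsersPerName is hypothetical, and the driver module and an open connection are assumed to be passed in by the caller.

```javascript
// rethinkdb (the driver module) and connection (an open connection) are
// supplied by the caller; countUsersPerName is a hypothetical helper name.
function countUsersPerName(rethinkdb, connection, callback) {
  rethinkdb.table("users")
    .group("name")                        // one group per distinct name
    .count()                              // reduce each group to its size
    .ungroup()                            // array of { group, reduction }
    .orderBy(rethinkdb.desc("reduction")) // most common names first
    .run(connection, callback);
}
```

With the four documents shown in the output above, the John group, with a reduction of 2, would be listed first.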

Counting the data

We can count the number of documents present in a table, or the number of elements in a nested field of a document, using the count() method. Here is a simple example:

rethinkdb.table("users").count().run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});

It should return the number of documents present in the table. You can also use it to count the elements of a nested field by selecting the field and calling count() at the end.

Sum

We can add up a sequence of numbers using sum(). If the sequence already consists of numbers, such as the result of an expression, sum() adds them directly; otherwise, it adds up the values of the field provided in the query.

For example, to find the sum of the ages of all users:

rethinkdb.table("users")("age").sum().run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});

You can of course use an expression to perform a math operation like this:

rethinkdb.expr([1,3,4,8]).sum().run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});

This should return 16.

Avg

This computes the average of the given sequence of numbers, or of the values of the field provided in the query. For example, look at the following code:

rethinkdb.expr([1,3,4,8]).avg().run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});

This should return 4.
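
Both sum() and avg() also accept a field name directly, which is the field form mentioned in these sections. The following is a minimal sketch rather than a definitive implementation: totalAndAverageAge is a hypothetical helper, and the driver module and an open connection are assumed to be passed in by the caller.

```javascript
// rethinkdb (the driver module) and connection (an open connection) are
// supplied by the caller; totalAndAverageAge is a hypothetical helper name.
function totalAndAverageAge(rethinkdb, connection, callback) {
  // sum("age") adds up the age field of every document in the table.
  rethinkdb.table("users").sum("age").run(connection, function (sumErr, total) {
    if (sumErr) {
      return callback(sumErr);
    }
    // avg("age") averages the same field.
    rethinkdb.table("users").avg("age").run(connection, function (avgErr, average) {
      if (avgErr) {
        return callback(avgErr);
      }
      callback(null, { total: total, average: average });
    });
  });
}
```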

Min and Max

These find the maximum and minimum of the numbers provided as an expression or as a field.

For example, to find the age of the oldest user in the database, we use the following code:

rethinkdb.table("users")("age").max().run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});

We use the same approach to find the age of the youngest user:

rethinkdb.table("users")("age").min().run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});
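
The queries above return only a number. When the whole document of the oldest or youngest user is wanted, the field name can be passed to max() or min() instead. The sketch below assumes the driver module and an open connection are passed in by the caller; findOldestUser and findYoungestUser are hypothetical helper names.

```javascript
// rethinkdb (the driver module) and connection (an open connection) are
// supplied by the caller; both function names are hypothetical.
function findOldestUser(rethinkdb, connection, callback) {
  rethinkdb.table("users")
    .max("age") // returns the whole document with the largest age
    .run(connection, callback);
}

function findYoungestUser(rethinkdb, connection, callback) {
  rethinkdb.table("users")
    .min("age") // returns the whole document with the smallest age
    .run(connection, callback);
}
```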

Distinct

Distinct removes duplicate elements from the sequence, just like DISTINCT in SQL.

For example, the following code lists the unique user names:

rethinkdb.table("users")("name").distinct().run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});

It should return an array containing the names as follows:

[ 'John', 'Mary', 'Michael' ] 
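
For comparison, the same idea can be sketched in plain JavaScript with a Set. This is an in-memory analogy only, not how RethinkDB implements distinct(), and the names array is sample data assumed for illustration.

```javascript
// Sample data, assumed for illustration only.
const names = ["John", "John", "Mary", "Michael"];

// A Set keeps each value once; Array.from turns it back into an array.
const uniqueNames = Array.from(new Set(names));

console.log(uniqueNames); // [ 'John', 'Mary', 'Michael' ]
```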

Contains

Contains checks whether the sequence includes the given value and returns a Boolean response: true if it does, false otherwise.

For example, we use the following code to check whether any user is named John:

rethinkdb.table("users")("name").contains("John").run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});

This should return true.
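
Note that the check compares whole values rather than substrings. The plain JavaScript sketch below illustrates the same semantics in memory with Array.prototype.includes; it is an analogy, not RethinkDB code, and userNames is sample data assumed for illustration.

```javascript
// Sample data, assumed for illustration only.
const userNames = ["John", "John", "Mary", "Michael"];

// The whole value must match an element of the sequence.
const hasJohn = userNames.includes("John");
const hasJo = userNames.includes("Jo");

console.log(hasJohn); // true
console.log(hasJo);   // false
```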

Map and reduce

Aggregate functions such as count() and sum() already make use of map and reduce internally and, if required, group() too. You can of course use them explicitly in order to perform various functions.
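
As a minimal sketch of explicit use, the following function computes the sum of all ages with map() and reduce(), which should give the same result as the sum() query shown earlier. The function name sumAgesWithMapReduce is hypothetical, and the driver module and an open connection are assumed to be passed in by the caller.

```javascript
// rethinkdb (the driver module) and connection (an open connection) are
// supplied by the caller; sumAgesWithMapReduce is a hypothetical helper name.
function sumAgesWithMapReduce(rethinkdb, connection, callback) {
  rethinkdb.table("users")
    // Map: turn each user document into its age.
    .map(function (user) {
      return user("age");
    })
    // Reduce: add the ages pairwise until one value remains.
    .reduce(function (left, right) {
      return left.add(right);
    })
    .run(connection, callback);
}
```

Placing group("name") before map() in the chain would turn this into a grouped query that yields one sum per name.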
