MapReduce is a programming model for performing operations (mainly aggregation) on data sets distributed across clusters of servers. The model was introduced by Google, which initially used it together with the Google File System, and it was later adopted by the open source Hadoop project.
MapReduce works by processing the data on each server and then combining the partial results into a single result set. It divides the work into two operations: Map and Reduce.
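In RethinkDB, MapReduce queries operate in three steps: a group step that partitions the elements into groups, a map step that transforms the elements of each group, and a reduce step that aggregates the mapped values into a single value.
For this reason it is usually described as a Group MapReduce (GMR) operation. RethinkDB spreads the MapReduce query across the servers of the cluster in order to improve efficiency. There are specific commands to perform each step of a GMR operation; however, RethinkDB has already integrated them internally into several aggregate functions in order to simplify the process.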
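The group, map, and reduce stages can be sketched in plain JavaScript on an in-memory array. This only illustrates the semantics and is not how RethinkDB executes a query; the sample `users` array and the `groupMapReduce` helper are hypothetical names introduced for this example.

```javascript
// Hypothetical sample data, loosely modelled on the users table used in this section.
const users = [
  { name: "John", age: 24 },
  { name: "John", age: 24 },
  { name: "Mary", age: 32 },
  { name: "Michael", age: 28 }
];

// Illustrative helper (not part of the RethinkDB driver).
function groupMapReduce(docs, groupFn, mapFn, reduceFn) {
  const groups = new Map();
  for (const doc of docs) {
    // Group: partition the documents by key.
    const key = groupFn(doc);
    if (!groups.has(key)) groups.set(key, []);
    // Map: transform each document into the value to aggregate.
    groups.get(key).push(mapFn(doc));
  }
  // Reduce: fold the values of each group into a single value.
  const result = {};
  for (const [key, values] of groups) {
    result[key] = values.reduce(reduceFn);
  }
  return result;
}

// Count the users per name: every document maps to 1, and the 1s are added up.
const usersPerName = groupMapReduce(users, u => u.name, () => 1, (a, b) => a + b);
console.log(usersPerName); // { John: 2, Mary: 1, Michael: 1 }
```

In ReQL, the built-in aggregate functions hide these stages: chaining count() after group("name") would express the same per-name count in a single query.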
Let us perform some aggregation operations in RethinkDB.
To group the data on the basis of a field, we can use the group() ReQL function. Here is a sample query on our users table that groups the data by name:
rethinkdb.table("users").group("name").run(connection, function(err, cursor) {
  if (err) {
    throw new Error(err);
  }
  cursor.toArray(function(err, data) {
    console.log(JSON.stringify(data));
  });
});
Here is the output for the same:
[
  {
    "group": "John",
    "reduction": [
      { "age": 24, "id": "664fced5-c7d3-4f75-8086-7d6b6171dedb", "name": "John" },
      {
        "address": {
          "address1": "suite 300",
          "address2": "Broadway",
          "map": { "latitude": "116.4194W", "longitude": "38.8026N" },
          "state": "Navada",
          "street": "51/A"
        },
        "age": 24,
        "id": "f6f1f0ce-32dd-4bc6-885d-97fe07310845",
        "name": "John"
      }
    ]
  },
  {
    "group": "Mary",
    "reduction": [
      { "age": 32, "id": "c8e12a8c-a717-4d3a-a057-dc90caa7cfcb", "name": "Mary" }
    ]
  },
  {
    "group": "Michael",
    "reduction": [
      { "age": 28, "id": "4228f95d-8ee4-4cbd-a4a7-a503648d2170", "name": "Michael" }
    ]
  }
]
If you observe the query response, the data is grouped by name, and each group is represented by one object. All the documents that match a group reside in its reduction array. To work on the reduction arrays, you can use the ungroup() ReQL function, which takes the grouped stream of data and converts it into an array of objects. This is useful for performing operations such as sorting on the grouped values.
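Once ungrouped, the result is an ordinary array of objects with group and reduction fields, so it can be sorted like any other sequence. The sketch below mimics that on the client side in plain JavaScript, using a trimmed, hypothetical copy of the output shown above; with ReQL, the sorting would normally be chained after ungroup() so that it runs on the server.

```javascript
// Trimmed, hypothetical copy of the grouped output (only the fields needed here).
const groupedUsers = [
  { group: "John", reduction: [{ name: "John", age: 24 }, { name: "John", age: 24 }] },
  { group: "Mary", reduction: [{ name: "Mary", age: 32 }] },
  { group: "Michael", reduction: [{ name: "Michael", age: 28 }] }
];

// Replace each reduction array with its size, then sort by that size, largest first.
const namesByFrequency = groupedUsers
  .map(g => ({ group: g.group, reduction: g.reduction.length }))
  .sort((a, b) => b.reduction - a.reduction);

console.log(namesByFrequency);
```

The most frequent name ends up first in namesByFrequency, which is the kind of ordering that is not available while the data is still in its grouped form.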
We can count the number of documents present in a table, or the number of elements in a nested field of a document, using the count() method. Here is a simple example:
rethinkdb.table("users").count().run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});
It should return the number of documents present in the table. You can also use it to count the elements of a nested field by selecting the field and calling the count() function at the end.
The sum() function adds up a sequence of values. If the sequence is passed as an expression, sum() adds up its elements; otherwise, it sums the values of the field provided in the query.
For example, to find the sum of the ages of all users:
rethinkdb.table("users")("age").sum().run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});
You can of course use an expression to perform a math operation like this:
rethinkdb.expr([1,3,4,8]).sum().run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});
This should return 16.
The avg() function computes the average of the given numbers, or of the values of the field provided in the query. For example, look at the following code:
rethinkdb.expr([1,3,4,8]).avg().run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});
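To see how an average fits the map and reduce stages, one possible decomposition is sketched below in plain JavaScript: each number is mapped to a partial aggregate holding a sum and a count, the partial aggregates are reduced by adding them, and the division happens once at the end. This decomposition is an assumption made for illustration and not a description of RethinkDB's internals; the names `partials`, `total`, and `average` are hypothetical.

```javascript
const values = [1, 3, 4, 8];

// Map: turn every number into a partial aggregate.
const partials = values.map(v => ({ sum: v, count: 1 }));

// Reduce: merge two partial aggregates. The merge is associative, so partial
// results computed on different servers could be combined in any grouping.
const total = partials.reduce((a, b) => ({
  sum: a.sum + b.sum,
  count: a.count + b.count
}));

const average = total.sum / total.count;
console.log(total.sum, average); // 16 4
```

Carrying the count alongside the sum is what makes the reduction safe to distribute: averaging partial averages directly would give a wrong result whenever the partitions differ in size.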
The max() and min() functions find the maximum and minimum of the numbers provided as an expression or as a field.
For example, to find the age of the oldest user in the database, we use the following code:
rethinkdb.table("users")("age").max().run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});
We use the same approach to find the age of the youngest user:
rethinkdb.table("users")("age").min().run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});
The distinct() function finds and removes duplicate elements from a sequence, just like DISTINCT in SQL.
For example, the following code finds the unique user names:
rethinkdb.table("users")("name").distinct().run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});
It should return an array containing the names as follows:
[ 'John', 'Mary', 'Michael' ]
The contains() function looks for a value in a sequence and returns a Boolean response: true if the sequence contains the value, false otherwise.
For example, we use the following code to check whether any user is named John:
rethinkdb.table("users")("name").contains("John").run(connection, function(err, data) {
  if (err) {
    throw new Error(err);
  }
  console.log(data);
});
This should return true.