Time for action — Map/Reduce via the mongo console

Let's execute the following commands:

$ mongo
MongoDB shell version: 2.0.2
connecting to: test
mongos> use sodibee_development
switched to db sodibee_development
mongos> map = function () {
    emit(this.name.toLowerCase()[0], {count:1});
}
mongos> reduce = function (key, values) {
    var r = {count:0};
    values.forEach(function (value) {
        r.count += value.count;
    });
    return r;
}
mongos> res = db.authors.mapReduce(map, reduce, { out: "authors_dr" } );
mongos> db.authors_dr.find()
{ "_id" : "a", "value" : { "count" : 1020 } }
{ "_id" : "b", "value" : { "count" : 477 } }
{ "_id" : "c", "value" : { "count" : 719 } }
{ "_id" : "d", "value" : { "count" : 586 } }
{ "_id" : "e", "value" : { "count" : 678 } }
{ "_id" : "f", "value" : { "count" : 240 } }
{ "_id" : "g", "value" : { "count" : 396 } }
...

What just happened?

A Map/Reduce task consists of two functions: the map function and the reduce function (the reducer). Let's see this in detail:

map = function () {
    emit(this.name.toLowerCase()[0], {count:1});
}

This function will be executed for each Author document. It first takes the name and converts it to lowercase. Then, it emits the first character of the name along with the count as 1.
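To see what the map function emits without a running MongoDB instance, the following is a minimal sketch in plain JavaScript. The `authors` array and the local `emit` function are stand-ins invented for illustration; inside MongoDB, `emit` is provided by the server and each document is bound to `this`.

```javascript
// Hypothetical sample documents; the real data lives in db.authors.
var authors = [
  { name: "Alice Munro" },
  { name: "albert camus" },
  { name: "Boris Pasternak" }
];

// Stand-in for the emit() that MongoDB provides to map functions.
var emitted = [];
function emit(key, value) {
  emitted.push({ key: key, value: value });
}

// The map function from the session above, unchanged.
var map = function () {
  emit(this.name.toLowerCase()[0], {count:1});
};

// MongoDB binds each document to `this`; call() mimics that here.
authors.forEach(function (doc) {
  map.call(doc);
});

console.log(JSON.stringify(emitted));
```

With these three sample names, the sketch collects two emits under the key "a" and one under the key "b", each carrying a count of 1.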

The reduce function looks like the following:

reduce = function (key, values) {
    var r = {count:0};
    values.forEach(function (value) {
        r.count += value.count;
    });
    return r;
}

The reduce function takes two parameters: the key that was emitted and an array of the values for this particular key.
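As a rough illustration, the reduce function can be called by hand in plain JavaScript. The key and the array of values below are made-up inputs, not data from the authors collection.

```javascript
// The reduce function from the session above, unchanged.
var reduce = function (key, values) {
  var r = {count:0};
  values.forEach(function (value) {
    r.count += value.count;
  });
  return r;
};

// Hypothetical input: three values emitted for the key "a".
var result = reduce("a", [{count:1}, {count:1}, {count:1}]);

console.log(JSON.stringify(result)); // {"count":3}
```

Note that the key is not used inside this particular reducer; it only identifies which group of values is being summed.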

The map function is executed once for each document in the dataset. A reducer, however, may be called several times for the same key: the array it is given can contain results emitted by the map function as well as temporary results returned by earlier reduce calls.

For example, suppose we have 10 authors whose names start with "a". The mappers would emit 10 results. However, a single reducer call does not necessarily receive all 10 of them at once. It may be given a freshly emitted result, such as {count: 1}, along with a temporary reduced result, such as {count: 8}.

Tip

It's very important not to assume that the value passed to the reducers is the same as that emitted from the map function. In most cases, it would be different.
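The following sketch, in plain JavaScript, simulates this re-reduce behavior. The way the ten values are split into a partial pass and a final pass is an assumption made for illustration; MongoDB decides the actual batching itself. The `naive` reducer is a hypothetical counter-example, not code from this chapter.

```javascript
// The reduce function from the session above, unchanged.
var reduce = function (key, values) {
  var r = {count:0};
  values.forEach(function (value) {
    r.count += value.count;
  });
  return r;
};

// Ten hypothetical emits for the key "a".
var ones = [];
for (var i = 0; i < 10; i++) {
  ones.push({count:1});
}

// One pass over all ten emitted values.
var single = reduce("a", ones);

// Two passes: reduce eight values first, then feed the temporary
// result back in together with the two remaining emitted values.
var partial = reduce("a", ones.slice(0, 8));
var combined = reduce("a", [partial, ones[8], ones[9]]);

// A hypothetical reducer that wrongly assumes every value is a raw emit.
var naive = function (key, values) {
  return {count: values.length};
};
var wrong = naive("a", [partial, ones[8], ones[9]]);

console.log(single.count, combined.count, wrong.count);
```

Because the reducer sums `value.count` instead of counting array elements, the one-pass and two-pass results agree, while the naive version loses the eight authors hidden inside the temporary result.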

This is what the result of the mapReduce function looks like:

mongos> res = db.authors.mapReduce(map, reduce, { out: "authors_dr" } );
{
    "result" : "authors_dr",
    "shardCounts" : {
        "localhost:27025" : {
            "input" : 0,
            "emit" : 0,
            "reduce" : 0,
            "output" : 0
        },
        "sodibee/localhost:27018,localhost:27020,localhost:27019" : {
            "input" : 10000,
            "emit" : 10000,
            "reduce" : 251,
            "output" : 26
        }
    },
    "counts" : {
        "emit" : NumberLong(10000),
        "input" : NumberLong(10000),
        "output" : NumberLong(26),
        "reduce" : NumberLong(251)
    },
    "ok" : 1,
    "timeMillis" : 980,
    "timing" : {
        "shards" : 633,
        "final" : 346
    },
}

As we can see, there are 10,000 emitted results but only 251 reducer invocations!

Note

In a sharded environment, MongoDB automatically distributes the map functions if the input collection is sharded. By default, the output collection of the reduce function is not sharded and remains on one of the shards.

It's interesting to note that all 10,000 documents were processed by only one shard, because the data is stored on that shard only. If the data grows beyond the chunk size set in the configuration, it will be split and spread across the shards.

Implementing this in Ruby is no different from doing it in the mongo console. As we have to pass the JavaScript functions to MongoDB, we pass them as strings!
