Let's execute the following commands:
$ mongo
MongoDB shell version: 2.0.2
connecting to: test
mongos> use sodibee_development
switched to db sodibee_development
mongos> map = function () {
emit(this.name.toLowerCase()[0], {count:1});
}
mongos> reduce = function (key, values) {
var r = {count:0};
values.forEach(function (value) {
r.count += value.count;
});
return r;
}
mongos> res = db.authors.mapReduce(map, reduce, { out: "authors_dr" } );
mongos> db.authors_dr.find()
{ "_id" : "a", "value" : { "count" : 1020 } }
{ "_id" : "b", "value" : { "count" : 477 } }
{ "_id" : "c", "value" : { "count" : 719 } }
{ "_id" : "d", "value" : { "count" : 586 } }
{ "_id" : "e", "value" : { "count" : 678 } }
{ "_id" : "f", "value" : { "count" : 240 } }
{ "_id" : "g", "value" : { "count" : 396 } }
...
Running a Map/Reduce task is about the map
function and the reducer. Let's see this in detail:
map = function () {
emit(this.name.toLowerCase()[0], {count:1});
}
This function will be executed for each Author
document. It first takes the name and converts it to lowercase. Then, it emits the first character of the name along with the count as 1
.
The reduce
function looks like the following:
reduce = function (key, values) {
var r = {count:0};
values.forEach(function (value) {
r.count += value.count;
});
return r;
}
The reduce
function takes two parameters: the key
that was emitted and an array of the values
for this particular key.
A map
function is executed once for each member of the dataset. In case of reducers however, it is given an array of results emitted by the mapper function as well as the temporary reduced results.
For example, suppose we have 10 authors starting with "a". There would be 10 results emitted by the mappers. However, when the reducer function is called, it would be given the emitted result that is { count: 1}
along with a temporary reduced result, {count: 8}
.
It's very important not to assume that the value passed to the reducers is the same as that emitted from the map function. In most cases, it would be different.
This is what the result of the mapReduce
function looks like:
mongos> res = db.authors.mapReduce(map, reduce, { out: "authors_dr" } ); { "result" : "authors_dr", "shardCounts" : { "localhost:27025" : { "input" : 0, "emit" : 0, "reduce" : 0, "output" : 0 }, "sodibee/localhost:27018,localhost:27020,localhost:27019" : { "input" : 10000, "emit" : 10000, "reduce" : 251, "output" : 26 } }, "counts" : { "emit" : NumberLong(10000), "input" : NumberLong(10000), "output" : NumberLong(26), "reduce" : NumberLong(251) }, "ok" : 1, "timeMillis" : 980, "timing" : { "shards" : 633, "final" : 346 }, }
As we can see, there are 10,000 emitted results but only 251 reducer invocations!
In a sharded environment, MongoDB automatically distributes the map
functions if the input collection is sharded. By default, the output collection of the reduce
function is not shared and remains on one of the shards.
It's interesting to note that the request for 10,000 nodes went to only one shard because the data is stored on that node only. If the chunk size increases beyond that value set in the configuration, then it will get sharded.
Implementing this in Ruby is no different from MongoDB. As we have to pass the JavaScript functions to MongoDB, we do it via strings!
18.118.37.214