MongoDB as a data warehouse

Apache Hadoop is often described as the 800-pound gorilla of big data frameworks. Apache Spark, by comparison, is more like a 200-pound cheetah: its speed, agility, and performance characteristics allow it to work well on a subset of the problems Hadoop aims to solve.

MongoDB, on the other hand, can be described as the MySQL equivalent of the NoSQL world because of its adoption and ease of use. MongoDB also offers an aggregation framework, MapReduce capabilities, and horizontal scaling through sharding, which is essentially data partitioning at the database level.
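To make the aggregation framework concrete, here is an illustrative sketch, not MongoDB code itself, that imitates two common pipeline stages ($match and $group with a $sum accumulator) over in-memory documents. The function names and the sample `orders` data are invented for this example; in MongoDB, the equivalent pipeline would be passed to a collection's `aggregate` method.

```python
# Illustrative only: an in-memory imitation of two MongoDB aggregation
# pipeline stages, showing how documents flow through successive
# transformations. Names and data here are hypothetical.
from collections import defaultdict

def match(docs, predicate):
    # Analogue of a $match stage: keep documents satisfying the predicate.
    return [d for d in docs if predicate(d)]

def group_sum(docs, key_field, sum_field):
    # Analogue of a $group stage with a $sum accumulator.
    totals = defaultdict(int)
    for d in docs:
        totals[d[key_field]] += d[sum_field]
    return [{"_id": k, "total": v} for k, v in totals.items()]

orders = [
    {"status": "shipped", "customer": "a", "amount": 10},
    {"status": "pending", "customer": "a", "amount": 5},
    {"status": "shipped", "customer": "b", "amount": 7},
]

# Pipeline: filter to shipped orders, then total the amounts per customer.
shipped = match(orders, lambda d: d["status"] == "shipped")
result = sorted(group_sum(shipped, "customer", "amount"),
                key=lambda d: d["_id"])
print(result)  # [{'_id': 'a', 'total': 10}, {'_id': 'b', 'total': 7}]
```

The real aggregation framework runs such pipelines inside the database, stage by stage, which is part of why it is attractive for analytics workloads.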

So, naturally, some people have been asking: why not use MongoDB as our data warehouse and simplify our architecture?

This is a compelling argument, and whether it makes sense to use MongoDB as a data warehouse depends on the use case.

The advantages of such a decision are:

  • Simpler architecture
  • Less need for message queues, reducing latency in our system

The disadvantages:

  • MongoDB's MapReduce framework is not a replacement for Hadoop's MapReduce. Although they both follow the same philosophy, Hadoop can scale to accommodate far larger workloads.
  • Scaling MongoDB's document storage using sharding will hit a wall at some point. Whereas Yahoo! has reported using 42,000 servers in its largest Hadoop cluster, the largest reported MongoDB commercial deployments stand at 5 billion documents (Craigslist) and 600 nodes with petabytes of data (Baidu, the internet giant that dominates, among other markets, Chinese internet search).

There is more than an order of magnitude of difference between the two in terms of scaling. There is also a difference in design goals:

  • MongoDB is mainly designed as a real-time querying database operating on data stored on disk, whereas Hadoop's MapReduce is designed around processing batches and Spark is designed around processing streams of data
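The shared MapReduce philosophy mentioned above can be sketched in a few lines: a map step emits key/value pairs, the pairs are shuffled (grouped by key), and a reduce step folds each group into a result. The word-count example below is the classic illustration; it is a conceptual sketch, not Hadoop or MongoDB API code.

```python
# Conceptual sketch of the MapReduce model shared (in spirit) by Hadoop
# and MongoDB's mapReduce: map -> shuffle (group by key) -> reduce.
from collections import defaultdict

def map_step(doc):
    # Emit (word, 1) for each word in a document.
    for word in doc.split():
        yield word, 1

def shuffle(pairs):
    # Group emitted values by key, as the framework does between phases.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_step(key, values):
    # Fold each group of values into a single count per key.
    return key, sum(values)

docs = ["big data", "big frameworks"]
pairs = [pair for doc in docs for pair in map_step(doc)]
counts = dict(reduce_step(k, vs) for k, vs in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 1, 'frameworks': 1}
```

The difference between the two systems is not this model but how far it scales: Hadoop distributes these phases across thousands of nodes, while MongoDB's implementation is bounded by the limits of its sharded clusters.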