3.1 INTRODUCTION
There are two ways in which any system that performs computations can be designed. It can be built as a monolithic system, where data and processes reside on a single machine, or it can be developed as a distributed system comprising multiple machines, where data and processes are spread across those machines.
Let us compare this with the choices involved in selecting the players for a cricket team. A team can be built around a star cricketer who can bat, bowl and field, and who is therefore expected to win every match for the team. Another option is to put together a team of good players, none of them a star all-rounder, but who work together so well that, as a team, they bring in all the required skills.
Likewise, a computational system can be built in either of these two ways. The monolithic approach is the equivalent of building a cricket team around a single star player, an all-rounder who is expected to excel in every department of the game and who can be relied on completely. A monolithic system is typically one powerful server; when the demand for computing capacity grows, more is spent to increase the resources of that single server. However, the computing power gained does not grow in the same proportion as the resources added. This approach is called vertical scaling, and its performance does not vary linearly with the cost of resources.
On the other hand, a distributed system is the equivalent of a team of good players, none of them a star all-rounder, but who work together efficiently. Here, the responsibility for performance does not lie with a single machine but is shared by all the machines. The individual machines in a distributed system are called nodes, and the entire system is called a cluster. None of the individual nodes in the cluster is a supercomputer; in fact, they are typically inexpensive machines, or commodity hardware. Together, however, these machines can serve all the data-processing needs at a large scale.
What makes such an arrangement ideal for Big Data processing is that a cluster of machines can scale linearly with the amount of data to be processed. The capacity of the cluster increases with the number of machines added to it. For example, if the number of nodes in the cluster is doubled, the storage capacity doubles: each machine comes with its own hard disk, so doubling the number of machines naturally doubles the total storage. The distributed configuration is attractive for more than storage, however. When the number of machines in a cluster is doubled, the speed of execution also nearly doubles, thanks to parallelism. A distributed system thus satisfies the third critical requirement of big data processing, namely scale, and this is known as horizontal scaling. It is the primary reason why corporate giants such as Google, Facebook and Amazon, which deal with humongous amounts of data, build large data centres populated with hundreds or even thousands of machines processing data in parallel.
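To make the linear-scaling idea concrete, the short Python sketch below models a cluster's total storage and its ideal parallel scan time as nodes are added. The per-node storage and throughput figures are illustrative assumptions, not values from the text, and real clusters rarely achieve perfectly linear speed-up because of coordination overhead.

```python
# Back-of-the-envelope model of horizontal scaling.
# The per-node figures below are illustrative assumptions, not values from the text.

NODE_STORAGE_TB = 4        # storage contributed by each commodity node (assumed)
NODE_THROUGHPUT_GBPS = 1   # data one node can scan per second, in GB/s (assumed)

def cluster_capacity(num_nodes, data_size_tb):
    """Estimate total storage and ideal parallel scan time for a cluster."""
    total_storage_tb = num_nodes * NODE_STORAGE_TB                     # grows linearly with nodes
    scan_time_s = (data_size_tb * 1024) / (num_nodes * NODE_THROUGHPUT_GBPS)
    return total_storage_tb, scan_time_s

for nodes in (10, 20, 40):
    storage, seconds = cluster_capacity(nodes, data_size_tb=10)
    print(f"{nodes:>3} nodes: {storage} TB of storage, ~{seconds:.0f} s to scan 10 TB")
# Doubling the nodes doubles the storage and roughly halves the scan time.
```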
Once the distributed hardware is set up, specialized software is needed to coordinate all of these servers so that together they can solve a data-processing problem. This software takes care of partitioning the data, which means that a piece of data is broken up into chunks that are in turn stored across multiple nodes in the cluster. The data is also protected against loss by replicating each chunk on more than one node. With data spread across multiple nodes in this way, a data-processing task can run in parallel on multiple machines. In order to run a process, this software
needs to ensure that a particular node has all the resources it requires in terms of memory and computing capacity.
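As a rough illustration of what such coordination software does, the sketch below partitions a byte string into fixed-size chunks and assigns each chunk to several nodes. It is a simplified model built on assumed parameters (chunk_size, replication_factor, round-robin placement), not the placement logic of any particular framework, although systems such as HDFS work along broadly similar lines.

```python
from itertools import cycle

def partition_and_replicate(data: bytes, nodes: list[str],
                            chunk_size: int = 64, replication_factor: int = 3):
    """Split data into fixed-size chunks and assign each chunk to several nodes.

    Returns the chunks and a placement map: chunk index -> nodes holding a replica.
    """
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    node_ring = cycle(nodes)                      # simple round-robin placement
    placement = {}
    for idx, _chunk in enumerate(chunks):
        # Each chunk is copied to replication_factor nodes, so it survives node failures.
        placement[idx] = [next(node_ring) for _ in range(replication_factor)]
    return chunks, placement

chunks, placement = partition_and_replicate(
    b"x" * 300, nodes=["node1", "node2", "node3", "node4"])
for idx, replicas in placement.items():
    print(f"chunk {idx}: stored on {replicas}")
```

Because every node holds only some of the chunks, a processing task can be shipped to the nodes that already store the relevant data and run on all of them in parallel.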