A standard Hadoop architecture 

Let's understand a standard Hadoop architecture:

  • Hadoop Distributed File System (HDFS): A distributed filesystem instantiated across the local disks attached to the compute nodes in the Hadoop cluster
  • Map: The embarrassingly parallel phase that applies the same computation, in parallel, to every chunk of data read from HDFS
  • Reduce: The phase that takes the map outputs and combines them to perform the final computation (a minimal code sketch of both phases follows this list)
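
To make the Map and Reduce phases concrete, here is a minimal word-count sketch using the standard Hadoop MapReduce Java API (org.apache.hadoop.mapreduce). The class name and the input/output paths passed on the command line are illustrative assumptions, not part of the architecture described above:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs in parallel on every HDFS input split, emitting (word, 1).
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: combines all map outputs for a given word into a total count.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Input is read from HDFS in parallel; the final results are written back to HDFS.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```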

The final results are typically stored back into HDFS. Serengeti, an open source project, provides easy provisioning, multi-tenancy, and the flexibility to scale up or out. BDE allows Serengeti to be triggered from a vRealize blueprint, making it easy to self-provision a Hadoop cluster of a given size, as shown in the following diagram:

The preceding diagram shows a virtualized Hadoop environment. Local disks are made available as VMDKs to the guest, and the Map and Reduce tasks run in VMs on each Hadoop cluster node.

The next generation of our approach uses two types of VMs: Compute nodes and Data nodes. The Data node is responsible for managing the physical disks attached to the host and for HDFS. The Compute nodes run the Map and Reduce tasks. The Compute nodes and Data nodes exchange HDFS traffic over a fast VM-to-VM communication path.
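
To illustrate this separation, here is a small sketch of how a task in a Compute node VM reads its input from HDFS served by the Data node VMs, using Hadoop's standard FileSystem API. The NameNode hostname (hdfs-data-vm), port, and file path are hypothetical placeholders, not values from the architecture above:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ComputeNodeRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The Compute node VM stores no HDFS blocks itself; it contacts the HDFS
    // service running in the Data node VMs, so block traffic flows VM-to-VM.
    FileSystem fs = FileSystem.get(URI.create("hdfs://hdfs-data-vm:8020"), conf);

    try (FSDataInputStream in = fs.open(new Path("/data/input/part-00000"));
         BufferedReader reader =
             new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // A real Map task would process each record here.
        System.out.println(line);
      }
    }
  }
}
```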
