Installing Hadoop plus Spark clusters

Before installing Hadoop and Spark, let's look at their versions. Spark is offered as a service in all three popular Hadoop distributions: Cloudera, Hortonworks, and MapR. As of writing this book, the current Hadoop and Spark versions are 2.7.2 and 2.0 respectively. However, a Hadoop distribution might ship a lower version of Spark, as the Hadoop and Spark release cycles do not coincide.

For the practical exercises in the upcoming chapters, let's use one of the free virtual machines (VMs) from Cloudera, Hortonworks, or MapR, or an open source version of Apache Spark. These VMs make it easy to get started with Spark and Hadoop. The same exercises can be run on bigger clusters as well.

The prerequisites to use virtual machines on your laptop are as follows:

  • RAM of 8 GB and above
  • At least two virtual CPUs
  • The latest VMware Player or Oracle VirtualBox must be installed for Windows or Linux OS
  • The latest Oracle VirtualBox or VMware Fusion for Mac
  • Virtualization is enabled in the BIOS (a quick check is shown after this list)
  • Chrome 25+, IE 9+, Safari 6+, or Firefox 18+ is recommended (HDP Sandbox will not run on IE 10)
  • PuTTY
  • WinSCP
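
To verify the virtualization prerequisite, a minimal check on a Linux host is shown below; on Windows, run systeminfo in a command prompt and look at the Hyper-V requirements section. If the count below is zero, enable VT-x/AMD-V in the BIOS:

    # A nonzero count means the CPU exposes Intel VT-x (vmx) or AMD-V (svm)
    egrep -c '(vmx|svm)' /proc/cpuinfo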

The instructions to download and run Cloudera Distribution for Hadoop (CDH) are as follows:

  1. Download the latest quickstart CDH VM from http://www.cloudera.com/content/www/en-us/downloads.html. Download the appropriate version based on the virtualization software (VirtualBox or VMware) installed on the laptop.
  2. Extract it to a directory (use 7-Zip or WinZip).
  3. In the case of VMware Player, click on Open a Virtual Machine, and point to the directory where you have extracted the VM. Select the cloudera-quickstart-vm-5.x.x-x-vmware.vmx file and click on Open.
  4. Click on Edit virtual machine settings and then increase memory to 7 GB (if your laptop has 8 GB RAM) or 8 GB (if your laptop has more than 8 GB RAM). Increase the number of processors to four. Click on OK.
  5. Click on Play virtual machine.
  6. Select I copied it and click on OK.
  7. This should get your VM up and running.
  8. Cloudera Manager is installed on the VM but is turned off by default. If you would like to use Cloudera Manager, double-click and run Launch Cloudera Manager Express to set up Cloudera Manager. This will be helpful in the starting / stopping / restarting of services on the cluster.
  9. Credentials for the VM are username (cloudera) and password (cloudera).
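
Once the VM is up, you can sanity-check the installation from a terminal inside the VM. The following is a quick smoke test; the versions printed depend on the CDH release you downloaded:

    hadoop version    # prints the bundled Hadoop/CDH version
    spark-shell       # starts the bundled Spark shell; type sc.version at the prompt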

If you would like to use the Cloudera Quickstart Docker image, follow the instructions on http://blog.cloudera.com/blog/2015/12/docker-is-the-new-quickstart-option-for-apache-hadoop-and-cloudera.
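
At the time of writing, the documented Docker commands were along the following lines; treat the image name and entry point as assumptions to verify against the blog post, as published images change over time:

    docker pull cloudera/quickstart:latest
    docker run --hostname=quickstart.cloudera --privileged=true -t -i \
        cloudera/quickstart /usr/bin/docker-quickstart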

The instructions to download and run Hortonworks Data Platform (HDP) Sandbox are as follows:

  1. Download the latest HDP Sandbox from http://hortonworks.com/products/hortonworks-sandbox/#install. Download the appropriate version based on the virtualization software (VirtualBox or VMware) installed on the laptop.
  2. Follow the instructions from install guides on the same downloads page.
  3. Open the browser and enter the address shown in the sandbox console, for example, http://192.168.139.158/. Click on View Advanced Options to see all the links.
  4. Access the sandbox with PuTTY as the root user and hadoop as the initial password. You need to change the password on the first login. Also, run the ambari-admin-password-reset command to reset the Ambari admin password.
  5. To start using Ambari, open the browser and enter ipaddressofsandbox:8080 with the admin credentials created in the preceding step. Start the services needed in Ambari.
  6. To map the hostname to the IP address in Windows, edit C:\Windows\System32\drivers\etc\hosts and enter the IP address and hostname separated by a space, as shown in the example after this list. You need admin rights to do this.
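
A minimal hosts entry is shown below. The IP address is the example from step 3 and the hostname is the default used by recent HDP sandboxes; verify both (run hostname -f on the sandbox) before copying it:

    # C:\Windows\System32\drivers\etc\hosts
    192.168.139.158 sandbox.hortonworks.com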

The instructions to download and run MapR Sandbox are as follows:

  1. Download the latest sandbox from https://www.mapr.com/products/mapr-sandbox-hadoop/download. Download the appropriate version based on the virtualization software (VirtualBox or VMware) installed on the laptop.
  2. Follow the instructions to set up Sandbox at http://doc.mapr.com/display/MapR/MapR+Sandbox+for+Hadoop.
  3. Use PuTTY (or ssh, as shown after this list) to log in to the sandbox.
  4. The root password is mapr.
  5. To launch HUE or the MapR Control System (MCS), navigate to the URL provided by the MapR Sandbox.
  6. To map the hostname to the IP address in Windows, edit C:\Windows\System32\drivers\etc\hosts and enter the IP address and hostname separated by a space.
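
If you prefer a plain ssh client over PuTTY, a login would look like the following. The port is an assumption: the VirtualBox image of the MapR Sandbox typically forwards guest port 22 to host port 2222, so check the port-forwarding rules in your VM's network settings:

    ssh root@localhost -p 2222    # password: mapr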

If you have a preinstalled Hadoop cluster, the following instructions download and run the Apache Spark prebuilt binaries. The same steps can also be used to install the latest version of Spark on the preceding VMs:

  1. Download Spark prebuilt for Hadoop from the following location:
    wget http://apache.mirrors.tds.net/spark/spark-2.0.0/spark-2.0.0-bin-hadoop2.7.tgz
    tar xzvf spark-2.0.0-bin-hadoop2.7.tgz
    cd spark-2.0.0-bin-hadoop2.7
    
  2. Add SPARK_HOME and PATH variables to the profile script as shown in the following commands so that these environment variables will be set every time you log in:
    [cloudera@quickstart ~]$ cat /etc/profile.d/spark2.sh
    export SPARK_HOME=/home/cloudera/spark-2.0.0-bin-hadoop2.7
    export PATH=$PATH:/home/cloudera/spark-2.0.0-bin-hadoop2.7/bin
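    # Optional: apply the variables to the current session without logging out
    source /etc/profile.d/spark2.sh
    echo $SPARK_HOME    # should print the Spark installation path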
    
  3. Let Spark know about the Hadoop configuration directory and Java home by adding the following environment variables to spark-env.sh. First, copy the template files in the conf directory:
    cp conf/spark-env.sh.template conf/spark-env.sh
    cp conf/spark-defaults.conf.template conf/spark-defaults.conf
    vi conf/spark-env.sh
    export HADOOP_CONF_DIR=/etc/hadoop/conf
    export JAVA_HOME=/usr/java/jdk1.7.0_67-cloudera
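    # Note: this JAVA_HOME matches the CDH quickstart VM; on other systems,
    # point it to your locally installed JDK (the path below is only an example):
    # export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64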
    
  4. Copy hive-site.xml to the conf directory of Spark so that Spark SQL can connect to the existing Hive metastore.
    cp /etc/hive/conf/hive-site.xml conf/
    
  5. Change the log level to ERROR in the spark-2.0.0-bin-hadoop2.7/conf/log4j.properties file after copying the template file, as shown in the example after this list.
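
For step 5, the template already contains the line to change; a minimal edit, assuming the stock Spark 2.0 template, looks like this:

    cp conf/log4j.properties.template conf/log4j.properties
    # In conf/log4j.properties, change
    #   log4j.rootCategory=INFO, console
    # to
    #   log4j.rootCategory=ERROR, console

After this, spark-shell starts with only error messages printed to the console.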

Tip

Programming languages version requirements to run Spark:

Java: 7+

Python: 2.6+/3.4+

R: 3.1+

Scala: 2.10 for Spark 1.6 and below; 2.11 for Spark 2.0 and above
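
A quick way to confirm the installed versions, assuming the tools are on your PATH:

    java -version
    python --version
    R --version
    scala -version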

Note that the preceding virtual machines are single-node clusters. If you are planning to set up a multi-node cluster, follow the guidelines of the respective distribution, such as CDH, HDP, or MapR. If you are planning to use the standalone cluster manager, the setup is described in the following chapter.
