Chapter 3. Deep Dive into Apache Spark

Apache Spark is growing at a fast pace in terms of technology, community, and user base. Two new APIs, the DataFrame API (introduced in Spark 1.3) and the Dataset API (introduced in Spark 1.6), are built on top of the core API, which is based on Resilient Distributed Datasets (RDDs). It is essential to understand the deeper concepts of RDDs, including Spark's runtime architecture and its behavior on the various resource managers.

This chapter is divided into the following subtopics:

  • Starting Spark daemons
  • Spark core concepts
  • Pairing RDDs
  • The lifecycle of a Spark program
  • Spark applications
  • Persistence and caching
  • Spark resource managers: Standalone, YARN, and Mesos

Starting Spark daemons

If you are planning to use the standalone cluster manager, you need to start the Spark master and worker daemons, which are the core components of Spark's standalone architecture. Starting and stopping daemons varies slightly from distribution to distribution. Hadoop distributions such as Cloudera, Hortonworks, and MapR provide Spark as a service, with YARN as the default resource manager, so all Spark applications run on the YARN framework by default on these platforms. To use Spark's standalone resource manager instead, we need to start the Spark master and worker roles; if you are planning to use the YARN resource manager, you don't need to start these daemons. Follow the procedure below depending on the type of distribution you are using. Downloading and installation instructions for all these distributions can be found in Chapter 2, Getting Started with Apache Hadoop and Apache Spark.
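
Whichever procedure below you follow, a quick way to confirm that the daemons came up is to list the running JVM processes with jps, which ships with the JDK. This is a minimal sketch; the process IDs shown are illustrative:

    # List running JVM processes; the standalone daemons appear as
    # Master (org.apache.spark.deploy.master.Master) and
    # Worker (org.apache.spark.deploy.worker.Worker)
    jps
    # Sample output (PIDs will differ on your machine):
    #   12345 Master
    #   12399 Worker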

Working with CDH

Cloudera's Distribution including Apache Hadoop (CDH) is an open source distribution that includes Hadoop, Spark, and many other projects needed for Big Data Analytics. Cloudera Manager is used for installing and managing the CDH platform. If you are planning to use the YARN resource manager, start the Spark service in Cloudera Manager. To start the Spark daemons for Spark's standalone resource manager, use the following procedure:

  1. Spark on the CDH platform is configured to work with YARN, and Spark 2.0 is not yet available on CDH. So, download the latest pre-built Spark 2.0 package for Hadoop as explained in Chapter 2, Getting Started with Apache Hadoop and Apache Spark. If you would like to use the Spark 1.6 version that ships with CDH, run the /usr/lib/spark/start-all.sh command.
  2. Start the daemons with the following commands:
    cd /home/cloudera/spark-2.0.0-bin-hadoop2.7/sbin
    sudo ./start-all.sh
    
  3. Check the Spark master web UI at http://quickstart.cloudera:8080/.
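
To confirm that the standalone master is accepting applications, you can point spark-shell at it. The following is a minimal sketch, reusing the quickstart.cloudera host and installation path from the steps above; it assumes the master registered with this hostname and the default RPC port 7077:

    cd /home/cloudera/spark-2.0.0-bin-hadoop2.7
    # Connect an interactive shell to the standalone master
    # (the master URL must match the one shown at the top of the master web UI)
    ./bin/spark-shell --master spark://quickstart.cloudera:7077
    # Inside the shell, run a trivial job to verify that executors are allocated:
    #   scala> sc.parallelize(1 to 100).count()
    #   res0: Long = 100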

Working with HDP, MapR, and Spark pre-built packages

Hortonworks Data Platform (HDP) and the MapR Converged Data Platform are distributions that also include Hadoop, Spark, and many other projects needed for Big Data Analytics. While HDP uses Apache Ambari for deploying and managing the cluster, MapR uses the MapR Control System (MCS); Spark's pre-built package has no specific component for managing Spark. If you are planning to use the YARN resource manager, start the Spark service in Ambari or MCS. To start the Spark daemons for Spark's standalone resource manager, use the following procedure:

  1. Start services with the following commands:
    • HDP: /usr/hdp/current/spark-client/sbin/start-all.sh
    • MapR: /opt/mapr/spark/spark-*/sbin/start-all.sh
    • Spark Package pre-built for Hadoop: ./sbin/start-all.sh

    For a multi-node cluster, start the Spark worker role on all machines with the following command:

    ./sbin/start-slave.sh spark://masterhostname:7077 
    

    Another option is to list the hostnames of the workers, one per line, in the conf/slaves file and then use the ./sbin/start-all.sh command to start the worker roles on all machines automatically, as illustrated in the sketch after this procedure.

  2. Check the logs located in the logs directory of the Spark installation. Look at the master web UI at http://masterhostname:8080. If this port is already taken by another service, the next available port is used; for example, in HDP, port 8080 is taken by Ambari, so the standalone master binds to 8081. To find the correct port number, check the logs, as shown below.
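
    As a concrete illustration of the last two steps, the sketch below shows a hypothetical conf/slaves file for a three-node cluster and one way to recover the master UI port from the logs. The worker hostnames are invented, and the exact wording of the log message varies across Spark versions:

    # Contents of conf/slaves: one worker hostname per line (hypothetical hosts)
    cat conf/slaves
    #   worker1.example.com
    #   worker2.example.com
    #   worker3.example.com

    # Start the master and every worker listed in conf/slaves
    # (requires passwordless SSH to the worker machines)
    ./sbin/start-all.sh

    # Find the port the master web UI actually bound to
    # (assumption: the UI name appears in the log line; otherwise search for "port")
    grep -i "masterwebui" logs/spark-*-org.apache.spark.deploy.master.Master-*.out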

    Note

    All programs in this chapter are executed on the CDH 5.8 VM. In other environments, the file paths might change, but the concepts are the same.
