Spark and big data analytics

Spark operates across nodes in the cluster in a similar way to Hive. The job or set of jobs that you execute over Spark is called an application. Applications run in Spark as independent sets of processes across the cluster, coordinated by the main component, called the Driver. The key object that operates in the Driver is called the SparkContext.

The following diagram is a very simple architecture diagram of Spark. Spark can connect to different types of cluster managers such as Apache Mesos, YARN, or its own simple Spark standalone cluster manager. YARN is the most common implementation.

Apache Spark architecture. Source: Apache Software Foundation

Through the cluster manager, the driver program acquires resources on the worker nodes: system processes, called Executors, that run computations and store data in support of the application. An Executor processes one or more tasks as instructed by the driver program.

A SparkContext is created for each user session and is isolated from other SparkContexts, which means that applications are isolated from one another. This adds to overall robustness, but it also means that different SparkContexts cannot easily share data. In practice, this is rarely an issue.

The Executors acquired by an individual SparkContext are also isolated from each other. Each Executor runs in its own Java Virtual Machine (JVM) on a worker node. The Driver program sends the Executors the application code, in the form of Java JAR files or Python files, and then sends them the tasks to execute.
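Shipping application code to the cluster is typically done with `spark-submit`. The following is a hedged sketch: the file name `my_app.py`, the executor counts, and the memory sizes are placeholders for illustration, not recommended values.

```shell
# Submit a Python application to a YARN cluster; for a Java or Scala
# application you would pass a JAR file instead of a .py file.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --num-executors 4 \
  --executor-memory 2g \
  my_app.py
```

With `--deploy-mode cluster`, the Driver itself runs on a cluster node rather than on the submitting machine, which matters for the network-proximity point discussed next.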

The Driver program must be able to listen for and accept incoming communications from its Executors, and the Executors must be able to talk to each other. All components should be "near" each other on the network. For this reason, it is not a good idea to run a Driver program (a Spark instance) on your laptop connected directly to the cluster.

It is common and recommended practice to install Spark on the same cluster nodes as HDFS. This puts the processing close to the data and results in improved performance.

It is very important to understand what portion of your application runs on the Driver and what portion runs across the worker nodes. The Driver is located on a single node and is limited to the CPU and memory available on that node. For this reason, the data, memory, and compute requirements of any operation performed on the Driver must fit within those limits, or the application will fail. What runs fine across the cluster, which has a much larger pool of distributed resources, will crash and burn if unintentionally pulled back to the driver node.
