Why Spark?

Spark is a lightning-fast cluster computing framework designed for fast computation. It extends the Hadoop MapReduce model to support more forms and types of computation, such as interactive queries and stream processing. One of Spark's main features is in-memory processing, which increases the performance and processing speed of an application. Spark supports a wide range of applications and workloads, such as the following:

  • Batch-based applications
  • Iterative algorithms that were previously impractical to run quickly (see the caching sketch after this list)
  • Interactive query and streaming
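
The speed-up for iterative and interactive workloads is easiest to see in code. The following Scala sketch is purely illustrative (the object name IterativeSketch and the toy data are invented, not taken from this chapter): it caches an RDD in memory and makes several passes over it, which is exactly the access pattern of an iterative algorithm:

    import org.apache.spark.sql.SparkSession

    object IterativeSketch {
      def main(args: Array[String]): Unit = {
        // Local session for experimentation; on a cluster you would use a real master URL.
        val spark = SparkSession.builder()
          .appName("IterativeSketch")
          .master("local[*]")
          .getOrCreate()

        // A toy dataset; a real job would load data from HDFS, S3, and so on.
        val numbers = spark.sparkContext.parallelize(1 to 1000000)

        // cache() keeps the partitions in memory, so repeated passes over the same
        // data avoid recomputation; this is what makes iterative algorithms fast.
        numbers.cache()

        // Several passes over the cached data, standing in for an iterative algorithm.
        for (i <- 1 to 5) {
          val sum = numbers.map(_.toLong * i).reduce(_ + _)
          println(s"Pass $i: sum = $sum")
        }

        spark.stop()
      }
    }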

Also, learning Spark and using it in your applications does not take much time, nor does it require you to understand the inner details of concurrency and distributed systems. Spark was created in 2009 at UC Berkeley's AMPLab and was open sourced in 2010. It was donated to the Apache Software Foundation in 2013 and has since become one of the most widely used Apache projects. Apache Spark became popular because of the following features:

  • Fast computation: Spark helps you run applications faster than Hadoop MapReduce because of its golden feature, in-memory processing.
  • Support for multiple programming languages: Apache Spark provides built-in APIs in several languages, including Scala, Java, Python, and R.
  • More analytics: In addition to map and reduce operations, Spark supports more advanced analytics such as machine learning (MLlib), data streaming, and graph processing (a short sketch follows this list).
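
As a taste of the higher-level APIs that sit alongside map and reduce, here is a hedged Scala sketch (the names AnalyticsSketch and Reading are invented for illustration) that registers a small Dataset as a view and queries it with Spark SQL; MLlib, streaming, and graph processing are reached through the same SparkSession entry point:

    import org.apache.spark.sql.SparkSession

    // Top-level case class so that Spark can derive an encoder for it.
    case class Reading(sensor: String, value: Double)

    object AnalyticsSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("AnalyticsSketch")
          .master("local[*]")
          .getOrCreate()
        import spark.implicits._

        // A small in-memory Dataset; real pipelines would read from files or streams.
        val readings = Seq(
          Reading("a", 1.0), Reading("a", 3.0), Reading("b", 2.0)
        ).toDS()

        // Register the Dataset as a temporary view and query it with SQL.
        readings.createOrReplaceTempView("readings")
        spark.sql("SELECT sensor, AVG(value) AS avg_value FROM readings GROUP BY sensor")
          .show()

        spark.stop()
      }
    }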

As mentioned earlier, Spark can be deployed on top of the Hadoop stack in several different ways:

  • Standalone cluster: Spark runs on top of the Hadoop Distributed File System (HDFS), with storage space allocated on HDFS. Spark and MapReduce run side by side to serve all Spark jobs.
  • Hadoop YARN cluster: Spark simply runs on YARN without requiring root privileges or any pre-installation.
  • Mesos cluster: When the driver program creates a Spark job and starts assigning tasks for scheduling, Mesos determines which compute nodes handle which tasks. This assumes that Mesos is already installed and configured on your machines.
  • Pay-as-you-go cluster: You can deploy Spark jobs in real cluster mode on AWS EC2. To run your applications in cluster mode with better scalability, you can use Amazon Elastic Compute Cloud (EC2) as Infrastructure as a Service (IaaS) or Platform as a Service (PaaS).
Refer to Chapter 17, Time to Go to ClusterLand - Deploying Spark on a Cluster, and Chapter 18, Testing and Debugging Spark, for how to deploy your data analytics application using Scala and Spark on a real cluster.
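
The deployment mode is ultimately just the master URL handed to Spark. The following Scala sketch is an assumption-laden illustration rather than a full deployment recipe; the host names are placeholders, and in practice you would usually pass the master through spark-submit:

    import org.apache.spark.sql.SparkSession

    object DeployModeSketch {
      def main(args: Array[String]): Unit = {
        // The master URL selects the deployment mode. Typical values are:
        //   local[*]           - run locally on all cores (development)
        //   spark://host:7077  - a Spark standalone cluster
        //   yarn               - a Hadoop YARN cluster (HADOOP_CONF_DIR must point to the cluster config)
        //   mesos://host:5050  - a Mesos cluster
        // "host" is a placeholder; the master is usually supplied with
        // spark-submit --master rather than hard-coded in the application.
        val master = sys.props.getOrElse("spark.master", "local[*]")

        val spark = SparkSession.builder()
          .appName("DeployModeSketch")
          .master(master)
          .getOrCreate()

        println(s"Running against master: ${spark.sparkContext.master}")
        spark.stop()
      }
    }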