Practical example – divide-and-conquer applied to Apache Spark

Apache Spark is an open source framework used to solve complex distributed problems. It implements a divide-and-conquer strategy: to process a problem, it divides it into subproblems and processes them independently of each other. We will demonstrate this with a simple example of counting words in a list.

Let's assume that we have the following list of words:

wordsList = ['python', 'java', 'ottawa', 'news', 'java', 'ottawa']

We want to calculate the frequency of each word in this list. For that, we will apply the divide-and-conquer strategy to solve this problem in an efficient way.

The implementation of divide-and-conquer is shown in the following diagram:

The preceding diagram shows the following phases into which a problem is divided:

  1. Splitting: The input data is divided into partitions that can be processed independently of each other. This is called splitting. We have three splits in the preceding figure.
  2. Mapping: Any operation that can run independently on a split is called a map. In the preceding diagram, the map operation converts each word in a partition into a key-value pair. Corresponding to the three splits, there are three mappers that run in parallel.
  3. Shuffling: Shuffling is the process of bringing similar keys together. Once the similar keys are brought together, aggregation functions can be run on their values. Note that shuffling is a performance-intensive operation, as the similar keys may originally be distributed across the network and need to be brought together.
  4. Reducing: Running an aggregation function on the values of similar keys is called reducing. In the preceding diagram, the reduce operation counts the occurrences of each word.
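
As a rough sketch, these four phases can also be expressed in plain Python, without Spark; the partitioning and variable names below are purely illustrative:

wordsList = ['python', 'java', 'ottawa', 'news', 'java', 'ottawa']
# Splitting: divide the input into partitions that can be processed independently
splits = [wordsList[0:2], wordsList[2:4], wordsList[4:6]]
# Mapping: convert every word in each split into a (word, 1) key-value pair
mapped = [[(word, 1) for word in split] for split in splits]
# Shuffling: bring the pairs with the same key together
shuffled = {}
for split in mapped:
    for key, value in split:
        shuffled.setdefault(key, []).append(value)
# Reducing: run an aggregation function (a sum) on the values of each key
wordCounts = {key: sum(values) for key, values in shuffled.items()}
print(wordCounts)   # {'python': 1, 'java': 2, 'ottawa': 2, 'news': 1}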

Let's see how we can write the code to implement this. To demonstrate the divide-and-conquer strategy, we need a distributed computing framework. We will use Python running on Apache Spark for this:

  1. First, in order to use Apache Spark, we will create its runtime context:
# Locate the local Spark installation and make pyspark importable
import findspark
findspark.init()
# Create (or reuse) a local Spark session and get its SparkContext
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
sc = spark.sparkContext
  2. Now, let's create a sample list containing some words. We will convert this list into Spark's native distributed data structure, called a Resilient Distributed Dataset (RDD):
wordsList = ['python', 'java', 'ottawa', 'ottawa', 'java', 'news']
# Distribute the list across four partitions as an RDD
wordsRDD = sc.parallelize(wordsList, 4)
# Collect the distributed data back to the driver and print it
print(wordsRDD.collect())
  3. Now, let's use a map function to convert the words into key-value pairs.
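A minimal sketch of this step, assuming each word is paired with a count of 1 and using wordsPairRDD as an illustrative variable name, could look as follows:
# Map every word to a (word, 1) key-value pair
wordsPairRDD = wordsRDD.map(lambda w: (w, 1))
print(wordsPairRDD.collect())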

  4. Let's use a reduce operation to aggregate the pairs and get the final result.
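Under the same assumptions, Spark's reduceByKey can sum the values of all pairs that share a key:
# Aggregate the counts of each distinct word
wordCountsRDD = wordsPairRDD.reduceByKey(lambda x, y: x + y)
print(wordCountsRDD.collect())
For the sample list above, this should print the pairs ('python', 1), ('java', 2), ('ottawa', 2), and ('news', 1), although their order may vary between runs.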

This shows how we can use the divide-and-conquer strategy to count the frequency of each word in a list.

Modern cloud computing infrastructures, such as Microsoft Azure, Amazon Web Services, and Google Cloud, achieve scalability by implementing a divide-and-conquer strategy either directly or indirectly behind the scenes.