Lifting the hood

In the last section of this chapter, we will discuss, very briefly, how Spark works internally. For a more detailed discussion, see the References section at the end of the chapter.

When you open a Spark context, either explicitly or by launching the Spark shell, Spark starts a web UI with details of how the current and past tasks have executed. Let's see this in action for the example mutual information program we wrote in the last section. To prevent the context from shutting down when the program completes, you can insert a call to readLine as the last line of the main method (after the call to takeOrdered). This expects input from the user, and will therefore pause program execution until you press Enter.
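A minimal sketch of the end of the main method, using scala.io.StdIn.readLine (in older versions of Scala, a bare readLine also works):

// ... build the mutualInformation RDD and call takeOrdered ...

// Block until the user presses Enter, keeping the Spark context,
// and with it the web UI, alive.
scala.io.StdIn.readLine()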

To access the UI, point your browser to 127.0.0.1:4040. If you have other instances of the Spark shell running, Spark will have bound the UI to the next free port: 4041, 4042, and so on.
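If you would rather pin the UI to a fixed port, you can set the spark.ui.port configuration property when constructing the context yourself. A sketch (the application name here is illustrative):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("MutualInformation")
  .set("spark.ui.port", "4050") // bind the web UI to port 4050

val sc = new SparkContext(conf)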

[Figure: the first page of the Spark web UI, listing the application's jobs]

The first page of the UI tells us that our application contains three jobs. A job occurs as the result of an action. There are, indeed, three actions in our application. The first two are called within the wordFractionInFiles function:

val nMessages = messages.count()

The last job results from the call to takeOrdered, which forces the execution of the entire pipeline of RDD transformations that calculate the mutual information.
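The distinction bears repeating: transformations are lazy and merely extend an RDD's lineage, while actions force computation and therefore create jobs. A short illustrative sketch (the variable names here are not from our program):

// Transformations are lazy: these lines only extend the RDD lineage;
// no data is read or computed yet.
val lowerCase = messages.map { _.toLowerCase }
val urgent = lowerCase.filter { _.contains("urgent") }

// An action submits a job: only now does Spark read the input and
// run the transformations.
val nUrgent = urgent.count()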

The web UI lets us delve deeper into each job. Click on the takeOrdered job in the job table. You will be taken to a page that describes the job in more detail:

[Figure: the detail page for the takeOrdered job]

Of particular interest is the DAG visualization entry. This is a graph of the execution plan to fulfill the action, and provides a glimpse of the inner workings of Spark.

When you define a job by calling an action on an RDD, Spark looks at the RDD's lineage and constructs a graph mapping the dependencies: each RDD in the lineage is represented by a node, with directed edges going from each of the RDD's parents to the RDD itself. This type of graph is called a directed acyclic graph (DAG), a data structure useful for dependency resolution. Let's explore the DAG for the takeOrdered job in our program using the web UI. The graph is quite complex, and it is therefore easy to get lost, so here is a simplified reproduction that only lists the RDDs bound to variable names in the program:

[Figure: simplified DAG for the takeOrdered job, showing only the RDDs bound to variable names in the program]

As you can see, at the bottom of the graph, we have the mutualInformation RDD. This is the RDD that we need to construct for our action. This RDD depends on the intermediate elements in the sum, igFragment1, igFragment2, and so on. We can work our way back through the list of dependencies until we reach the other end of the graph: RDDs that do not depend on other RDDs, only on external sources.
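Besides the web UI, you can inspect an RDD's lineage textually from the driver: every RDD exposes a toDebugString method that prints its dependency graph. For instance:

// Print the lineage of the RDD on which we call the action;
// indentation in the output groups the RDDs by stage.
println(mutualInformation.toDebugString)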

Once the graph is built, the Spark engine formulates a plan to execute the job. The plan starts with the RDDs that only have external dependencies (such as RDDs built by loading files from disk or fetching from a database) or RDDs that already have cached data. Each arrow along the graph is translated to a set of tasks, with each task applying a transformation to a partition of the data.
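Caching is one way to shorten this plan: if an RDD's partitions are already in memory, Spark can start from them instead of recomputing their ancestors. A sketch, assuming messages is one of the RDDs read from disk:

// Mark messages for caching: the first action to touch it computes
// and stores its partitions; later actions reuse them.
messages.persist()
val n1 = messages.count() // computes messages and caches its partitions
val n2 = messages.count() // served from the cached partitions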

Tasks are grouped into stages. A stage consists of a set of tasks that can all be performed without needing an intermediate shuffle.
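A shuffle is the redistribution of data across partitions, needed whenever a transformation requires records from more than one input partition. Element-wise transformations such as map and flatMap have narrow dependencies and can be pipelined into a single stage, whereas key-based aggregations such as reduceByKey force a shuffle and therefore start a new stage. An illustrative sketch (not from our program):

// Narrow dependencies: each output partition depends on a single
// input partition, so these transformations pipeline into one stage.
val words = messages.flatMap { _.split(" ") }
val pairs = words.map { (_, 1) }

// reduceByKey must gather all the values for each key, which requires
// a shuffle: a new stage starts here.
val wordCounts = pairs.reduceByKey { _ + _ }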
