Scala and the Spark programming model

Spark programming typically starts with one or more datasets, usually residing in some form of distributed, persistent storage such as HDFS. A typical RDD programming model provided by Spark can be described as follows:

  • Using the Spark context (the Spark shell provides you with one, or you can create your own; this will be described later in this chapter), create an initial data reference, an RDD object.
  • Transform the initial RDD to create more RDD objects, following the functional programming style (to be discussed later).
  • Send the code, algorithms, or applications from the driver program to the cluster manager. The cluster manager then provides a copy to each computing node.
  • Computing nodes hold references to the RDD partitions assigned to them (the driver program also holds a data reference). However, the computing nodes can also receive the input dataset from the cluster manager.
  • After a transformation (either narrow or wide), the result is a brand-new RDD, since the original one is never mutated.
  • Finally, the RDD object (or, more specifically, the data reference) is materialized through an action that dumps the RDD to storage.
  • The driver program can ask the computing nodes for a chunk of the results for analysis or visualization.
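The steps above can be sketched in Scala as follows. This is a minimal illustration, assuming Spark is on the classpath and run in local mode; the HDFS paths and object name are hypothetical:

```scala
// A minimal sketch of the RDD programming model described above.
// The input/output paths and application name are hypothetical.
import org.apache.spark.{SparkConf, SparkContext}

object RddModelSketch {
  def main(args: Array[String]): Unit = {
    // 1. Create a Spark context (the Spark shell provides one as `sc`).
    val conf = new SparkConf().setAppName("RddModelSketch").setMaster("local[*]")
    val sc   = new SparkContext(conf)

    // 2. Create an initial data reference (RDD) from persistent storage.
    val lines = sc.textFile("hdfs://namenode:9000/data/input.txt") // hypothetical path

    // 3. Transform the initial RDD into new RDDs in a functional style;
    //    each transformation yields a brand-new RDD, the original is not mutated.
    val words  = lines.flatMap(_.split("\\s+"))
    val counts = words.map(w => (w, 1)).reduceByKey(_ + _)

    // 4. Materialize through an action: dump the result RDD to storage...
    counts.saveAsTextFile("hdfs://namenode:9000/data/word-counts") // hypothetical path

    // 5. ...or pull a chunk of the results back to the driver for analysis.
    counts.take(10).foreach(println)

    sc.stop()
  }
}
```

Note that nothing is computed until the actions (`saveAsTextFile`, `take`) run; the transformations only build up the lineage of RDDs.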

Wait! So far we have moved smoothly. We assume you will ship your application code to the computing nodes in the cluster. Still, you will have to upload or send the input datasets to the cluster to be distributed among the computing nodes; even with a bulk upload, the data must be transferred across the network. We also assume that the size of the application code and the results is negligible. Another obstacle is that if you want Spark to process data at scale, data objects might first need to be merged from multiple partitions. This means data must be shuffled among the worker/computing nodes, which is usually triggered by transformation operations such as repartition(), intersection(), and join().
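The shuffle-inducing transformations just mentioned can be sketched as follows. This assumes an existing SparkContext `sc` (for example, the one provided by the Spark shell); the sample data is purely illustrative:

```scala
// Shuffle-inducing transformations, assuming a SparkContext `sc` is in scope
// (e.g. in the Spark shell). The pair RDDs below are illustrative only.
val left  = sc.parallelize(Seq(("a", 1), ("b", 2), ("c", 3)))
val right = sc.parallelize(Seq(("b", 20), ("c", 30), ("d", 40)))

// join() must co-locate equal keys from both RDDs, so records are
// shuffled across the computing nodes.
val joined = left.join(right)        // RDD[(String, (Int, Int))]

// repartition() explicitly redistributes the data into n partitions,
// moving records across the network.
val rebalanced = joined.repartition(4)

// intersection() also shuffles, since matching elements may live
// in different partitions on different nodes.
val common = left.keys.intersection(right.keys)
```

Because shuffles move data over the network, minimizing them is one of the main levers for Spark performance, as we will see later.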
