How to do it...

  1. Start a new project in IntelliJ or in an IDE of your choice. Make sure the necessary JAR files are included.
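If you are using sbt, a minimal build definition along the following lines pulls in the required JARs. This is a sketch only; the Spark and Breeze versions shown are assumptions, so adjust them to match your environment:

// build.sbt - illustrative dependency list; versions are assumptions
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core"  % "2.0.0",
  "org.apache.spark" %% "spark-sql"   % "2.0.0",
  "org.apache.spark" %% "spark-mllib" % "2.0.0",
  "org.scalanlp"     %% "breeze"      % "0.12"
)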
  1. Set up the package location where the program will reside:
package spark.ml.cookbook.chapter3 
  1. Import the necessary packages:
import breeze.numerics.pow 
import org.apache.spark.sql.SparkSession 
import Array._
  1. Import the packages needed to set the log4j logging level. This step is optional, but we highly recommend it (adjust the level appropriately as you move through the development cycle).
import org.apache.log4j.Logger 
import org.apache.log4j.Level 
  1. Set the logging level to ERROR to cut down on Spark's output. See the previous step for the package requirements.
Logger.getLogger("org").setLevel(Level.ERROR) 
Logger.getLogger("akka").setLevel(Level.ERROR) 
  1. Set up the SparkSession and application parameters so Spark can run.
val spark = SparkSession 
  .builder 
  .master("local[*]") 
  .appName("myRDD") 
  .config("Spark.sql.warehouse.dir", ".") 
  .getOrCreate()
  1. We obtain the data from Project Gutenberg. This is a great source for accessing actual text, ranging from the complete works of Shakespeare to Charles Dickens.
  1. Download the text from the following sources and store it in your local directory:
  1. Once again, we use the SparkContext, available via SparkSession, and its textFile() function to read the external data source and parallelize it across the cluster. Spark does all of this work for the developer behind the scenes: a single call can load a wide variety of sources (for example, local text files, S3, and HDFS), and the protocol:filepath combination in the path determines how the data is read and then partitioned across the cluster.
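The paths below are purely illustrative sketches of the protocol:filepath combination (the host, bucket, and file names are made up); they only show how the protocol prefix selects the storage backend:

// Illustrative only - the host, bucket, and file names below are made up
val localBook = spark.sparkContext.textFile("file:///home/user/data/a.txt")    // local filesystem
val hdfsBook  = spark.sparkContext.textFile("hdfs://namenode:8020/data/a.txt") // HDFS
val s3Book    = spark.sparkContext.textFile("s3a://my-bucket/data/a.txt")      // Amazon S3 via the S3A connector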
  1. To demonstrate, we load the book, which is stored as ASCII text, using the textFile() method from SparkContext via SparkSession, which, in turn, goes to work behind the scenes and creates partitioned RDDs across the cluster.
val book1 = spark.sparkContext.textFile("../data/sparkml2/chapter3/a.txt") 
println("Number of lines = " + book1.count())

The output will be as follows:

Number of lines = 16271
  1. Even though we have not covered the Spark transformation operators yet, we'll look at a small code snippet that breaks the file into words, using blanks as separators. In a real-life situation, a regular expression is needed to cover all the edge cases with the various whitespace variations (refer to the Transforming RDDs with Spark using filter() APIs recipe in this chapter); a sketch of such a variant appears after the output below.
    • We use a lambda function to receive each line as it is read and split it into words, using blanks as the separator.
    • We use flatMap() to flatten the result (each line produces its own array of words) so that we end up with a single list of words rather than a list of word lists, one per line.
val book2 = book1.flatMap(l => l.split(" ")) 
println("Number of words = " + book2.count())

The output will be as follows:

Number of words = 143228  
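As a minimal sketch of the regex-based variant mentioned earlier (not the recipe's final code; the pattern and the filter for empty tokens are our own additions), splitting on runs of whitespace avoids the empty strings that a plain blank split produces on consecutive spaces, tabs, or leading whitespace:

// Sketch only: split on one or more whitespace characters and drop empty tokens
val words = book1
  .flatMap(line => line.split("\\s+"))
  .filter(word => word.nonEmpty)
println("Number of non-empty words = " + words.count())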