Implementing data processing in Apache Spark

Let's see how we can create an RDD in Apache Spark and run distributed processing on it across the cluster: 

  1. For this, first, we need to create a new Spark session, as follows:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('cloudanum').getOrCreate()
  2. Once we have created a Spark session, we can use a CSV file as the source of the RDD. Running the following function will create an RDD that is abstracted as a DataFrame called df. The ability to abstract an RDD as a DataFrame was added in Spark 2.0, and it makes it easier to process the data:
df = spark.read.csv('taxi2.csv',inferSchema=True,header=True)

Let's look into the columns of the DataFrame:
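One way to do this, as a quick sketch, is to print the column list and the schema that Spark inferred from the CSV header (the actual column names will depend on the contents of taxi2.csv):
df.columns
df.printSchema()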

  3. Next, we can create a temporary table from the DataFrame, as follows:
df.createOrReplaceTempView("main")
  4. Once the temporary table is created, we can run SQL statements to process the data:
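As a minimal sketch, here are two queries against the main temporary table. The first makes no assumptions about the schema; the second assumes that taxi2.csv contains passenger_count and fare_amount columns, which may differ in your file:
spark.sql("SELECT COUNT(*) AS trips FROM main").show()
spark.sql("SELECT passenger_count, AVG(fare_amount) AS avg_fare FROM main GROUP BY passenger_count").show()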

An important point to note is that although it looks like a regular DataFrame, it is just a high-level data structure. Under the hood, it is the RDD that spreads the data across the cluster. Similarly, when we run SQL functions, under the hood they are converted into parallel transformers and reducers that fully use the power of the cluster to process the data.
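As a quick illustration (both calls below are standard PySpark APIs; the exact output depends on your data and cluster configuration), we can peek under the hood by checking how many partitions the underlying RDD is split into and by asking Spark to print the physical plan it generates for a SQL query:
df.rdd.getNumPartitions()  # number of partitions of the underlying RDD
spark.sql("SELECT COUNT(*) FROM main").explain()  # prints the distributed physical plan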
