Programming in PySpark

This section provides a quick introduction to programming with Python in Spark. We will start with the basic data structures in Spark.

A Resilient Distributed Dataset (RDD) is the primary data structure in Spark. It is a distributed collection of objects and has the following three main features:

  • Resilient: When any node fails, affected partitions will be reassigned to healthy nodes, which makes Spark fault-tolerant
  • Distributed: Data resides on one or more nodes in a cluster, which can be operated on in parallel
  • Dataset: This contains a collection of partitioned data with their values or metadata

The RDD was the main data structure in Spark before version 2.0. Since then, it has been superseded by the DataFrame, which is also a distributed collection of data but organized into named columns. The DataFrame utilizes the optimized execution engine of Spark SQL. Conceptually, it is similar to a table in a relational database or a DataFrame object in the Python pandas library.

Although the current version of Spark still supports the RDD API, programming with DataFrames is highly recommended. Hence, we won't spend too much time here on programming with RDDs. Refer to https://spark.apache.org/docs/latest/rdd-programming-guide.html if you are interested. We will go through the basics of programming with a DataFrame.
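
For reference, here is a minimal RDD sketch, assuming you are running it in a PySpark shell where the SparkContext sc is predefined (the sample numbers are purely illustrative):

>>> rdd = sc.parallelize([1, 2, 3, 4])
>>> rdd.map(lambda x: x * 2).collect()
[2, 4, 6, 8]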

The entry point to a Spark program is creating a Spark session, which can be done by using the following lines:

>>> from pyspark.sql import SparkSession
>>> spark = SparkSession \
...     .builder \
...     .appName("test") \
...     .getOrCreate()

Note that this is not needed if you run it in the PySpark shell: right after we spin up a PySpark shell, a Spark session is automatically created. We can check the running Spark application in the Spark UI at localhost:4040/jobs/.

With a Spark session spark, a DataFrame object can be created by reading a file (which is usually the case) or manual input. In the following example, we create a DataFrame object from a CSV file:

>>> df = spark.read.csv("examples/src/main/resources/people.csv",
...                     header=True, sep=';')

Columns in the CSV file people.csv are separated by ;.
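
Alternatively, a DataFrame object can be created from manual input using the createDataFrame method. The following sketch uses made-up rows that mirror people.csv (the variable name df_manual is just for illustration):

>>> df_manual = spark.createDataFrame(
...     [("Jorge", 30, "Developer"), ("Bob", 32, "Developer")],
...     ["name", "age", "job"])
>>> df_manual.show()
+-----+---+---------+
| name|age|      job|
+-----+---+---------+
|Jorge| 30|Developer|
|  Bob| 32|Developer|
+-----+---+---------+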

Once this is done, we will see a completed job at localhost:4040/jobs/.

We can display the content of the DataFrame object by using the following command:

>>> df.show()
+-----+---+---------+
| name|age|      job|
+-----+---+---------+
|Jorge| 30|Developer|
|  Bob| 32|Developer|
+-----+---+---------+

We can count the number of rows by using the following command:

>>> df.count()
2

The schema of the DataFrame object can be displayed using the following command:

>>> df.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: string (nullable = true)
 |-- job: string (nullable = true)
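
Note that the age column is read as a string by default. Two common ways to obtain a numeric column are sketched below, using the inferSchema option of the CSV reader and the cast method of a column (df_typed is just an illustrative name):

>>> # Option 1: let Spark infer column types while reading the CSV file
>>> df_typed = spark.read.csv("examples/src/main/resources/people.csv",
...                           header=True, sep=';', inferSchema=True)
>>> # Option 2: cast the existing string column to an integer
>>> df_typed = df.withColumn("age", df["age"].cast("int"))
>>> df_typed.printSchema()
root
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)
 |-- job: string (nullable = true)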

One or more columns can be selected as follows:

>>> df.select("name").show()
+-----+
| name|
+-----+
|Jorge|
|  Bob|
+-----+
>>> df.select(["name", "job"]).show()
+-----+---------+
| name|      job|
+-----+---------+
|Jorge|Developer|
|  Bob|Developer|
+-----+---------+
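
Column expressions can also be used in select, for example to derive a new value from an existing column. This is a minimal sketch based on the standard DataFrame column arithmetic (it assumes age has been converted to a numeric type, as in the earlier casting sketch):

>>> df.select(df['name'], df['age'] + 1).show()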

We can filter rows by condition, for instance, by the value of one column using the following command:

>>> df.filter(df['age'] > 31).show()
+----+---+---------+
|name|age|      job|
+----+---+---------+
| Bob| 32|Developer|
+----+---+---------+
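
Multiple conditions can be combined with the & (and) and | (or) operators; each condition needs to be wrapped in parentheses. A brief sketch on the same data:

>>> df.filter((df['age'] > 31) & (df['job'] == 'Developer')).show()
+----+---+---------+
|name|age|      job|
+----+---+---------+
| Bob| 32|Developer|
+----+---+---------+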

We will continue programming in PySpark in the next section, where we use Spark to solve the ad click-through problem.
