Working with DataFrames and RDDs

A Spark DataFrame is a distributed collection of rows organized under named columns. Less technically, it can be thought of as a table in a relational database with column headers. A PySpark DataFrame is also similar to a pandas DataFrame in Python. In addition, it shares several characteristics with the RDD:

  • Immutable: Just like an RDD, once a DataFrame is created, it cannot be changed; transformations produce a new DataFrame. We can also convert a DataFrame to an RDD and vice versa.
  • Lazily evaluated: Transformations are not executed until an action is performed, as illustrated in the sketch after this list.
  • Distributed: Both the RDD and the DataFrame are distributed in nature.
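
To make these properties concrete, here is a minimal sketch; the application name and the sample data are illustrative assumptions, not from the text:

    from pyspark.sql import SparkSession

    # Create (or reuse) a local SparkSession; the app name is arbitrary.
    spark = SparkSession.builder.appName("df-rdd-demo").getOrCreate()

    # createDataFrame builds an immutable DataFrame; filter() is lazy and
    # returns a new DataFrame rather than modifying the original.
    df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "letter"])
    filtered = df.filter(df.id > 1)   # no job runs yet

    # An action such as count() finally triggers execution.
    print(filtered.count())           # 2

    # A DataFrame can be converted to an RDD of Row objects and back.
    rdd = filtered.rdd
    df_again = rdd.toDF()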

Just like DataFrames in Java/Scala, PySpark DataFrames are designed for processing large collections of structured data; they can scale to petabytes. The tabular structure makes a DataFrame's schema easy to understand, and the schema in turn lets Spark optimize the execution plans of SQL queries. In addition, DataFrames support a wide range of data formats and sources.
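
As a brief illustration of both points (the file name people.json is a hypothetical path, not from the text), a DataFrame read from a source exposes both its schema and its optimized plan:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("schema-demo").getOrCreate()

    # Spark infers a schema from the JSON source; CSV, Parquet, ORC, JDBC,
    # and many other formats are available through the same reader API.
    people = spark.read.json("people.json")   # hypothetical file
    people.printSchema()                      # show the inferred schema

    # The known schema allows Spark to build an optimized execution plan.
    people.select("name").explain()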

You can create RDDs, datasets, and DataFrames in a number of ways using PySpark. The following subsections show several examples; a quick sketch of one common pattern appears below.
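
This sketch previews one such pattern, building an RDD from a local collection and then a DataFrame from that RDD; the sample data and column names are made up for illustration:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("creation-demo").getOrCreate()
    sc = spark.sparkContext

    # An RDD from an in-memory Python collection.
    rdd = sc.parallelize([("Alice", 34), ("Bob", 45)])

    # A DataFrame from that RDD, with explicit column names.
    df = spark.createDataFrame(rdd, ["name", "age"])
    df.show()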
