RDD Creation

An RDD (Resilient Distributed Dataset) is the fundamental object in Apache Spark. RDDs are immutable, partitioned collections representing datasets, with built-in reliability and failure recovery. Because RDDs are immutable, every transformation produces a new RDD rather than modifying an existing one, while actions return a result to the driver. Each RDD also stores its lineage, the chain of transformations that produced it, which Spark uses to recompute lost partitions after a failure. We have also seen in the previous chapter some details about how RDDs can be created and what kinds of operations can be applied to them.

An RDD can be created in several ways:

  • Parallelizing a collection
  • Reading data from an external source
  • Transformation of an existing RDD
  • Streaming API