Resilient distributed datasets

Resilient distributed datasets, more commonly known as RDDs, are the primary data structure used in Spark. An RDD is essentially a collection of records stored across the nodes of a Spark cluster in a distributed manner. RDDs are immutable: once created, they cannot be altered. Because an RDD's records are spread across nodes, they can be accessed in parallel, so RDDs support parallel operations natively.
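The two properties described above, partitioning and immutability, can be illustrated with a minimal sketch in plain Python. This is not Spark code: the helper names `partition` and `parallel_map` are hypothetical, tuples stand in for immutable partitions, and threads stand in for cluster nodes, so the sketch shows the structure of the idea rather than any real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split records round-robin into immutable partitions (tuples).

    Hypothetical helper: each tuple plays the role of the slice of an
    RDD that would live on one node of the cluster.
    """
    return tuple(tuple(records[i::num_partitions]) for i in range(num_partitions))


def parallel_map(partitions, fn):
    """Apply fn to every record, one worker per partition.

    The input partitions are never modified; a new set of partitions
    is built and returned instead.
    """
    def work(part):
        return tuple(fn(x) for x in part)

    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return tuple(pool.map(work, partitions))


records = list(range(1, 9))
parts = partition(records, 3)
squared = parallel_map(parts, lambda x: x * x)

print(parts)    # the original partitions are unchanged
print(squared)  # a new collection holding the squared records
```

Because the partitions are tuples, any attempt to assign into them raises a `TypeError`, which mirrors the rule that an RDD cannot be altered once created.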

The user does not need to write separate code to get the benefits of parallelization; running the transformations and actions that are native to the Spark platform is enough. As an additional benefit, because RDDs can also be stored in memory, these parallel operations can act on the data directly in memory without incurring expensive I/O access penalties.
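The distinction between transformations and actions can be sketched with a toy class. `ToyRDD` is a hypothetical stand-in written in plain Python, not Spark's API: it keeps its partitions in memory, its transformations return a new `ToyRDD`, and its actions return an ordinary Python value. One simplification to keep in mind is that this toy evaluates each transformation immediately, whereas Spark evaluates transformations lazily.

```python
from functools import reduce as fold


class ToyRDD:
    """Hypothetical in-memory stand-in for an RDD (not Spark's API)."""

    def __init__(self, partitions):
        # Freeze the data as tuples so it cannot be altered later.
        self._partitions = tuple(tuple(p) for p in partitions)

    # Transformations: each returns a new ToyRDD and leaves self untouched.
    def map(self, fn):
        return ToyRDD(tuple(fn(x) for x in p) for p in self._partitions)

    def filter(self, pred):
        return ToyRDD(tuple(x for x in p if pred(x)) for p in self._partitions)

    # Actions: each returns a plain Python value computed from the data.
    def collect(self):
        return [x for p in self._partitions for x in p]

    def reduce(self, fn):
        return fold(fn, self.collect())


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])

# Chain two transformations, then trigger an action on the result.
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
total = evens_squared.reduce(lambda a, b: a + b)

print(evens_squared.collect())
print(total)
```

In PySpark, the corresponding calls are `SparkContext.parallelize` to create an RDD, `RDD.map` and `RDD.filter` as transformations, `RDD.collect` and `RDD.reduce` as actions, and `RDD.cache` to ask Spark to keep the data in memory.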
