Spark core

Spark core is the underlying general execution engine for the Spark platform that all other functionality is built upon. It contains the basic Spark functionality required for running jobs and needed by the other components. It provides in-memory computing and the ability to reference datasets in external storage systems; its most important abstraction is the Resilient Distributed Dataset (RDD).
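As a minimal sketch of the RDD abstraction (our own illustrative example, not taken from the book; the application name and data are hypothetical), Spark core lets you parallelize a local collection into an RDD, cache it in memory, and run transformations and actions on it:

import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    // Local SparkContext for illustration; a real job would run on a cluster
    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // Parallelize a local collection into an RDD and keep it in memory
    val numbers = sc.parallelize(1 to 1000000).cache()

    // Transformations are lazy; this action triggers the in-memory computation
    val sumOfSquares = numbers.map(n => n.toLong * n).reduce(_ + _)
    println(s"Sum of squares: $sumOfSquares")

    sc.stop()
  }
}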

In addition, Spark core contains logic for accessing various storage systems, such as HDFS, Amazon S3, HBase, Cassandra, relational databases, and so on. Spark core also provides the fundamental functions for networking, security, scheduling, and data shuffling needed to build a highly scalable, fault-tolerant platform for distributed computing.
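To make the external-storage point concrete, here is a hedged sketch of reading data from HDFS and Amazon S3 through the same RDD API; it assumes an existing SparkContext sc (as in the previous sketch) and the appropriate Hadoop/S3 connectors on the classpath, and the paths and bucket names are placeholders we invented for illustration:

// HDFS path (hypothetical namenode and file)
val fromHdfs = sc.textFile("hdfs://namenode:8020/data/events.log")

// Amazon S3 path (hypothetical bucket; requires the s3a connector and credentials)
val fromS3 = sc.textFile("s3a://my-bucket/data/events.log")

// The same RDD operations apply regardless of the underlying storage
val errorCount = fromHdfs.filter(_.contains("ERROR")).count()

The point of the uniform API is that the storage backend is selected by the URI scheme alone; the transformations and actions on the resulting RDD are identical in every case.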

We cover Spark core in detail in Chapter 6, Start Working with Spark - REPL and RDDs, and in Chapter 7, Special RDD Operations.

DataFrames and Datasets, which are built on top of RDDs and were introduced with Spark SQL, are now becoming the norm over RDDs in many use cases. RDDs are still more flexible for handling totally unstructured data, but in the future the Dataset API may eventually become the core API.
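A brief, hypothetical comparison of the two APIs (our own example under assumed names such as Person, not the author's code): the same records parsed line by line as an untyped RDD, then converted to a typed Dataset that Spark SQL can optimize:

import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

object ApiComparison {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("api-comparison")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // RDD: flexible, line-by-line parsing of semi-structured text
    val rdd = spark.sparkContext
      .parallelize(Seq("alice,34", "bob,29"))
      .map { line =>
        val Array(name, age) = line.split(",")
        Person(name, age.toInt)
      }

    // Dataset: the same records with a typed schema, amenable to SQL optimization
    val ds = rdd.toDS()
    ds.filter(_.age > 30).show()

    spark.stop()
  }
}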
