What is Apache Spark?

Good question, glad you asked. Spark was built for distributed cluster computing, so the same code scales from a single machine to a large cluster without changes. The word general in its description as a general engine is apt: it refers to the many and varied ways you can use it.

You can use it for ETL data processing, machine learning modeling, graph processing, stream data processing, and SQL and structured data processing (a short example of the last follows below). It is a boon for analytics in a distributed computing world.
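As a rough sketch of the SQL and structured-data side, the snippet below reads a hypothetical sales.csv file (the file name and its region and amount columns are illustrative, not from this book) into a DataFrame and queries it with ordinary SQL. The same engine underneath also powers the ETL, machine learning, graph, and streaming workloads mentioned above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-example").getOrCreate()

# Read a CSV file into a DataFrame, inferring the schema from the data.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view so it can be queried with SQL.
sales.createOrReplaceTempView("sales")

# Run an ordinary SQL query against the view and print the result.
totals = spark.sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
totals.show()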

It has APIs for multiple programming languages such as Java, Scala, Python, and R. It operates mostly in memory, which is where the speed improvement over Hadoop MapReduce mainly comes from. For analytics, Python and R are the popular programming languages. When interacting with Spark, you will probably be using Python, as it is better supported. The Python API for Spark is called Pyspark.
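Here is a minimal PySpark sketch illustrating the in-memory point above. The cache() call asks Spark to keep the DataFrame in memory so later actions reuse it instead of recomputing it from the source, which is the behavior behind the speed advantage over MapReduce. The data and column names are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-example").getOrCreate()

# Build a small DataFrame from local data (illustrative values).
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"])

# Mark the DataFrame for in-memory caching; once materialized, it stays
# in RAM across subsequent actions rather than being recomputed.
df.cache()

# A simple transformation plus an action that triggers computation.
df.filter(df.age > 30).show()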

The descriptions and architecture discussed in this chapter are for Spark 2.1.0, the latest version at the time of writing.
