What is Apache Spark?

Good question, glad you asked. Spark was built for distributed cluster computing, so the same code scales from a single machine to a large cluster without changes. The word general in its description as a general engine is apt: it refers to the many and varied ways you can use it.

You can use it for ETL data processing, machine learning modeling, graph processing, stream data processing, and SQL and structured data processing (a short example of the last follows below). It is a boon for analytics in a distributed computing world.
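As a rough sketch of the SQL and structured-data side, the snippet below reads a hypothetical sales.csv file (the file name and its region and amount columns are illustrative, not from this book) into a DataFrame and queries it with ordinary SQL. The same engine underneath also powers the ETL, machine learning, graph, and streaming workloads mentioned above.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-example").getOrCreate()

# Read a CSV file into a DataFrame, inferring the schema from the data.
sales = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Register the DataFrame as a temporary view so it can be queried with SQL.
sales.createOrReplaceTempView("sales")

# Run an ordinary SQL query against the view and print the result.
totals = spark.sql(
    "SELECT region, SUM(amount) AS total FROM sales GROUP BY region")
totals.show()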

It has APIs for multiple programming languages such as Java, Scala, Python, and R. It operates mostly in memory, which is where the speed improvement over Hadoop MapReduce mainly comes from. For analytics, Python and R are the popular programming languages. When interacting with Spark, you will probably be using Python, as it is better supported. The Python API for Spark is called Pyspark.
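Here is a minimal PySpark sketch illustrating the in-memory point above. The cache() call asks Spark to keep the DataFrame in memory so later actions reuse it instead of recomputing it from the source, which is the behavior behind the speed advantage over MapReduce. The data and column names are made up for illustration.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pyspark-example").getOrCreate()

# Build a small DataFrame from local data (illustrative values).
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"])

# Mark the DataFrame for in-memory caching; once materialized, it stays
# in RAM across subsequent actions rather than being recomputed.
df.cache()

# A simple transformation plus an action that triggers computation.
df.filter(df.age > 30).show()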

The descriptions and architecture discussed in this chapter are for Spark 2.1.0, the latest version at the time of writing.
