Chapter 1. Spark for Machine Learning

This chapter provides an introduction to Apache Spark from a Machine Learning (ML) and data analytics perspective, and also discusses machine learning in relation to Spark computing. Here, we first present an overview of Apache Spark, as well as Spark's advantages for data analytics, in comparison to MapReduce and other computing platforms. Then we discuss five main issues, as below:

  • Machine learning algorithms and libraries
  • Spark RDD and dataframes
  • Machine learning frameworks
  • Spark pipelines
  • Spark notebooks

All of the above are the most important topics that any data scientist or machine learning professional is expected to master, in order to fully take advantage of Apache Spark computing. Specifically, this chapter will cover all of the following six topics.

  • Spark overview and Spark advantages
  • ML algorithms and ML libraries for Spark
  • Spark RDD and dataframes
  • ML Frameworks, RM4Es and Spark computing
  • ML workflows and Spark pipelines
  • Spark notebooks introduction

Spark overview and Spark advantages

In this section, we provide an overview of the Apache Spark computing platform and a discussion about some advantages of utilizing Apache Spark, in comparison to using other computing platforms like MapReduce. Then, we briefly discuss how Spark computing fits modern machine learning and big data analytics.

After this section, readers will form a basic understanding of Apache Spark as well as a good understanding of some important machine learning benefits from utilizing Apache Spark.

Spark overview

Apache Spark is a computing framework for the fast processing of big data. This framework contains a distributed computing engine and a specially designed programming model. Spark was started as a research project at the AMPLab of the University of California at Berkeley in 2009, and then in 2010 it became fully open sourced as it was donated to the Apache Software Foundation. Since then, Apache Spark has experienced exponential growth, and now Spark is the most active open source project in the big data field.

Spark's computing utilizes an in-memory distributed computational approach, which makes Spark computing among the fastest, especially for iterative computation. It can run up to 100 times faster than Hadoop MapReduce, according to many tests that have been performed.

Apache Spark has a unified platform, which consists of the Spark core engine and four libraries: Spark SQL, Spark Streaming, MLlib, and GraphX. All of these four libraries have Python, Java and Scala programming APIs.

Besides the above mentioned four built-in libraries, there are also tens of packages available for Apache Spark, provided by third parties, which can be used for handling data sources, machine learning, and other tasks.

Spark overview

Apache Spark has a 3 month circle for new releases, with Spark version 1.6.0 released on January 4 of 2016. Apache Spark release 1.3 had DataFrames API and ML Pipelines API included. Starting from Apache Spark release 1.4, the R interface (SparkR) is included as default.

Note

To download Apache Spark, readers should go to http://spark.apache.org/downloads.html.

To install Apache Spark and start running it, readers should consult its latest documentation at http://spark.apache.org/docs/latest/.

Spark advantages

Apache Spark has many advantages over MapReduce and other big data computing platforms. Among them, the distinguished two are that it is fast to run and fast to write.

Overall, Apache Spark has kept some of MapReduce's most important advantages like that of scalability and fault tolerance, but extended them greatly with new technologies.

In comparison to MapReduce, Apache Spark's engine is capable of executing a more general Directed Acyclic Graph (DAG) of operators. Therefore, when using Apache Spark to execute MapReduce-style graphs, users can achieve higher performance batch processing in Hadoop.

Apache Spark has in-memory processing capabilities, and uses a new data abstraction method, Resilient Distributed Dataset (RDD), which enables highly iterative computing and reactive applications. This also extended its fault tolerance capability.

At the same time, Apache Spark has made complex pipeline representation easy with only a few lines of code needed. It is best known for the ease with which it can be used to create algorithms that capture insight from complex and even messy data, and also enable users to apply that insight in-time to drive outcomes.

As summarized by the Apache Spark team, Spark enables:

  • Iterative algorithms in Machine Learning
  • Interactive data mining and data processing
  • Hive-compatible data warehousing that can run 100x faster
  • Stream processing
  • Sensor data processing

To a practical data scientist working with the above, Apache Spark easily demonstrates its advantages when it is adopted for:

  • Parallel computing
  • Interactive analytics
  • Complex computation

Most users are satisfied with Apache Spark's advantages in speed and performance, but some also noted that Apache Spark is still in the process of maturing.

Note

http://www.svds.com/use-cases-for-apache-spark/ has some examples of materialized Spark benefits.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.143.205.27