In the previous chapter, we saw the main functionalities of data processing with Spark. In this chapter, we will focus on data science with Spark, applied to a real data problem. You will learn about the following topics:
How to share variables across a cluster's nodes
How to create DataFrames from structured (CSV) and semi-structured (JSON) files, save them on disk, and load them
How to use SQL-like syntax to select, filter, join, group, and aggregate datasets, making preprocessing straightforward
How to handle missing data in the dataset
Which algorithms are available out of the box in Spark for feature engineering and how to use them in a real case scenario
Which learners are available and how to measure their performance in a distributed environment
How to run cross-validation for hyperparameter optimization in a cluster
Setting up the VM for this chapter
As machine learning needs a lot of computational power, in order to save some resources (especially memory), in this chapter we will use a Spark environment that is not backed by YARN. This mode of operation is named standalone and creates a Spark node without cluster functionality; all the processing runs on the driver machine and is not distributed. Don't worry; the code we will see in this chapter works in a cluster environment as well.
In order to operate this way, perform the following steps:
Turn on the virtual machine using the vagrant up command.
Access the virtual machine when it's ready, with vagrant ssh.
Launch Spark in standalone mode with the IPython Notebook from inside the virtual machine with ./start_jupyter.sh.
Open a browser pointing to http://localhost:8888.
To shut everything down, press Ctrl + C to exit the IPython Notebook and run vagrant halt to turn off the virtual machine.
Note
Note that, even in this configuration, you can access the Spark UI (when at least one IPython Notebook is running) at the following URL: