In the previous chapter, we saw the main functionalities of data processing with Spark. In this chapter, we will focus on data science with Spark, applied to a real data problem. You will learn about the following topics:
How to share variables across a cluster's nodes
How to create DataFrames from structured (CSV) and semi-structured (JSON) files, save them on disk, and load them
How to use SQL-like syntax to select, filter, join, group, and aggregate datasets, making preprocessing straightforward
How to handle missing data in the dataset
Which algorithms are available out of the box in Spark for feature engineering and how to use them in a real case scenario
Which learners are available and how to measure their performance in a distributed environment
How to run cross-validation for hyperparameter optimization in a cluster
Setting up the VM for this chapter
As machine learning needs a lot of computational power, in order to save some resources (especially memory), in this chapter we will use a Spark environment that is not backed by YARN. This mode of operation is named standalone and creates a Spark node without cluster functionality; all the processing runs on the driver machine and is not distributed. Don't worry; the code we will see in this chapter works in a cluster environment as well.
In order to operate this way, perform the following steps:
Turn on the virtual machine using the vagrant up command.
Access the virtual machine when it's ready, with vagrant ssh.
Launch Spark in standalone mode with the IPython Notebook from inside the virtual machine with ./start_jupyter.sh.
Open a browser pointing to http://localhost:8888.
To shut everything down, press Ctrl + C to exit the IPython Notebook and run vagrant halt to turn off the virtual machine.
Note
Note that, even in this configuration, you can access the Spark UI (when at least one IPython Notebook is running) at the following URL: