Chapter 9. Practical Machine Learning with Spark

In the previous chapter, we covered the main data processing features of Spark. In this chapter, we will focus on data science with Spark, working on a real data problem. You will learn the following topics:

  • How to share variables across a cluster's nodes
  • How to create DataFrames from structured (CSV) and semi-structured (JSON) files, save them on disk, and load them
  • How to use SQL-like syntax to select, filter, join, group, and aggregate datasets, which greatly simplifies preprocessing
  • How to handle missing data in the dataset
  • Which algorithms are available out of the box in Spark for feature engineering and how to use them in a real-world scenario
  • Which learners are available and how to measure their performance in a distributed environment
  • How to run cross-validation for hyperparameter optimization in a cluster

Setting up the VM for this chapter

Machine learning needs a lot of computational power, so to save resources (especially memory), in this chapter we will use a Spark environment that is not backed by YARN. This mode of operation is named standalone and creates a single Spark node without cluster functionality; all the processing happens on the driver machine and is not distributed. Don't worry; the code that we will see in this chapter will work in a cluster environment as well.

In order to operate this way, perform the following steps:

  1. Turn on the virtual machine using the vagrant up command.
  2. Access the virtual machine when it's ready, with vagrant ssh.
  3. Launch Spark standalone mode with the IPython Notebook from inside the virtual machine with ./start_jupyter.sh.
  4. Open a browser pointing to http://localhost:8888.

To shut everything down, press Ctrl + C to exit the IPython Notebook, and then run vagrant halt to turn off the virtual machine.

Note

Note that, even in this configuration, you can access the Spark UI (when at least one IPython Notebook is running) at the following URL:

http://localhost:4040
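
If the browser cannot reach either address, a quick connectivity check can help tell a port problem from a Spark problem. The following is a minimal sketch, not part of the book's setup scripts: the helper names port_open and report are hypothetical, and it assumes that the virtual machine forwards ports 8888 (notebook) and 4040 (Spark UI) to localhost, as the URLs in this section suggest. It uses only the Python standard library.

```python
import socket


def port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


# Ports taken from the URLs in this section (assumed to be forwarded by the VM).
SERVICES = {"IPython Notebook": 8888, "Spark UI": 4040}


def report(host="localhost"):
    """Map each service name to whether its port accepts connections."""
    return {name: port_open(host, port) for name, port in SERVICES.items()}


if __name__ == "__main__":
    for name, up in report().items():
        print(f"{name}: {'reachable' if up else 'not reachable'}")
```

Keep in mind that the Spark UI port is expected to accept connections only while at least one notebook is running, so a "not reachable" result for it does not necessarily indicate a broken setup.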
