Chapter 7. Recommendations on Spark

In this chapter, we will shift our focus to running SPSS on Apache Spark, as SPSS is one of the most widely used tools for machine learning and data science computing.

Specifically, following a process similar to the one used in previous chapters, we will start by setting up SPSS on a Spark system for a recommendation project, together with a full description of this real-life project. Then, we will select machine learning methods and prepare the data. With SPSS Analytic Server, we will estimate models on Spark and then evaluate them, with a focus on error ratios. Finally, we will deploy the models for our client. Here are the topics that will be covered in this chapter:

  • Spark for a recommendation engine
  • Methods for recommendation development
  • Data treatment
  • Model estimation
  • Model evaluation
  • Recommendation deployment

Apache Spark for a recommendation engine

In this section, we will continue to demonstrate Spark's computation speed and ease of coding with a real-life movie recommendation project, this time completed with SPSS on Apache Spark.

SPSS is a widely used software package for statistical analysis. SPSS originally stood for Statistical Package for the Social Sciences, but it is also used by market researchers, health researchers, survey companies, governments, education researchers, marketing organizations, data miners, and others. Long produced by SPSS Inc., it was acquired by IBM in 2009. Since then, IBM has developed it further and turned it into a popular tool for data scientists and machine learning professionals. To make Spark available to SPSS users, IBM developed technologies that make SPSS-Spark integration easy, which will be covered in this chapter.

The use case

The goal of this project is to help the movie rental company ZHO improve its movie recommendations for its customers.

The main dataset contains tens of millions of ratings from more than 20,000 users on more than 10,000 movies.

Using this rich dataset, the client hopes to improve its recommendation engine so that the recommendations become more useful to its customers. At the same time, the company wishes to use Spark to update models quickly, and to exploit Spark's parallel computing to develop recommendations for various movie categories and for particular customer segments.

The company's analytical team has learned about using Spark MLlib for movie recommendation cases and is familiar with the related literature, such as the tutorial at http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html.
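As background, MLlib implements recommendation through collaborative filtering with ALS (alternating least squares), which learns low-rank user and movie factor matrices whose dot products approximate the observed ratings. The following is a minimal pure-Python sketch of that matrix factorization idea, using simple stochastic gradient descent instead of ALS; the tiny ratings dictionary and all parameter values are illustrative, not the client's data:

```python
import random

# Toy ratings: (user, movie) -> rating on a 1-5 scale (illustrative data only).
ratings = {(0, 0): 5.0, (0, 1): 3.0, (1, 0): 4.0,
           (1, 2): 1.0, (2, 1): 4.0, (2, 2): 2.0}
n_users, n_movies, rank = 3, 3, 2

random.seed(42)
# Latent factor matrices, initialized with small random values.
U = [[random.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_users)]
M = [[random.uniform(0.1, 0.5) for _ in range(rank)] for _ in range(n_movies)]

def predict(u, m):
    """Predicted rating is the dot product of user and movie factors."""
    return sum(U[u][k] * M[m][k] for k in range(rank))

def rmse():
    se = sum((r - predict(u, m)) ** 2 for (u, m), r in ratings.items())
    return (se / len(ratings)) ** 0.5

lr, reg = 0.05, 0.02  # learning rate and L2 regularization strength
before = rmse()
for _ in range(500):  # gradient descent over all observed ratings
    for (u, m), r in ratings.items():
        err = r - predict(u, m)
        for k in range(rank):
            uk, mk = U[u][k], M[m][k]
            U[u][k] += lr * (err * mk - reg * uk)
            M[m][k] += lr * (err * uk - reg * mk)
after = rmse()
print(f"training RMSE before: {before:.3f}, after: {after:.3f}")
```

MLlib's ALS solves the same factorization differently: it alternates between fixing the user factors and solving a least squares problem for the movie factors (and vice versa), which parallelizes naturally across a Spark cluster.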

However, the company's IT teams have used SPSS and SPSS Modeler for their analytics for many years and have built substantial analytical assets on SPSS. They have also long used SPSS Modeler to organize their analytical workflows, and because they are heading toward analytics automation, the team prefers the approach of using SPSS on Spark.

Another reason for ZHO to adopt SPSS is to follow the Cross-Industry Standard Process for Data Mining (CRISP-DM), an industry-proven standard process for machine learning, as shown in the following diagram:

[Figure: The Cross-Industry Standard Process for Data Mining (CRISP-DM)]

SPSS on Spark

To use SPSS on Spark, we will need IBM SPSS Modeler 17.1 and IBM SPSS Analytic Server 2.1, which integrate well with Apache Spark.

Also, to use MLlib collaborative filtering in SPSS Modeler, you need to download the IBM Predictive Extensions, as described at https://developer.ibm.com/predictiveanalytics/downloads/#tab2.

To install IBM Predictive Extensions, perform the following steps:

  1. Download the extension from the download page linked previously.
  2. Close IBM SPSS Modeler. Save the .cfe file in the CDB directory, which on Windows is located by default at C:\ProgramData\IBM\SPSS\Modeler\17.1\CDB, or under your IBM SPSS Modeler installation directory.
  3. Restart IBM SPSS Modeler, and the node will now appear in the Model palette.

The following is a screenshot of IBM SPSS Modeler. As you can see, SPSS users can move nodes into the central canvas to build modeling streams and then run them to obtain results:

[Screenshot: The IBM SPSS Modeler interface]

With the SPSS-Spark integration described previously, SPSS Modeler users gain significant new capabilities: they can create new Modeler nodes that exploit MLlib algorithms and share them with others.

For example, users can use the Custom Dialog Builder to access Python for Spark. The following screenshot shows the Custom Dialog Builder being used with Python for Spark:

[Screenshot: Custom Dialog Builder for Python for Spark]

Specifically, Custom Dialog Builder adds Python for Spark support, which provides access to:

  • Spark and its machine learning library (MLlib)
  • Other common Python libraries, such as NumPy, SciPy, scikit-learn, and pandas

With this support, users can create new Modeler nodes (extensions) that exploit algorithms from MLlib and other PySpark processes.


These nodes can be shared with others to democratize access to Spark capabilities. Here, Spark becomes usable for nonprogrammers, with the code abstracted behind a GUI.
