In this chapter, we will switch our focus to SPSS on Apache Spark, as SPSS is a widely used tool for machine learning and data science computing.
Specifically, following a process similar to the one used in previous chapters, we will begin by setting up SPSS on a Spark system for a recommendation project, together with a full description of this real-life project. Then, we will select machine learning methods and prepare the data. With SPSS Analytic Server, we will estimate models on Spark and then evaluate them, with a focus on using error ratios. Finally, we will deploy the models for our client. Here are the topics that will be covered in this chapter:
In this section, we will continue to demonstrate Spark's computation speed and ease of coding for a real-life movie recommendation project, this time completed with SPSS on Apache Spark.
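To make the evaluation step mentioned above more concrete: when we evaluate the recommendation models later in the chapter, we will compare predicted ratings against actual ones using error measures. The sketch below is a minimal, plain-Python illustration on made-up numbers; the particular error-ratio definition used here (mean absolute error divided by the mean actual rating) is an assumption for illustration only, not the chapter's exact formula.

```python
# Toy evaluation sketch (hypothetical numbers, not from the project dataset).
actual    = [5.0, 3.0, 4.0, 1.0]  # ratings the users actually gave
predicted = [4.5, 3.5, 5.0, 1.0]  # ratings a model predicted

n = len(actual)

# Root mean squared error: penalizes large mistakes more heavily
rmse = (sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n) ** 0.5

# One possible "error ratio": mean absolute error relative to the
# mean actual rating (an illustrative definition, not a standard one)
mae = sum(abs(a - p) for a, p in zip(actual, predicted)) / n
error_ratio = mae / (sum(actual) / n)
```

A lower RMSE or error ratio indicates predictions closer to the observed ratings; comparing such numbers across candidate models is what drives model selection in the evaluation step.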
SPSS is a widely used software package for statistical analysis. SPSS originally stood for Statistical Package for the Social Sciences, but it is also used by market researchers, health researchers, survey companies, governments, education researchers, marketing organizations, data miners, and others. Long produced by SPSS Inc., it was acquired by IBM in 2009. Since then, IBM has further developed it and turned it into a popular tool for data scientists and machine learning professionals. To make Spark available to SPSS users, IBM developed technologies that make SPSS-Spark integration easy, which will be covered in this chapter.
This project is to help the movie rental company ZHO improve its movie recommendations for its customers.
The main dataset contains tens of millions of ratings from more than 20,000 users on more than 10,000 movies.
Using the preceding rich dataset, the client hopes to improve its recommendation engine so that the recommendations are more useful to its customers. At the same time, the company wishes to take advantage of Spark so that it can update models quickly and also take advantage of Spark's parallel computing to develop recommendations for various movie categories as per special customer segmentations.
The company's analytical team learned about using Spark MLlib for movie recommendation cases and is familiar with the related literature, such as the tutorial at http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html.
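The MLlib approach referenced above is based on alternating least squares (ALS), which factors the sparse user-movie rating matrix into user and movie latent factors. The following plain-Python sketch fits a rank-1 ALS model to a tiny made-up rating matrix; it only illustrates the alternating closed-form updates, while MLlib's real implementation uses higher-rank factors and distributes the computation across a cluster.

```python
# Rank-1 ALS on toy data (illustrative only; the data is made up).
ratings = {  # (user, movie) -> observed rating; missing pairs are unrated
    (0, 0): 5.0, (0, 1): 3.0,
    (1, 0): 4.0, (1, 2): 1.0,
    (2, 1): 1.0, (2, 2): 5.0,
}
n_users, n_movies = 3, 3
lam = 0.01  # small L2 regularization, as in regularized ALS

u = [1.0] * n_users   # one latent factor per user
v = [1.0] * n_movies  # one latent factor per movie

def rmse():
    sq = sum((r - u[i] * v[j]) ** 2 for (i, j), r in ratings.items())
    return (sq / len(ratings)) ** 0.5

initial_rmse = rmse()
for _ in range(20):
    # Movie factors fixed: each user factor has a closed-form update
    for i in range(n_users):
        obs = [(j, r) for (ui, j), r in ratings.items() if ui == i]
        u[i] = sum(r * v[j] for j, r in obs) / (sum(v[j] ** 2 for j, _ in obs) + lam)
    # User factors fixed: update each movie factor the same way
    for j in range(n_movies):
        obs = [(ui, r) for (ui, mj), r in ratings.items() if mj == j]
        v[j] = sum(r * u[i] for i, r in obs) / (sum(u[i] ** 2 for i, _ in obs) + lam)
final_rmse = rmse()

# Predict a rating for a pair the user has not rated yet
prediction = u[0] * v[2]
```

In MLlib, the analogous computation is provided by the collaborative filtering (ALS) API; in this chapter, SPSS Modeler will drive that computation for us through the extensions described later, rather than through hand-written code like this.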
However, the company's IT teams have used SPSS and SPSS Modeler for their analytics for many years and have already built a lot of analytical assets on SPSS. They have also long relied on SPSS Modeler to organize their analytical workflows, as a step toward analytics automation. For these reasons, the team prefers the approach of using SPSS on Spark.
Another reason for ZHO to adopt SPSS is to follow the Cross-Industry Standard Process for Data Mining (CRISP-DM), which is an industry-proven standard process for machine learning, as shown in the following diagram:
To use SPSS on Spark, we will need to use IBM SPSS Modeler 17.1 and IBM SPSS Analytics Server 2.1, which have good integration with Apache Spark.
Also, to adopt MLlib collaborative filtering in SPSS Modeler, you need to download IBM Predictive Extensions, as described at https://developer.ibm.com/predictiveanalytics/downloads/#tab2.
To install IBM Predictive Extensions, place the downloaded .cfe file in the CDB directory, which on Windows is located by default at C:\ProgramData\IBM\SPSS\Modeler\17.1\CDB, or under your IBM SPSS Modeler installation directory.

A more complete summary of SPSS on Spark can be found at https://developer.ibm.com/predictiveanalytics/2015/11/06/spss-algorithms-optimized-for-apache-spark-spark-algorithms-extending-spss-modeler/.
The following is a screenshot of IBM SPSS Modeler. As you can see, SPSS users can move nodes into the central canvas to build modeling streams and then run them to obtain results:
With the SPSS-Spark integration described previously, SPSS Modeler users gain many new advantages: they can create new Modeler nodes that exploit MLlib algorithms and share them.
For example, users can use the Custom Dialog Builder to access Python for Spark. The following screenshot shows the Custom Dialog Builder being used for Python for Spark:
Specifically, Custom Dialog Builder adds Python for Spark support, which provides access to:
After doing so, users can create new Modeler nodes (extensions) that exploit algorithms from MLlib and other PySpark processes.
These nodes can be shared with others to democratize access to Spark capabilities: Spark becomes usable for nonprogrammers, with the code abstracted behind a GUI.