Chapter 6. Churn Prediction on Spark

In this chapter, we will focus on applying Apache Spark's machine learning libraries, especially MLlib, to a churn prediction modeling project.

Specifically, in this chapter, we will first review the machine learning methods and the related computing needed for a churn prediction project, and then discuss how Apache Spark MLlib makes this work easy and fast. Along the way, using a real-life churn prediction example, we will illustrate the step-by-step process of predicting churn with big data. The following topics will be covered in this chapter:

  • Spark for churn prediction
  • Methods for churn prediction
  • Feature preparation
  • Model estimation
  • Model evaluation
  • Results explanation
  • Model deployment

Spark for churn prediction

In this section, we will start with a description of a real-life business use case, and then review the steps for setting up Apache Spark computing for our churn prediction project.

The use case

The YST Corporation is a large auto corporation that sells and leases vehicles to millions of customers. The company wishes to improve customer retention by applying machine learning to its big data. It understands that consumers today go through a complex decision-making process before purchasing or leasing a car, and that it is therefore becoming increasingly important to proactively identify customers who show a tendency to leave, and to take preventive interventions to retain them.

The company has collected a lot of customer satisfaction data through its dealers and service centers, as well as through its frequently conducted customer surveys. At the same time, the company has collected data on customers' online behavior from its websites, along with some social media data. Of course, the company also has transaction data for each purchase and car lease, plus a lot of data about its products and services, besides the various promotions and interventions it has implemented in the past. The goal of this machine learning project is to build a predictive model that helps the company understand how its product features and service improvements, together with promotional interventions, affect customer satisfaction and, in turn, customer churn.

To sum up, for this project we have a target variable, customer defection, and a lot of data about customer behavior, products, and services, as well as company interventions such as promotions, from which to form features as predictors.

Through some preliminary analysis, the company has identified the following data challenges:

  • The data is not ready to use; the web log data, in particular, needs to be processed to extract features ready for machine learning
  • There are many kinds of cars, with various leasing and purchasing options, sold to many kinds of customers, and the churn patterns differ considerably across these segments
  • The data exists in different silos, which need to be merged together

To deal with the aforementioned challenges in the actual process of delivering good machine learning results for this real-life project, we utilized some of the techniques presented in Chapter 3, A Holistic View on Spark, to merge all the datasets together, along with some feature extraction techniques and some distributed computing techniques discussed in the previous chapters. In this chapter, we will focus our efforts on utilizing machine learning libraries to attack these problems and produce good models.
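As an illustration of the merging step, the following is a minimal sketch using Spark DataFrame joins; the file paths, dataset contents, and the customer_id join key are hypothetical placeholders rather than the project's actual schema:

# A hypothetical sketch of merging data silos with DataFrame joins;
# the paths and the customer_id key are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ChurnDataMerge").getOrCreate()

# Each silo is assumed to be stored as a Parquet dataset keyed by customer.
transactions = spark.read.parquet("hdfs:///churn/transactions.parquet")
surveys = spark.read.parquet("hdfs:///churn/surveys.parquet")
weblog_features = spark.read.parquet("hdfs:///churn/weblog_features.parquet")

# Left joins keep every customer found in the transaction data, even those
# without survey responses or recorded web activity.
merged = (transactions
          .join(surveys, on="customer_id", how="left")
          .join(weblog_features, on="customer_id", how="left"))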

Spark computing

As seen in the preceding chapters, parallel computing is needed for this customer churn prediction project, because separate models must be estimated for the many kinds of cars and the various customer segments. For this, we need to set up clusters and worker nodes as before, and complete our Apache Spark installation.
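As a concrete starting point, here is a minimal sketch of creating a Spark session against a cluster; the master URL and the executor memory setting are placeholders for your own environment:

# A minimal sketch of connecting to a Spark cluster; the master URL and
# executor memory setting are placeholders for your own setup.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://master-host:7077")   # use "local[*]" on one machine
         .appName("ChurnPrediction")
         .config("spark.executor.memory", "4g")
         .getOrCreate())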

As discussed in the Spark overview section of Chapter 1, Spark for Machine Learning, Apache Spark provides a unified platform that consists of the Spark core engine and four libraries: Spark SQL, Spark Streaming, MLlib, and GraphX. All four libraries have Python, Java, and Scala programming APIs.

Among the four libraries, MLlib is the one that is most needed for this chapter. Besides the aforementioned built-in library MLlib, there are also many machine learning packages available for Apache Spark, as provided by third parties. One example is IBM's SystemML, which contains a lot more algorithms than those offered by MLlib. SystemML is being integrated with Apache Spark.

Because MLlib is Apache Spark's built-in machine learning library, little work is needed to prepare it, which is a great advantage over other machine learning libraries. Another advantage is that it is scalable and includes many commonly used machine learning algorithms, such as algorithms for:

  • Performing classification and regression modeling
  • Collaborative filtering
  • Performing dimensionality reduction
  • Conducting feature extraction and transformation
  • Exporting PMML models

We will need all of the preceding algorithms for this project. Spark MLlib is still under active development, with new algorithms expected to be added in every new release.
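To make this concrete, here is a hedged sketch of fitting a churn classifier with Spark's DataFrame-based ML API; the input path, the churned label column, and the three predictor columns are hypothetical stand-ins for the features prepared later in this chapter:

# A hypothetical sketch of classification for churn prediction; the input
# path and all column names are placeholders, not the project's schema.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ChurnModel").getOrCreate()

# Load a prepared customer dataset with one row per customer.
df = spark.read.parquet("hdfs:///churn/training.parquet")

# Assemble the predictor columns into the single vector column MLlib expects.
assembler = VectorAssembler(
    inputCols=["tenure_months", "satisfaction_score", "monthly_spend"],
    outputCol="features")
train = assembler.transform(df)

# Fit a logistic regression model; "churned" is assumed to be a 0/1 label.
lr = LogisticRegression(featuresCol="features", labelCol="churned")
model = lr.fit(train)
print(model.coefficients)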

To download Apache Spark, readers can go to http://spark.apache.org/downloads.html.

To install Apache Spark and start running it, readers can consult its latest documentation at http://spark.apache.org/docs/latest/.
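Once installed, a quick way to verify the setup is to start a local session and print the version; this sketch assumes PySpark is available on your Python path (for example, via pip install pyspark):

# A quick sanity check of a local installation, assuming PySpark is installed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("sanity-check").getOrCreate()
print("Spark version:", spark.version)
spark.stop()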
