Chapter 10. Learning Telco Data on Spark

In this chapter and the next, we take a different approach from the previous chapters: we start with a very large dataset and let the data lead us. In other words, we apply Spark machine learning to certain types of big datasets and let the needs of the data and the insights that emerge guide our machine learning toward useful and actionable results, taking advantage of the processes that Apache Spark makes easy and fast. In this chapter, we work on telco data; in the next chapter, we work on open data made available by various levels of government.

Following a process similar to that of the previous chapters, we will first review machine learning methods and the related Spark computing for using Telco Data to learn about customer behavior. We will then discuss how Apache Spark makes this work easy, as before. With this real-life use case of discovering customer behavior insights, we will also follow our 4E process of equation selection, estimation, evaluation, and explanation to illustrate, step by step, how we segment and score customers with this big Telco Data.

However, as you are expected to have gained knowledge of Spark computing and the related tools, including R and SPSS, by this stage, we will move among the 4Es as the machine learning requires. We will also not limit ourselves to one project or one model. Specifically, in this chapter, we will discover insights, score customers, and then build predictive models of the newly developed scores to go deeper into solving clients' problems.

Here, we will use a real-life project to illustrate our technologies and processes, with a computing focus on customer scoring and the explanation of the scores. However, what is described here is not limited to scoring customers; it can also be easily applied to other machine learning projects, such as marketing effectiveness or service quality studies. We will cover the following topics in this chapter:

  • Spark for Telco Data learning
  • Methods to learn from Telco Data
  • Data and feature development
  • Model estimation
  • Model evaluation
  • Results explanation
  • Model deployment

Spark for using Telco Data

In this section, we will start with a real-life use case of learning from Telco Data and then describe how to prepare Apache Spark computing for this real-life project of Telco Data machine learning.

The use case

Telco companies in the United States and in other regions now hold huge amounts of data, and many of them have started to consider this data their most valuable asset. They use the data not only for their own data-driven decision making, but also on behalf of their clients. Specifically, some telco companies have started using Big Data analytics to differentiate their offerings and target customers more effectively, thereby generating greater customer loyalty and taking advantage of innovative new business models. They have also used data to increase operational efficiency and improve the effectiveness of customer-experience management. To serve their corporate clients, some telco companies have used the data to segment customers in better ways and so increase marketing effectiveness.

For this exercise, the telco company VRS provided us with a big dataset to start with. The dataset contains call data and other basic information about its millions of subscribers.

However, the raw data is just a collection of codes, with values such as 1bbddf1… representing IDs and 73de6rd… representing locations. So, we need to apply some subject knowledge to make sense of these codes and then develop new features from the raw data.
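To make the idea of developing features from coded raw records more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the record layout (subscriber ID, location code, call duration in seconds), the sample values, and the build_features helper are hypothetical and are not taken from the VRS dataset or from this book's code. The sketch only shows the general pattern of collapsing many raw rows into one row of features per subscriber.

```python
from collections import defaultdict

# Hypothetical raw call records: (subscriber_id, location_code, call_seconds).
# The IDs and location codes are made-up stand-ins for the opaque codes
# described above; they are NOT values from the real dataset.
records = [
    ("sub_a", "loc_1", 120),
    ("sub_a", "loc_2", 60),
    ("sub_b", "loc_1", 300),
    ("sub_a", "loc_1", 30),
]


def build_features(rows):
    """Aggregate raw call rows into one feature dict per subscriber."""
    totals = defaultdict(
        lambda: {"n_calls": 0, "total_seconds": 0, "locations": set()}
    )
    for sub_id, location, seconds in rows:
        entry = totals[sub_id]
        entry["n_calls"] += 1
        entry["total_seconds"] += seconds
        entry["locations"].add(location)

    # Turn the running totals into the final per-subscriber features.
    features = {}
    for sub_id, entry in totals.items():
        features[sub_id] = {
            "n_calls": entry["n_calls"],
            "avg_call_seconds": entry["total_seconds"] / entry["n_calls"],
            "n_locations": len(entry["locations"]),
        }
    return features


features = build_features(records)
for sub_id in sorted(features):
    print(sub_id, features[sub_id])
```

On a dataset with millions of subscribers, the same aggregation would normally be expressed as a group-by over a Spark DataFrame rather than a Python loop; the loop is used here only so that the logic can be read and run without a cluster.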

The telco company is interested in learning any useful insights from the data. So, we were asked to explore what could be learned from the data and then, if possible, help them build models to predict customer churn, Call Center calls, and purchase propensities. Once these scores are built, it is also helpful for the client to understand what affects them. So, the project is a very practical one: it is data driven and problem driven. We have a strong interest in showcasing our Apache Spark technologies, but the client is only interested in how these technologies can help discover new and useful insights faster and better.

Spark computing

As discussed in the Spark computing section of Chapter 8, Learning Analytics on Spark, you may choose one of the following approaches for our kind of project: Spark on the Databricks platform, Spark on IBM Data Scientist Workbench, SPSS on Spark, or Apache Spark with MLlib alone. You have learned the details of utilizing them in the previous chapters, mainly from Chapter 3, A Holistic View on Spark to Chapter 7, Recommendations on Spark.

Any one of these four approaches should work very well for this project of learning from Telco Data. In particular, you may take the code developed in this chapter and put it into a separate notebook. You can then run the notebook with any of the approaches mentioned earlier. Notebooks are preferred for all the approaches except SPSS on Spark.

As an exercise, and to best fit our data and project goals, we will focus on the third and fourth approaches: SPSS on Spark and Apache Spark with MLlib. However, we will also use an R notebook on the Databricks platform, as moving cleaned datasets around is not a difficult task.

To use SPSS on Spark, we will need IBM SPSS Modeler 17.1 and IBM SPSS Analytics Server 2.1, which integrates well with Apache Spark. The following screenshot shows the SPSS Modeler for MLlib node creation:

[Screenshot: SPSS Modeler for MLlib node creation]

Through this integration, data science users of SPSS Modeler can now create new Modeler nodes that exploit MLlib algorithms and share them, which lets us combine the third and fourth approaches in our implementation.
