Data and feature development

In the Feature extraction section of Chapter 2, Data Preparation for Spark ML, we reviewed a few methods for feature extraction and discussed their implementation on Apache Spark. All the techniques discussed there will be applied to our datasets here.

Besides feature development, for this project we will also need to spend considerable effort merging various datasets together to obtain more features.

Therefore, for this project, we need to conduct feature development, then data merging and reorganizing, and finally feature selection, which means utilizing all the techniques discussed in Chapter 2, Data Preparation for Spark ML, and also in Chapter 3, A Holistic View on Spark. Using the techniques described earlier, a lot of work has already been completed to produce several good datasets for this big project.

As an exercise, we will focus on some of the key tasks: reorganizing the data per day, merging the datasets, and finally conducting feature selection to obtain a good set of features for machine learning.

Data reorganizing

To obtain more good features for prediction and to better serve the clients of the telco company, we need to add some external datasets, including customer purchase data and some open data.

In the Joining data section of Chapter 2, Data Preparation for Spark ML, we described methods to join data with Spark SQL and other tools. All the techniques described there, as well as the identity matching and data cleaning techniques from the same chapter, could be used here.

For this data-reorganizing task, the main focus is (1) to aggregate the data by date, per day, and (2) to aggregate the data by location, per zip code as well as per location type. That is, first we reorganize all the data into one dataset with features per day, such as the number of calls per day and other daily features. The second task is similar, but reorganizes all the data into another dataset with features per location, here per zip code, such as the number of calls per zip code. For how to reorganize datasets, you may refer to the Data reorganizing section of Chapter 2, Data Preparation for Spark ML.
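In Spark itself, these two aggregations can be done with the DataFrame API. The following is a minimal sketch, assuming the call detail records have been loaded into a DataFrame named callRecords with hypothetical columns call_date, zip_code, call_minutes, and dropped:

import org.apache.spark.sql.functions._

// callRecords: DataFrame of call detail records (hypothetical schema:
// call_date, zip_code, call_minutes, dropped)

// Aggregate per day: the number of calls and other daily measures
val dailyData = callRecords
  .groupBy("call_date")
  .agg(
    count("*").alias("total_calls"),
    sum("call_minutes").alias("total_minutes"),
    sum("dropped").alias("total_calls_dropped"))

// Aggregate per location: the same measures per zip code
val locationData = callRecords
  .groupBy("zip_code")
  .agg(
    count("*").alias("total_calls"),
    sum("call_minutes").alias("total_minutes"),
    sum("dropped").alias("total_calls_dropped"))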

Specifically, all the tools to be used have good functions for data aggregation.

SPSS has an AGGREGATE function, for which we just need to specify the date or location as a break variable and specify the sum or mean as the function to create the new data.

R also has an aggregate function, for which we use the by argument to specify the date as a break and the FUN argument to specify the function to create the new data.

After these two datasets are created, we can merge the first dataset with the customer data.
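In Spark, this merging step is a join. The following is a minimal sketch, assuming the customer purchase data has also been aggregated per day into a DataFrame named customerDailyData that shares the hypothetical call_date key with dailyData from the previous sketch:

// Left outer join keeps every day from the daily dataset, even if no
// purchase data exists for that day; adjust the key to the actual schemas
val mergedDailyData = dailyData.join(customerDailyData, Seq("call_date"), "left_outer")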

Feature development and selection

After merging the location-related and time-related datasets, as described in the previous section, we have more than 100 features ready to be used.

As for feature selection for this project, we could follow what we used in Chapter 8, Learning Analytics on Spark, which is to utilize Principal Component Analysis (PCA) along with subject knowledge to group features, and then apply machine learning for the final feature selection. However, as an exercise, rather than repeating what you learned there, we will try something different: we will let the machine learning algorithms pick up the features that are most useful for prediction.

In MLlib, we can use the ChiSqSelector algorithm as follows:

import org.apache.spark.mllib.feature.ChiSqSelector

// Create a ChiSqSelector that will select the top 25 features
val selector = new ChiSqSelector(25)
// Fit a ChiSqSelector model on the training data, an RDD[LabeledPoint]
val transformer = selector.fit(trainingData)
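The fitted model can then be applied to keep only the selected features. The following is a minimal sketch, assuming trainingData is an RDD[LabeledPoint]:

import org.apache.spark.mllib.regression.LabeledPoint

// Keep only the 25 selected features in each labeled point
val filteredData = trainingData.map { lp =>
  LabeledPoint(lp.label, transformer.transform(lp.features))
}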

In R, we can use some packages to make this computation easy. Among the available packages, caret is one of the most commonly used.

With all the work described in the preceding sections, Data reorganizing and Feature development and selection, done, we end up with a dataset with the following features to be used:

  • Basic info: location – state, account service length, area code, phone number, phone mftr, international call plan, and voice mail plan
  • Usage info: number vmail messages daily, total day minutes, total day calls, total calls dropped, total day charge, total eve minutes, total eve calls, total eve charge, total night minutes, total night calls, total night charge, total intl minutes, total intl calls, total intl charge, number Call Center calls, and call locations

Especially for our study, we also have a special feature indicating whether or not the subscriber churned, which will be the essential target variable for our core supervised machine learning. In the preceding list of features, the second to last is number Call Center calls, which will also be used as a target variable for some of our supervised machine learning.
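To prepare these features and the churned flag for MLlib's supervised algorithms, the merged data can be turned into labeled points. The following is a minimal sketch, assuming a DataFrame named mergedData with a hypothetical churned column and numeric feature columns:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Hypothetical subset of the numeric feature columns listed above
val featureColumns = Seq("total day minutes", "total day calls", "total intl charge")

// Use the churned flag as the label and the selected columns as features
val labeledData = mergedData.rdd.map { row =>
  val label = if (row.getAs[String]("churned") == "yes") 1.0 else 0.0
  val features = featureColumns.map(c => row.getAs[Double](c)).toArray
  LabeledPoint(label, Vectors.dense(features))
}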
