Feature preparation

In the previous section, we selected our models and prepared the dependent variable for our supervised machine learning. In this section, we move on to preparing our independent variables, which are all the features representing the factors that impact our dependent variable: sales team success. Specifically, for this important work, we need to reduce our roughly 400 features to a manageable set for final modeling. To do this, we will employ PCA, bring in some subject-matter knowledge, and then perform feature selection.

PCA

PCA is a mature and commonly used dimensionality reduction method that finds a small set of variables accounting for most of the variance in the data. Technically, the goal of PCA is to find a low-dimensional subspace that captures as much of the variance of a dataset as possible.
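To make this goal concrete, here is a minimal Python/NumPy sketch (the toy data are hypothetical, purely for illustration): the centered data's covariance matrix is diagonalized, and each eigenvalue's share of the total gives the variance captured by that component.

```python
import numpy as np

# Toy data (hypothetical, for illustration only): five features, where two
# pairs of columns are driven by two underlying signals plus small noise
rng = np.random.default_rng(0)
b = rng.normal(size=(100, 2))
X = np.column_stack([
    b[:, 0],                                  # signal 1
    b[:, 0] + 0.1 * rng.normal(size=100),     # noisy copy of signal 1
    b[:, 1],                                  # signal 2
    2 * b[:, 1],                              # scaled copy of signal 2
    0.5 * rng.normal(size=100),               # pure noise
])

# Center the data and diagonalize its covariance matrix
Xc = X - X.mean(axis=0)
eigvals = np.linalg.eigvalsh(np.cov(Xc, rowvar=False))[::-1]  # descending
explained = eigvals / eigvals.sum()           # variance share per component

print(explained)  # the first two components capture most of the variance
```

Here the first two principal components recover the two underlying signals, so a five-dimensional dataset is well represented by a two-dimensional subspace.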

If you are using MLlib, http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html#principal-component-analysis-pca provides example code that you can adopt and modify to run PCA on Spark. For more on MLlib, go to https://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html.

For this project, we will use R, mainly for its rich set of PCA implementations. In R, there are at least five functions to compute a PCA, which are as follows:

  • prcomp() (stats)
  • princomp() (stats)
  • PCA() (FactoMineR)
  • dudi.pca() (ade4)
  • acp() (amap)

The prcomp() and princomp() functions from the base package stats are the most commonly used, and they come with good supporting functions for summarizing and plotting the results. Therefore, we will use these two functions.
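The two functions compute the same decomposition in different ways: prcomp() takes an SVD of the centered data matrix, while princomp() eigendecomposes the covariance matrix. A hedged Python/NumPy sketch of the prcomp()-style computation (the data matrix is hypothetical):

```python
import numpy as np

# prcomp()-style PCA: SVD of the centered data matrix
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))            # hypothetical 50 x 4 data matrix
Xc = X - X.mean(axis=0)                 # prcomp() centers by default
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

sdev = s / np.sqrt(X.shape[0] - 1)      # analogue of prcomp()'s $sdev
scores = Xc @ Vt.T                      # analogue of prcomp()'s $x (scores)

print(sdev)                             # component standard deviations
```

The rows of Vt play the role of prcomp()'s $rotation (the loadings), and the sample variance of each score column equals the corresponding squared $sdev entry.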

Grouping by category to use subject knowledge

As is almost always the case, feature reduction results can be improved greatly if some subject-matter knowledge is applied.

For our example, the data categories are a good place to start. They are:

  • Marketing
  • Training
  • Promotion
  • Team administration
  • Staffing
  • Products

So, we will run six PCA computations, one for each data category. For example, for the Team category, we run PCA on its 73 features or variables to identify factors or dimensions that can fully represent the information we have about the team. In this exercise, we found two dimensions for the Team category's 73 features.

Similarly, for the Staffing category, we run PCA on its 103 features or variables to identify the factors or dimensions that can fully represent the information we have about staffing. In this exercise, we again found two dimensions for the Staffing category's 103 features. Take a look at the following table:

Category     Number of Factors   Factor Names
Team         2                   T1, T2
Marketing    3                   M1, M2, M3
Training     3                   Tr1, Tr2, Tr3
Staffing     2                   S1, S2
Product      4                   P1, P2, P3, P4
Promotion    3                   Pr1, Pr2, Pr3
Total        17

At the end of this PCA exercise, we obtained two to four factors for each category, as summarized in the preceding table.
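As an illustration of how such per-category factor counts can emerge, here is a hedged Python/NumPy sketch. The 90% cumulative-variance cut-off and the simulated Team data are assumptions for illustration; the chapter does not state the exact criterion it used.

```python
import numpy as np

def n_factors(X, threshold=0.9):
    """Smallest number of principal components whose cumulative explained
    variance reaches `threshold` (the 90% cut-off is an assumption)."""
    s = np.linalg.svd(X - X.mean(axis=0), compute_uv=False)
    ratios = s**2 / np.sum(s**2)
    return int(np.searchsorted(np.cumsum(ratios), threshold) + 1)

# Hypothetical "Team" category: 73 observed features driven by 2 latent factors
rng = np.random.default_rng(2)
latent = rng.normal(size=(200, 2))            # the two underlying factors
team = latent @ rng.normal(size=(2, 73)) + 0.05 * rng.normal(size=(200, 73))

print(n_factors(team))  # two components suffice, as with Team in the table
```

Run per category, a rule of this kind reduces each block of dozens of raw features to the small factor counts listed in the table.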

Feature selection

Feature selection is primarily used to remove redundant or irrelevant features, and it is worthwhile for at least the following reasons:

  • Making models easier to understand
  • Creating fewer chances for overfitting
  • Saving time and space for model estimation

In MLlib, we can use the ChiSqSelector algorithm, as follows:

import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.regression.LabeledPoint

// trainingData is an RDD[LabeledPoint] holding the 400 candidate features
// Create a ChiSqSelector that will select the top 25 of the 400 features
val selector = new ChiSqSelector(25)
// Fit the ChiSqSelector model (this selects the features)
val transformer = selector.fit(trainingData)
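For comparison outside Spark, the same chi-squared selection idea can be sketched in Python with scikit-learn; the data, the binary target, and k=5 here are all hypothetical.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Hypothetical data: 200 samples, 20 non-negative features (the chi-squared
# test requires non-negative inputs) and a binary target tied to feature 0
rng = np.random.default_rng(3)
X = rng.integers(0, 10, size=(200, 20)).astype(float)
y = (X[:, 0] > 4).astype(int)

selector = SelectKBest(chi2, k=5)     # keep the 5 highest-scoring features
X_new = selector.fit_transform(X, y)

print(X_new.shape)                    # (200, 5)
print(selector.get_support(indices=True))
```

Because the target was built from feature 0, that feature scores highest and survives the cut, which is exactly the behavior we want from a relevance-based selector.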

In R, several packages make this computation easy; among them, caret is one of the most commonly used.

First, as an exercise, we performed feature selection on all 400 features.

Then, we ran feature selection on the features produced by our PCA work, and it turned out that we could keep all of them.

Therefore, in the end, we have 17 features to use, which are as follows:

  • T1, T2 for Team
  • M1, M2, M3 for Marketing
  • Tr1, Tr2, Tr3 for Training
  • S1, S2 for Staffing
  • P1, P2, P3, P4 for Product
  • Pr1, Pr2, Pr3 for Promotion

Note

For more about feature selection on Spark, go to http://spark.apache.org/docs/latest/mllib-feature-extraction.html.
