In the previous section, we selected our models and prepared the dependent variable for our supervised machine learning. In this section, we move on to preparing our independent variables: the features representing the factors that impact our dependent variable, sales team success. Specifically, for this important work, we need to reduce our 400 features to a reasonable set for final modeling. To do this, we will employ PCA, apply some subject knowledge, and then perform feature selection.
PCA is a mature and commonly used feature reduction method that is often employed to find a small set of variables accounting for most of the variance. Technically, the goal of PCA is to find a low-dimensional subspace that captures as much of the variance of a dataset as possible.
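The idea can be sketched quickly in Python (an illustration with scikit-learn and synthetic data, not code from this project): when the signal lives in a low-dimensional subspace, the first few principal components account for almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples, 10 observed features, but the signal lives in a 2-D subspace
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

pca = PCA(n_components=10).fit(X)
# The first two components should account for nearly all the variance
print(pca.explained_variance_ratio_[:2].sum())
```

The `explained_variance_ratio_` attribute is the usual diagnostic for deciding how many components to keep.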
If you are using MLlib, http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html#principal-component-analysis-pca provides example code that users may adopt and modify to run PCA on Spark. For more on dimensionality reduction in MLlib, go to https://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html.
Here, for this project, we will use only R, for its rich set of PCA algorithms. In R, there are at least five functions to compute PCA, which are as follows:

prcomp() (stats)

princomp() (stats)

PCA() (FactoMineR)

dudi.pca() (ade4)

acp() (amap)

The prcomp() and princomp() functions from the base package stats are commonly used, and they come with good companion functions for summarizing and plotting results. Therefore, we will use these two functions.
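As an aside, what prcomp() computes can be sketched in Python (a minimal illustration, not project code): prcomp() centers the data and takes a singular value decomposition, so an equivalent can be written with NumPy.

```python
import numpy as np

def prcomp_like(X):
    """Mimic R's prcomp(): center, then SVD."""
    Xc = X - X.mean(axis=0)                 # prcomp() centers by default
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                          # analogous to prcomp()$x
    sdev = s / np.sqrt(X.shape[0] - 1)      # analogous to prcomp()$sdev
    return scores, Vt.T, sdev               # scores, rotation, sdev

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
scores, rotation, sdev = prcomp_like(X)
# sdev**2 gives the component variances, as in summary(prcomp(X))
```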
As is always the case, feature reduction results can be greatly improved when subject knowledge is applied.
For our example, the data categories are a good place to start. They are: Team, Marketing, Training, Staffing, Product, and Promotion.
So, we will execute six PCA algorithms, one for each data category. For example, for the Team category, we need to run a PCA algorithm on 73 features or variables to identify factors or dimensions that can fully represent the information we have about Team. For this exercise, we found two dimensions for the Team category's 73 features.
Similarly, for the Staffing category, we need to execute a PCA algorithm on 103 features or variables to identify the factors or dimensions that can fully represent the information we have about Staffing. For this exercise, we also found two dimensions for the Staffing category's 103 features. Take a look at the following table:
Category | Number of Factors | Factor Names
---|---|---
Team | 2 | T1, T2
Marketing | 3 | M1, M2, M3
Training | 3 | Tr1, Tr2, Tr3
Staffing | 2 | S1, S2
Product | 4 | P1, P2, P3, P4
Promotion | 3 | Pr1, Pr2, Pr3
Total | 17 |
At the end of this PCA exercise, we obtained two to four features for each category, as summarized in the preceding table.
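The per-category step can be sketched as follows (a hypothetical Python illustration with synthetic data; the 90% variance threshold and the two-factor structure here are assumptions for the sketch, not the book's actual criteria):

```python
import numpy as np
from sklearn.decomposition import PCA

def n_factors(X, threshold=0.9):
    """Number of components needed to reach the cumulative variance threshold."""
    ratios = PCA().fit(X).explained_variance_ratio_
    return int(np.searchsorted(np.cumsum(ratios), threshold) + 1)

rng = np.random.default_rng(2)
# Synthetic "Team" block: 73 observed features driven by 2 latent factors
latent = rng.normal(size=(300, 2))
X_team = latent @ rng.normal(size=(2, 73)) + 0.1 * rng.normal(size=(300, 73))
print(n_factors(X_team))
```

Running such a routine on each category's feature block yields the small per-category factor counts summarized in the table above.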
Feature selection is often used to remove redundant or irrelevant features; it also simplifies models and reduces training time.
In MLlib, we can use the ChiSqSelector algorithm, as follows:

```scala
import org.apache.spark.mllib.feature.ChiSqSelector

// Create a ChiSqSelector that will select the top 25 of 400 features
val selector = new ChiSqSelector(25)
// Create a ChiSqSelector model (selecting features) by fitting the training data
val transformer = selector.fit(TrainingData)
```
In R, several packages make this computation easy. Among the available packages, caret is one of the most commonly used.
First, as an exercise, we performed feature selection on all 400 features.
Then, we applied feature selection to the features obtained from our PCA work, and the results indicated that all of them could be kept.
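For readers who work in Python, this kind of chi-squared selection can be sketched with scikit-learn's SelectKBest (an illustration on made-up data; the book itself uses MLlib's ChiSqSelector and R's caret):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(3)
# 200 samples, 40 non-negative count-like features (chi2 requires non-negative X)
X = rng.integers(0, 10, size=(200, 40)).astype(float)
# Label driven by features 0 and 1 only; the rest are noise
y = (X[:, 0] + X[:, 1] > 9).astype(int)

selector = SelectKBest(chi2, k=5).fit(X, y)
top = selector.get_support(indices=True)
print(sorted(top.tolist()))  # the informative features 0 and 1 should rank highly
```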
Therefore, at the end, we have 17 features to use, which are as follows:
Features |
---|
T1, T2 for Team |
M1, M2, M3 for Marketing |
Tr1, Tr2, Tr3 for Training |
S1, S2 for Staffing |
P1, P2, P3, P4 for Product |
Pr1, Pr2, Pr3 for Promotion |
For more about feature selection on Spark, go to http://spark.apache.org/docs/latest/mllib-feature-extraction.html.