In the previous section, we selected our models and prepared the dependent variable for our supervised machine learning. In this section, we move on to preparing our independent variables: the features representing the factors that impact our dependent variable, sales team success. Specifically, for this important work, we need to reduce our 400 features to a reasonable set for final modeling. To do this, we will employ PCA, apply some subject knowledge, and then perform feature selection.
PCA is a mature and commonly used feature reduction method that is often employed to find a small set of variables accounting for most of the variance. Technically, the goal of PCA is to find a low-dimensional subspace that captures as much of the variance of a dataset as possible.
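The idea can be sketched quickly in Python (an illustration with scikit-learn and synthetic data, not code from this project): when the signal lives in a low-dimensional subspace, the first few principal components account for almost all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# 200 samples, 10 observed features, but the signal lives in a 2-D subspace
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

pca = PCA(n_components=10).fit(X)
# The first two components should account for nearly all the variance
print(pca.explained_variance_ratio_[:2].sum())
```

The `explained_variance_ratio_` attribute is the usual diagnostic for deciding how many components to keep.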
If you are using MLlib, http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html#principal-component-analysis-pca provides example code that users may adopt and modify to run PCA on Spark. For more on dimensionality reduction in MLlib, go to https://spark.apache.org/docs/1.2.1/mllib-dimensionality-reduction.html.
Here, for this project, we will use only R, for its rich set of PCA algorithms. In R, there are at least five functions to compute PCA, which are as follows:

prcomp() (stats)

princomp() (stats)

PCA() (FactoMineR)

dudi.pca() (ade4)

acp() (amap)

The prcomp() and princomp() functions from the base package stats are commonly used, and they come with good companion functions for summarizing and plotting results. Therefore, we will use these two functions.
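As an aside, what prcomp() computes can be sketched in Python (a minimal illustration, not project code): prcomp() centers the data and takes a singular value decomposition, so an equivalent can be written with NumPy.

```python
import numpy as np

def prcomp_like(X):
    """Mimic R's prcomp(): center, then SVD."""
    Xc = X - X.mean(axis=0)                 # prcomp() centers by default
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U * s                          # analogous to prcomp()$x
    sdev = s / np.sqrt(X.shape[0] - 1)      # analogous to prcomp()$sdev
    return scores, Vt.T, sdev               # scores, rotation, sdev

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))
scores, rotation, sdev = prcomp_like(X)
# sdev**2 gives the component variances, as in summary(prcomp(X))
```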
As is always the case, feature reduction results can be greatly improved when subject knowledge is applied.
For our example, the data categories are a good place to start. They are: Team, Marketing, Training, Staffing, Product, and Promotion.
So, we will execute six PCA algorithms, one for each data category. For example, for the Team category, we need to run a PCA algorithm on 73 features or variables to identify factors or dimensions that can fully represent the information we have about Team. For this exercise, we found two dimensions for the Team category's 73 features.
Similarly, for the Staffing category, we need to execute a PCA algorithm on 103 features or variables to identify the factors or dimensions that can fully represent the information we have about Staffing. For this exercise, we also found two dimensions for the Staffing category's 103 features. Take a look at the following table:
Category | Number of Factors | Factor Names
---|---|---
Team | 2 | T1, T2
Marketing | 3 | M1, M2, M3
Training | 3 | Tr1, Tr2, Tr3
Staffing | 2 | S1, S2
Product | 4 | P1, P2, P3, P4
Promotion | 3 | Pr1, Pr2, Pr3
Total | 17 |
At the end of this PCA exercise, we obtained two to four features for each category, as summarized in the preceding table.
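The per-category step can be sketched as follows (a hypothetical Python illustration with synthetic data; the 90% variance threshold and the two-factor structure here are assumptions for the sketch, not the book's actual criteria):

```python
import numpy as np
from sklearn.decomposition import PCA

def n_factors(X, threshold=0.9):
    """Number of components needed to reach the cumulative variance threshold."""
    ratios = PCA().fit(X).explained_variance_ratio_
    return int(np.searchsorted(np.cumsum(ratios), threshold) + 1)

rng = np.random.default_rng(2)
# Synthetic "Team" block: 73 observed features driven by 2 latent factors
latent = rng.normal(size=(300, 2))
X_team = latent @ rng.normal(size=(2, 73)) + 0.1 * rng.normal(size=(300, 73))
print(n_factors(X_team))
```

Running such a routine on each category's feature block yields the small per-category factor counts summarized in the table above.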
Feature selection is often used to remove redundant or irrelevant features; it also simplifies models and reduces training time.
In MLlib, we can use the ChiSqSelector algorithm, as follows:

```scala
import org.apache.spark.mllib.feature.ChiSqSelector

// Create a ChiSqSelector that will select the top 25 of 400 features
val selector = new ChiSqSelector(25)
// Create a ChiSqSelector model (selecting features) by fitting the training data
val transformer = selector.fit(TrainingData)
```
In R, several packages make this computation easy. Among the available packages, caret is one of the most commonly used.
First, as an exercise, we performed feature selection on all 400 features.
Then, we applied feature selection to the features obtained from our PCA work, and the results indicated that all of them could be kept.
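For readers who work in Python, this kind of chi-squared selection can be sketched with scikit-learn's SelectKBest (an illustration on made-up data; the book itself uses MLlib's ChiSqSelector and R's caret):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(3)
# 200 samples, 40 non-negative count-like features (chi2 requires non-negative X)
X = rng.integers(0, 10, size=(200, 40)).astype(float)
# Label driven by features 0 and 1 only; the rest are noise
y = (X[:, 0] + X[:, 1] > 9).astype(int)

selector = SelectKBest(chi2, k=5).fit(X, y)
top = selector.get_support(indices=True)
print(sorted(top.tolist()))  # the informative features 0 and 1 should rank highly
```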
Therefore, at the end, we have 17 features to use, which are as follows:
Features |
---|
T1, T2 for Team |
M1, M2, M3 for Marketing |
Tr1, Tr2, Tr3 for Training |
S1, S2 for Staffing |
P1, P2, P3, P4 for Product |
Pr1, Pr2, Pr3 for Promotion |
For more about feature selection on Spark, go to http://spark.apache.org/docs/latest/mllib-feature-extraction.html.