Methods for a holistic view

As discussed in the previous section, we now need to select our analytical methods or models (equations) to complete the task of mapping our business use case to machine learning methods.

To assess the impact of various factors on the sales team's success, there are many suitable models for us to use. As an exercise, we will select (a) regression models, (b) structural equation models, and (c) decision trees, mainly for their ease of interpretation as well as their implementability on Spark.

Once we have finalized our choice of analytical methods or models, we will need to prepare the dependent variable and also prepare for coding; we will discuss these one by one in the following sections.

Regression modeling

To get ready for regression modeling on Spark, there are three issues for us to take care of:

  • Linear regression or logistic regression.

    Regression is the most mature and also the most widely used model to represent the impact of various factors on one dependent variable. Whether to use linear regression or logistic regression depends on whether the relationship is linear or not. Here, we are not sure, so we will adopt both and then compare their results to decide on which to deploy.

  • Preparing the dependent variable.

    In order to use logistic regression, we need to recode the target variable or dependent variable (the sales team's success variable, currently a rating from 0 to 100) to be 0 versus 1 by splitting it at the median value; a minimal Scala sketch of this recoding is shown right after this list.

  • Preparing coding.

    For MLlib, we will use Spark MLlib's linear regression with stochastic gradient descent (LinearRegressionWithSGD), for which the following code can be used:

    import org.apache.spark.mllib.regression.LinearRegressionWithSGD

    // Train a linear regression model on the prepared training data
    val numIterations = 90
    val model = LinearRegressionWithSGD.train(trainingData, numIterations)

    For logistic regression, we use the following code:

    import org.apache.spark.mllib.classification.LogisticRegressionWithSGD

    // Binary logistic regression; the labels in trainingData must be 0.0 or 1.0
    val model = LogisticRegressionWithSGD.train(trainingData, numIterations)
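The logistic regression (and the decision tree classifier later in this section) expects the dependent variable as a binary 0/1 label rather than the original 0 to 100 rating. The following is a minimal sketch of this preparation, assuming an RDD named rawData that holds (rating, feature vector) pairs; the names rawData, trainingData, and testData are illustrative placeholders rather than part of any Spark API:

    import org.apache.spark.mllib.linalg.Vector
    import org.apache.spark.mllib.regression.LabeledPoint

    // rawData: RDD[(Double, Vector)] of (success rating 0-100, feature vector)
    // Use the median rating as the cut-off between success (1.0) and not (0.0);
    // collecting the ratings to the driver is fine for a modestly sized dataset
    val ratings = rawData.map(_._1).collect().sorted
    val median = ratings(ratings.length / 2)

    val labeledData = rawData.map { case (rating, features) =>
      LabeledPoint(if (rating >= median) 1.0 else 0.0, features)
    }

    // Hold out part of the data so the fitted models can be compared later
    val Array(trainingData, testData) = labeledData.randomSplit(Array(0.7, 0.3), seed = 12345L)

For linear regression, the original rating can be kept as a continuous label in the same LabeledPoint structure instead of the 0/1 recoding.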

For more information about using MLlib for regression modeling, go to http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-least-squares-lasso-and-ridge-regression.

In R, we can use the lm function for linear regression, and the glm function for logistic regression with family=binomial().

The SEM approach

To get ready for Structural Equation Modeling (SEM) on Spark, there are also three issues for us to take care of:

  • SEM introduction and specification.

    SEM may be considered an extension of regression modeling, as it consists of several linear equations similar to regression equations. However, this method estimates all the equations at the same time, taking their internal relations into account, so it is less biased than regression modeling. SEM consists of both structural modeling and latent variable modeling; here, we will only use structural modeling.

  • Preparing the dependent variable.

    We can just use the sales team's success scale (with a rating of 0 to 100) as our target variable here.

  • Preparing the coding.

    We will adopt the R notebook within the Databricks environment, for which we should use the sem R package. There are other SEM packages, such as lavaan, available for use; however, for this project, we will use the sem package for its ease of learning.

    To install the sem package for use in the R notebook, we will use install.packages("sem", repos="http://R-Forge.R-project.org"). Then, we need to load it by executing library(sem).

    After this, we need to use the specifyModel() function to specify our model in the R notebook, for which code like the following is needed (the model specification is ended with a blank line):

    mod.no1 <- specifyModel()
    s1 <- x1, gam31, NA
    s1 <- x2, gam32, NA
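The two model lines above say, assuming s1 stands for the sales team's success measure and x1 and x2 are two of the impact factors, that s1 is regressed on x1 and x2 with freely estimated path coefficients named gam31 and gam32; written as an equation, s1 = gam31*x1 + gam32*x2 + error. Each additional impact factor is added as another line of the same form.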

Decision trees

To get ready for the decision tree modeling on Spark, there are again three issues for us to take care of:

  • Decision tree selection

    Decision trees model the classification of cases; in our use case, this means classifying sales teams as successful or not successful through a sequence of splits. The decision tree is also one of the most mature and widely used methods. It can, however, lead to overfitting, which requires regularization techniques, such as pruning, applied afterwards. For this exercise, we will only use a simple decision tree and not venture into more complicated methods such as random forests.

  • Preparing the dependent variable

    To use the decision tree model here, we will separate the sales team ratings into two categories—SUCCESS and NOT—as we did for logistic regression.

  • Preparing the coding

    For MLlib, we can use the following code:

    import org.apache.spark.mllib.tree.DecisionTree

    // Train a binary decision tree classifier with the Gini impurity measure
    val numClasses = 2
    val categoricalFeaturesInfo = Map[Int, Int]()
    val impurity = "gini"
    val maxDepth = 6
    val maxBins = 32
    val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
      impurity, maxDepth, maxBins)

For more information on using MLlib for decision trees, go to http://spark.apache.org/docs/latest/mllib-decision-tree.html.
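Once the classifier has been trained, it is worth checking how well it separates successful teams from unsuccessful ones on held-out data. The following is a minimal sketch, assuming a testData RDD of LabeledPoint built in the same way as trainingData (for example, with the recoding and randomSplit shown earlier in this section):

    // Score the held-out data and compute the misclassification rate
    val labelAndPreds = testData.map { point =>
      (point.label, model.predict(point.features))
    }
    val testErr = labelAndPreds.filter { case (label, pred) => label != pred }.count().toDouble /
      testData.count()
    println(s"Test error = $testErr")

The same pattern can be applied to the logistic regression model trained earlier, which gives us a simple way to compare the two classifiers and decide which one to deploy.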

As for the R notebook on Spark, we need to use the rpart R package and its rpart function for all the calculations. For rpart, we need to specify the target class and all the features to be used in the model formula.
