Model estimation

Once the feature sets were finalized in the last section, the next step is to estimate all the parameters of the selected models. For this, we adopted a dynamic approach: SPSS on Spark, R notebooks in the Databricks environment, and MLlib directly on Spark. To organize our workflows better, we focused on consolidating all the code into R notebooks and on coding SPSS Modeler nodes.

As mentioned earlier, this project requires some exploratory analysis for descriptive statistics and visualization. For this, we can implement the MLlib code directly; with R code, we also obtained good results quickly.

For the best modeling, we need to arrange distributed computing, especially in this case, where various locations are combined with various segments of parent customers. The United States has 13,506 school districts across 50 states, and the differences between states are quite large. For this distributed-computing part, refer to the previous chapters; we will use SPSS Analytics Server with Apache Spark as well as the Databricks environment.

As discussed in the Spark for learning from open data section, we mainly use regression for the supervised machine learning part. From what you have learned so far, the model estimation for regression can be completed either with SPSS or with R. To implement them in the Databricks environment, refer to Chapter 3, A Holistic View on Spark, and Chapter 5, Risk Scoring on Spark.

SPSS on Spark – SPSS Analytics Server

IBM SPSS Modeler 17.1 and Analytic Server 2.1 integrate easily with Spark, which allows us to implement the data and modeling streams built so far with little effort.

Besides using SPSS Modeler, together with SPSS Analytics Server, to estimate these predictive models, we also used R notebooks in the Databricks environment and in Data Scientist WorkBench.

With this, as an example, we obtained a cluster analysis plot:

[Cluster analysis plot]

As an example, with R, we obtained a PCA plot:

[PCA plot]

Model evaluation

In the previous section, we completed our model estimation as well as some exploratory work. Now it is time to evaluate these estimated models to see whether they meet our criteria, so that we can either move on to the next stage of explaining the results or go back to earlier stages to refine our predictive models.

To perform our model evaluation in this section, we conducted evaluations for the cluster analysis and for the PCA. However, our focus is still on assessing the predictive models: the regression models with rankings as our target variables. For this task, we will mainly use the Root Mean Square Error (RMSE), as it is well suited to assessing regression models.
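To make the criterion concrete, here is a minimal sketch of the RMSE computation in plain Python (the function name and sample values are illustrative only): square each residual, average the squares, and take the square root.

```python
import math

def rmse(actuals, predictions):
    """Root Mean Square Error: square root of the mean squared residual."""
    residuals = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(residuals) / len(residuals))

# Hypothetical observed vs. predicted rankings
print(rmse([3.0, 1.0, 4.0], [2.0, 1.0, 6.0]))
```

A smaller RMSE means the predictions sit closer, on average, to the observed values, which is why we use it below to compare candidate models.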

Just as we did for model estimation, to calculate RMSEs we use MLlib for regression modeling on Spark, along with R notebooks implemented in the Databricks environment for Spark. Of course, we also used SPSS Analytics Server, in keeping with our dynamic approach.

RMSE calculations with MLlib

As we have used with good results in the past, for MLlib we can calculate the RMSE with the following code:

// Pair each test point's observed label with the model's prediction
val valuesAndPreds = test.map { point =>
  val prediction = new_model.predict(point.features)
  (point.label, prediction)
}

// Squared residuals, their mean (MSE), and its square root (RMSE)
val residuals = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }
val MSE = residuals.mean()
val RMSE = math.sqrt(MSE)

Besides the preceding code, MLlib also provides the RegressionMetrics and RankingMetrics classes, whose methods we can use for the RMSE calculation.
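For illustration, the headline values that RegressionMetrics reports (MSE, RMSE, MAE, and R²) can be sketched in plain Python as follows; the function name and sample data here are hypothetical, not part of the MLlib API:

```python
import math

def regression_metrics(preds_and_labels):
    """Compute the main error measures for (prediction, label) pairs,
    analogous to what MLlib's RegressionMetrics reports."""
    preds = [p for p, _ in preds_and_labels]
    labels = [l for _, l in preds_and_labels]
    n = len(labels)
    errors = [p - l for p, l in zip(preds, labels)]
    mse = sum(e * e for e in errors) / n          # mean squared error
    mae = sum(abs(e) for e in errors) / n         # mean absolute error
    mean_label = sum(labels) / n
    ss_tot = sum((l - mean_label) ** 2 for l in labels)
    ss_err = sum(e * e for e in errors)
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "MAE": mae,
        "r2": 1 - ss_err / ss_tot,                # coefficient of determination
    }
```

In Spark itself, the equivalent values come from a RegressionMetrics instance built over an RDD of prediction/observation pairs.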

RMSE calculations with R

In R, the forecast package provides an accuracy function that can be used to calculate forecasting accuracy measures, including RMSEs:

accuracy(f, x, test=NULL, d=NULL, D=NULL)

The measures calculated also include the following:

  • ME (Mean Error)
  • RMSE (Root Mean Squared Error)
  • MAE (Mean Absolute Error)
  • MPE (Mean Percentage Error)
  • MAPE (Mean Absolute Percentage Error)
  • MASE (Mean Absolute Scaled Error)
  • ACF1 (Autocorrelation of errors at lag 1)
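To make the first few of these measures concrete, here is a plain-Python sketch (not the forecast package itself) computing ME, RMSE, MAE, MPE, and MAPE, using the convention error = actual − forecast; the function name and sample values are illustrative only:

```python
import math

def accuracy_measures(forecasts, actuals):
    """Point-forecast error measures analogous to the first five
    columns reported by R's forecast::accuracy."""
    errors = [a - f for f, a in zip(forecasts, actuals)]
    n = len(errors)
    return {
        "ME": sum(errors) / n,
        "RMSE": math.sqrt(sum(e * e for e in errors) / n),
        "MAE": sum(abs(e) for e in errors) / n,
        "MPE": 100 * sum(e / a for e, a in zip(errors, actuals)) / n,
        "MAPE": 100 * sum(abs(e / a) for e, a in zip(errors, actuals)) / n,
    }
```

MASE and ACF1 need extra context (a scaling series and the ordered residuals, respectively), so they are omitted from this sketch.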

To perform a complete evaluation, we calculated RMSEs for all the models we estimated, then compared them and picked the ones with the smaller RMSEs.
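That final comparison step can be sketched as follows, with purely illustrative model names and RMSE values:

```python
# Hypothetical RMSEs for the candidate regression models
model_rmse = {"linear": 0.84, "ridge": 0.79, "lasso": 0.81}

# Pick the model whose RMSE is smallest
best_model = min(model_rmse, key=model_rmse.get)
print(best_model)  # → ridge
```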
