Model estimation

Once the feature sets were finalized in the last section, the next step is to estimate the parameters of the selected models. For this project, we adopted the approach of using MLlib in Zeppelin notebooks for the regression models and R notebooks in the Databricks environment for the time series models.

As before, we need to arrange distributed computing for the best modeling results, especially because this case involves many kinds of services. That is, we will estimate a separate model to predict the daily volume of each kind of service request: heating, construction-related, noise-related, parking-related, and other service requests.

To estimate models for the various service types, we first need to group all the services into a set of service types. For this exercise, we selected the top 50 service types and then estimated the models for all 50 of them in parallel, as sketched in the following code.
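As a minimal sketch of this parallel estimation, assume the prepared data has been keyed by service type into a map named dataByService (an illustrative name, not from the project code); the driver can then submit one training job per service type concurrently, using the MLlib linear regression estimator described later in this section:

import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionModel, LinearRegressionWithSGD}
import org.apache.spark.rdd.RDD

// dataByService is an assumed helper built during data and feature
// preparation: one RDD of labeled daily volumes per service type
def estimateAll(
    dataByService: Map[String, RDD[LabeledPoint]]): Map[String, LinearRegressionModel] =
  // A Scala parallel collection lets the driver submit the training jobs
  // for all 50 service types concurrently; Spark schedules them on the cluster
  dataByService.keys.toSeq.par.map { service =>
    service -> LinearRegressionWithSGD.train(dataByService(service), 90)
  }.seq.toMap

This pattern works because Spark's scheduler is thread-safe: each call to train launches its own set of jobs, and the parallel collection simply overlaps them.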

For this distributed computing part, readers may refer to the model estimation parts of previous chapters, as we will not repeat the details here. Overall, as discussed in the Methods of service forecasting section of this chapter, we will focus on two methods: regression and time series modeling. From what we have learned so far, we can complete the regression model estimation in Zeppelin notebooks on Spark using MLlib algorithms. As for time series modeling, it is better to use R, which we can do in the Databricks or IBM Data Scientist Workbench environment; for this, readers may refer to Chapter 3, A Holistic View on Spark, and Chapter 5, Risk Scoring on Spark.

Spark implementation with the Zeppelin notebook

As discussed, to use regression to predict the daily volume of service requests, we have the following features:

  • Location-related features, such as employment ratios
  • Weather features
  • Event-related features

In the Data and feature preparation section, we prepared the data; now we need to divide it into a training set and a test set, so that we can use the training set for model estimation and the test set for evaluation. A minimal sketch of this split follows.
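Assuming the prepared features have already been assembled into an RDD of LabeledPoint named parsedData (an illustrative name, not from the project code), we can split the data with randomSplit:

// parsedData is an assumed RDD[LabeledPoint] holding the prepared features;
// hold out 20% of the data for testing, with a fixed seed for reproducibility
val Array(trainingData, testData) =
  parsedData.randomSplit(Array(0.8, 0.2), seed = 12345L)
trainingData.cache() // the training set is reused across the SGD iterations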

In MLlib, for linear regression, we will use the following code:

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val numIterations = 90
val model = LinearRegressionWithSGD.train(trainingData, numIterations)
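To check the fit, we can then score the test set and compute the mean squared error. This evaluation snippet follows the standard MLlib pattern and assumes the testData split created earlier:

// Score the held-out test set and compare predictions with actual volumes
val valuesAndPreds = testData.map { point =>
  (point.label, model.predict(point.features))
}
val mse = valuesAndPreds
  .map { case (actual, predicted) => math.pow(actual - predicted, 2) }
  .mean()
println(s"Test MSE = $mse")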

For logistic regression, we will use the following code (note that setNumClasses is provided by LogisticRegressionWithLBFGS; the SGD-based variant supports binary labels only, so we use the LBFGS-based optimizer here):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(trainingData)
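As with the linear model, a quick evaluation sketch on the test set can measure simple classification accuracy, assuming the labels have been binarized (for example, high-volume versus low-volume days):

// Fraction of test points whose predicted class matches the actual label
val predictionAndLabels = testData.map { point =>
  (model.predict(point.features), point.label)
}
val accuracy = predictionAndLabels
  .filter { case (predicted, label) => predicted == label }
  .count.toDouble / testData.count
println(s"Test accuracy = $accuracy")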

We need to input the preceding code into the Zeppelin notebook as follows:

[Screenshot: Spark implementation with the Zeppelin notebook]

Then we can press Shift + Enter to run the commands to obtain results, as shown in the following screenshot:

[Screenshot: Spark implementation with the Zeppelin notebook]

Spark implementation with the R notebook

For time series modeling, as discussed, we will use R notebooks within the Databricks environment, similar to what we did in Chapter 3, A Holistic View on Spark.

To do so, we can use the Databricks Jobs feature. Specifically, within the Databricks environment, we can go to Jobs and create a job, as shown in the following screenshot:

[Screenshot: Spark implementation with the R notebook]

Then users can select the notebook to run, specify a cluster, and schedule the job. Once the job is scheduled, users can monitor its runs and then collect the results.
