Chapter 9. City Analytics on Spark

Following the strategy adopted in Chapter 8, Learning Analytics on Spark, in this chapter, we will further extend our Spark machine learning application to smart city analytics by applying machine learning to open city data. In other words, we will extend the benefits of machine learning on Spark to serving city governments.

Specifically, in this chapter, we will first review the machine learning methods and related computing for a service request forecasting project, and then discuss how Apache Spark makes them easy. At the same time, using this real-life service forecasting example, we will illustrate, step by step, our machine learning process of predicting service requests with big data.

Here, we use the service forecasting project to illustrate our technologies and processes. That is, what is described in this chapter is not limited to service request forecasting but can easily be applied to other city analytics projects, such as water usage analytics. In fact, these techniques can be applied to many kinds of machine learning on many kinds of open data, including open data provided by universities and federal agencies, such as that from the well-known LIGO project for gravitational wave detection and studies (for more information, refer to http://www.ligo.org/ and http://www.researchmethods.org/AlexLiu_CalTech_Jan21.pdf).

In this chapter, we will cover the following topics:

  • Spark for service forecasting
    • Easy computing with Spark
  • Methods of service forecasting
    • Regression and time series
  • Data and feature preparation
    • Data merging and feature selection
  • Model estimation
  • Model evaluation
    • RMSE
    • Results explanation
    • Significant features and trends
  • Model deployment
    • Rules and scoring

Spark for service forecasting

In this section, we will describe a real use case of predicting service requests in detail and then describe how to prepare Apache Spark computing for this real-life project.

The use case

In the United States and worldwide, more and more cities have made their collected data open to the public. As a result, city governments and many other organizations have applied machine learning to these open datasets, gaining insights that improve decision making and create a lot of positive impact, for example, in New York and Chicago. Using large amounts of open data is becoming a trend; for example, using big data to measure cities is becoming a research trend, as we can see from http://files.meetup.com/11744342/CITY_RANKING_Oct7.pdf.

Using data analytics for cities has a wide impact, as more than half of us now live in urban centers, and the percentage is still increasing. Therefore, what you learn in this chapter will enable data scientists to create a huge positive impact.

Among all the open city datasets, 311 datasets record service requests from citizens to city governments and are publicly available on the open data portals of cities such as New York, Los Angeles, Houston, and San Francisco.

These datasets are so rich in detail that many cities are interested in using them to forecast future requests and measure service effectiveness. One of our collaborators was tasked with using this data, in combination with other datasets, to predict service needs for several cities, including Los Angeles and Houston, so that these cities can better allocate their resources.

Through some preliminary data analysis, the research group identified the following data challenges:

  • Data quality is not as good as expected; for example, there are a lot of missing cases
  • Data accuracy is another issue to deal with
  • Data exists in different silos that need to be merged together

To deal with these challenges, for this real project, we utilized some of the techniques presented in Chapter 2, Data Preparation for Spark ML, to merge all the datasets together, and we used our Apache Spark technology to treat missing cases and create a clean dataset for each city.
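
As a minimal sketch of this kind of preprocessing, the following Scala code merges two hypothetical data silos and treats missing cases with the Spark DataFrame API. The file names, column names, and the Spark 2.x-style SparkSession are all assumptions for illustration, not the project's actual schemas:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("CityDataPrep").getOrCreate()

// Two hypothetical silos of city data for one city
val requests = spark.read.option("header", "true").csv("city_311_requests.csv")
val agencies = spark.read.option("header", "true").csv("city_agencies.csv")

// Merge the silos on a shared key, then treat missing cases
val merged = requests.join(agencies, Seq("agency_id"), "left")
val cleaned = merged
  .na.drop(Seq("created_date", "request_type"))  // drop rows missing key fields
  .na.fill("UNKNOWN", Seq("agency_name"))        // fill non-critical gaps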

In summary, the following table gives a brief description of these preprocessed datasets:

| City        | Time Period              | # Requests | % Closed | Top Agency |
|-------------|--------------------------|------------|----------|------------|
| NYC         | Sep 2012 ~ Jan 2014      | 2,138,736  | 75.3%    | HPD        |
| SFO         | July 2008 ~ Jan 2014     | 910,573    | 95.3%    | DPW        |
| Los Angeles | Jan 2011 ~ June 30, 2014 | 2,713,630  | ?        | LADBS      |
| Houston     | 2012                     | 296,019    | 98.2%    | PWE        |

As we can see from the preceding table, we only used a part of the available open data, mainly for reasons of data completeness. In other words, we only used data from the periods when we had enough datasets to merge and when the quality of the service request data was reasonable, according to our initial preprocessing.

Spark computing

To set up Apache Spark for this service request forecasting project, we will adopt a strategy similar to the one we used in Chapter 8, Learning Analytics on Spark, for the purpose of reinforcing our learning. In other words, for the Apache Spark computing part, readers may use this chapter to review what was learned in Chapter 8, Learning Analytics on Spark, as well as in Chapters 1 through 7.

As discussed in the Spark computing section of Chapter 8, Learning Analytics on Spark, you may choose one of the following approaches for our kind of projects:

  • Spark on the Databricks platform
  • Spark on IBM Data Scientist Workbench
  • SPSS on Spark
  • Apache Spark with MLlib alone

You learned all the details of utilizing these approaches in the previous chapters, that is, from Chapter 3, A Holistic View on Spark, to Chapter 7, Recommendations on Spark.

Any one of the four approaches mentioned earlier should work well for this city analytics project. Specifically, you may take the code developed in this chapter, put it into a separate notebook, and then run the notebook with whichever approach you choose.

With the strategy described here, similar to what we did in Chapter 8, Learning Analytics on Spark, we will touch on using one of the four approaches in the following section; however, we will spend more effort on utilizing a Zeppelin notebook, so that users can learn more about the Zeppelin approach and review the technologies described in Chapter 8, Learning Analytics on Spark.

To work with the Zeppelin notebook approach, similarly to Chapter 8, Learning Analytics on Spark, we will start with the following page:

(Screenshot: the Zeppelin start page)

Users can click on Create new note, the first item under Notebook in the left-hand column, to start organizing code into a notebook.

Then, a box will open for users to type in the notebook's name; after entering the name, users can click on Create Note to create a new notebook:

(Screenshot: the Create new note dialog in Zeppelin)

Methods of service forecasting

In the previous section, we described our use case of using open data to forecast service requests and prepared our Spark computing platform, with Zeppelin notebooks as the focus. Following our 4E framework, the next step in machine learning is to map our use case to machine learning methods; that is, we need to select the analytical methods or predictive models (equations) for this project of predicting service requests with big data on Spark.

To model and predict service requests, there are many suitable models, including regression, decision trees, and time series models. For this exercise, we will use both regression and time series modeling, as time is a significant dimension of our data, and then use evaluation to determine which of them, or which combination of them, is best. However, as regression has already been utilized many times in the previous chapters and time series modeling may still be new to some of our readers, we will spend more time describing and discussing the time series modeling methods.

For our clients on this project, some branches of the city government and civic organizations, the main concern is whether the number of service requests will exceed certain levels, because problems follow if it does. For this kind of problem, decision trees and random forests are the right methods. However, as an exercise for learning, our focus here will still be on regression and time series modeling, because decision trees and random forests were covered many times in the previous chapters. From this discussion of method selection, you can see that we often need to employ several modeling methods in order to meet clients' needs and achieve the best results.

As always, once we finalize our decision for analytical methods or models, we need to prepare the related dependent variable and also prepare for coding.

Regression models

By now, you know that regression is among the most commonly used methods of prediction; we have utilized it for various projects in earlier chapters.

About regression

As we discussed, two kinds of regression modeling are suitable for this project: linear regression and logistic regression. Linear regression can be used when we take the daily service request volume as our target variable, while logistic regression can be used if we want to predict whether or not a certain type of service is requested at a certain location during a certain time period.
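
To make the two targets concrete, here is a hedged sketch of deriving each from the cleaned DataFrame of the earlier preprocessing sketch (the column names created_date, request_type, and zip_code remain hypothetical):

import org.apache.spark.sql.functions._

// Linear regression target: daily service request volume
val dailyVolume = cleaned
  .groupBy(to_date(col("created_date")).as("day"))
  .count()

// Logistic regression target: was a parking service requested
// in a given ZIP code on a given day? (1 = yes, 0 = no)
val parkingRequested = cleaned
  .groupBy(to_date(col("created_date")).as("day"), col("zip_code"))
  .agg(max(when(col("request_type") === "Parking", 1).otherwise(0)).as("label"))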

Preparing for coding

For your convenience, in MLlib we can use code such as the following for linear regression (assuming trainingData is an RDD[LabeledPoint] prepared from our features):

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val numIterations = 90
val model = LinearRegressionWithSGD.train(trainingData, numIterations)
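
As a quick, hedged usage sketch, the fitted model can then be scored on held-out data (testData below is a hypothetical RDD[LabeledPoint] split off before training) to compute the RMSE used later in model evaluation:

// Score held-out data and compute RMSE
val predsAndLabels = testData.map(p => (model.predict(p.features), p.label))
val rmse = math.sqrt(predsAndLabels.map { case (p, l) => (p - l) * (p - l) }.mean())
println(s"RMSE = $rmse")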

For logistic regression, we can use the following code (note that setNumClasses is provided by LogisticRegressionWithLBFGS; the SGD variant does not offer it):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(trainingData)
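
Again as a hedged sketch, the fitted binary model can be checked on held-out data; clearThreshold() is the standard MLlib call that makes predict return raw scores, which the evaluation class below expects:

import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics

// Return raw scores instead of 0/1 labels, then measure ranking quality
model.clearThreshold()
val scoreAndLabels = testData.map(p => (model.predict(p.features), p.label))
val metrics = new BinaryClassificationMetrics(scoreAndLabels)
println("Area under ROC = " + metrics.areaUnderROC())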

Time series modeling

Our data for this project is of a time series nature. Generally speaking, a time series is a sequence of data points that consists of the following:

  • Successive measurements made over a time interval
  • A continuous time interval

The distance between any two consecutive data points in this time interval is the same. For example, we have parking service requests made on a daily basis, so we have data with the following pattern:

| Day 1 | Day 2 | Day 3 | Day 4 | Day 5 | Day 6 | Day 7 | Day 8 | ... |
|-------|-------|-------|-------|-------|-------|-------|-------|-----|
| 20 requests | 31 requests | 19 requests | 35 requests | 22 requests | 39 requests | 13 requests | 28 requests | ... |
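
To make this concrete, the eight daily counts in the preceding table can be collected into a single ordered sequence, which is the form the time series models discussed below consume (a toy sketch):

// The daily parking-request counts, one value per fixed interval (a day)
val dailyRequests = Array(20.0, 31.0, 19.0, 35.0, 22.0, 39.0, 13.0, 28.0)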

About time series

There are many models specially created to model time series data, such as the ARIMA model, for which algorithms are readily available in R or SPSS.

There are also many introductory materials available that discuss using R for time series modeling; some of them can be found at http://www.stats.uwo.ca/faculty/aim/tsar/tsar.pdf and http://www.statoek.wiso.uni-goettingen.de/veranstaltungen/zeitreihen/sommer03/ts_r_intro.pdf.

For our time series data of daily service requests, such as the SFO data from 2008 to 2014, we plan to use two models: the autoregressive moving average (ARMA) model and the autoregressive integrated moving average (ARIMA) model. Here, the ARMA model provides a parsimonious description of a (weakly) stationary stochastic process in terms of two polynomials: one for the autoregression and the other for the moving average. The ARIMA model is a generalization of the ARMA model.

Both the ARMA and ARIMA models can provide a good forecast of future service requests. Whether the ARMA or ARIMA model is better will depend on our model evaluation using RMSE.
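
For reference, over $n$ held-out days, RMSE compares the forecasts $\hat{y}_t$ with the observed request counts $y_t$:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{t=1}^{n}\left(\hat{y}_t - y_t\right)^2}$$

The model with the lower RMSE produces the better forecasts.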

Preparing for coding

R has many packages for time series modeling, such as the timeSeries package; in addition, base R's stats package provides the ts class and the arima function.

When estimating an ARIMA model, we use the arima function with code such as the following:

fit1 <- arima(data1, order = c(1, 0, 1))

Here, c(1, 0, 1) specifies the (p, d, q) orders of the ARIMA model: an autoregressive order of 1, a differencing order of 0, and a moving average order of 1. With a differencing order of 0, this is effectively an ARMA(1, 1) model.

As for MLlib, the algorithms for time series modeling are still in development. However, some libraries are being developed to facilitate time series modeling on Spark, such as the spark-ts library developed by Cloudera.

This library allows users to preprocess time series data, build some simple models, and evaluate them, with code developed in Scala. However, as it is still in development, it is far behind what R can provide.

For an example of using the spark-ts library for time series data modeling, go to http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-for-analyzing-time-series-data-with-apache-spark/.
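
The following Scala sketch is modeled on the example in that blog post; since the library was still evolving, treat the class and method names (ARIMA.fitModel and forecast from the com.cloudera.sparkts.models package) as current only as of that post, and note that the eight toy values reused from the earlier table are far too few for a reliable fit:

import com.cloudera.sparkts.models.ARIMA
import org.apache.spark.mllib.linalg.Vectors

// Daily request counts as a dense vector (toy values from the earlier table)
val ts = Vectors.dense(Array(20.0, 31.0, 19.0, 35.0, 22.0, 39.0, 13.0, 28.0))

// Fit an ARIMA(1, 0, 1) model, mirroring the R arima() call shown earlier
val arimaModel = ARIMA.fitModel(1, 0, 1, ts)
println("coefficients: " + arimaModel.coefficients.mkString(","))

// Forecast the next 7 days of requests
val forecast = arimaModel.forecast(ts, 7)
println("next 7 days: " + forecast.toArray.mkString(","))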
