Following the strategy adopted in Chapter 8, Learning Analytics on Spark, in this chapter we will further extend our Spark machine learning application to smart city analytics by applying machine learning to open data for city analytics. In other words, we will extend the benefits of machine learning on Spark to serving city governments.
Specifically, in this chapter we will first review the machine learning methods and related computing for a service request forecasting project, and will then discuss how Apache Spark comes in to make them easy. At the same time, with this real-life service forecasting example, we will illustrate, step by step, our machine learning process of predicting service requests with big data.
Here, we will use the service forecasting project for the purpose of illustrating our technologies and processes. That is, what is described in this chapter is not limited to service request forecasting but can easily be applied to other city analytics projects, such as water usage analytics. In fact, these techniques can be applied to many kinds of machine learning on many kinds of open data, including open data provided by universities and federal agencies, such as the data from the well-known LIGO project for gravitational wave detection and studies (for more information on this, refer to http://www.ligo.org/ and http://www.researchmethods.org/AlexLiu_CalTech_Jan21.pdf).
In this chapter, we will cover the following topics:
In this section, we will describe a real use case of predicting service requests in detail and then describe how to prepare Apache Spark computing for this real-life project.
In the United States, and worldwide, more and more cities have made their collected data open to the public. As a result, city governments and many other organizations have applied machine learning to these open datasets, gaining insights that improve decision making and produce a lot of positive impact, for example, in New York and Chicago. Using large amounts of open data is now becoming a trend. For example, using big data to measure cities is becoming a research trend, as we can see from http://files.meetup.com/11744342/CITY_RANKING_Oct7.pdf.
Using data analytics for cities has a wide impact, as more than half of us now live in urban centers, and the percentage is still increasing. Therefore, what you learn in this chapter will enable you, as a data scientist, to create a huge positive impact.
Among all the open city datasets, 311 datasets are about service requests from citizens to the city government and are all publicly available, as listed on the following websites for the cities of New York, Los Angeles, Houston, and San Francisco:
These datasets are so rich in detail that many cities are interested in using them to forecast future requests and measure effectiveness. One of our collaborators was tasked with using this data, in combination with other datasets, to predict service needs for several cities, including Los Angeles and Houston, so that these cities can better allocate their resources accordingly.
Through some preliminary data analysis, the research group identified the following data challenges:
To deal with the challenges mentioned earlier, for this real project we utilized some of the techniques presented in Chapter 2, Data Preparation for Spark ML, to merge all the datasets together, and also used our Apache Spark technology to treat missing cases, creating a clean dataset for each city.
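To give a rough sense of the kind of missing-case treatment involved, here is a minimal sketch in plain Scala (the real project ran this kind of logic on Spark DataFrames; the record fields and values here are hypothetical, simplified stand-ins for the 311 schema):

```scala
// Hypothetical, simplified 311 record: the closed date may be missing
case class Request(id: Long, agency: String, closedDate: Option[String])

object CleanupSketch {
  // Drop records with no agency, and flag whether each request was closed;
  // on Spark DataFrames, similar logic would use na.drop/na.fill
  def clean(requests: Seq[Request]): Seq[(Long, String, Boolean)] =
    requests
      .filter(_.agency.nonEmpty)
      .map(r => (r.id, r.agency, r.closedDate.isDefined))

  def main(args: Array[String]): Unit = {
    val raw = Seq(
      Request(1L, "HPD", Some("2013-05-01")),
      Request(2L, "", Some("2013-05-02")),   // missing agency: dropped
      Request(3L, "DPW", None)               // still open: kept, flagged false
    )
    println(clean(raw)) // keeps requests 1 and 3
  }
}
```

The same filter-then-transform pattern carries over directly to RDDs and DataFrames once the data is loaded into Spark.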
As a summary, the following table gives a brief description of these preprocessed datasets:
| City | Time Period | # Requests | % Closed | Top Agency |
|---|---|---|---|---|
| NYC | Sep 2012 ~ Jan 2014 | 2,138,736 | 75.3% | HPD |
| SFO | July 2008 ~ Jan 2014 | 910,573 | 95.3% | DPW |
| Los Angeles | Jan 2011 ~ June 30, 2014 | 2,713,630 | ? | LADBS |
| Houston | 2012 | 296,019 | 98.2% | PWE |
As we can note from the preceding table, we used only a part of the available open data, mainly for reasons of data completeness. In other words, we used only data from the periods for which we had enough datasets to merge and for which the quality of the service request data was reasonable, as determined by our initial data preprocessing.
To set up Apache Spark for this service request forecasting project, we will adopt a strategy similar to the one we used in Chapter 8, Learning Analytics on Spark, in order to reinforce our learning. In other words, for the Apache Spark computing part, readers may use this chapter to review what was learned in Chapter 8, Learning Analytics on Spark, as well as what was learned in Chapters 1 through 7.
As discussed in the Spark computing section of Chapter 8, Learning Analytics on Spark, you may choose one of the following approaches for this kind of project:
You learned all the details of utilizing them in the previous chapters—that is, from Chapter 3, A Holistic View on Spark to Chapter 7, Recommendations on Spark.
Any one of the four approaches mentioned earlier should work very well for this city analytics project, as described. Specifically, you may take the code developed in this chapter, put it into a separate notebook, and then implement the notebook with any of the approaches mentioned previously.
With the strategy described here, similar to what we did in Chapter 8, Learning Analytics on Spark, we will touch on using one of the four approaches in the following section; however, we will spend more effort on utilizing a Zeppelin notebook, so that users can learn more about the Zeppelin approach and review the technologies described in Chapter 8, Learning Analytics on Spark.
To work with the Zeppelin notebook approach, similarly to Chapter 8, Learning Analytics on Spark, we will start with the following page:
Users can click on Create new note, which is the first line under Notebook on the left-hand side column, to start organizing code into the notebook.
Then, a box will open to allow users to type in the notebook's name; after doing so, users can click on Create Note to create a new notebook:
In the previous section, we described our use case of using open data to forecast service requests and also prepared our Spark computing platform with Zeppelin notebooks as the focus. In our following 4E framework, as the next step in machine learning, we need to complete the task of mapping our use case to machine learning methods; that is, we need to select our analytical methods or predictive models (equations) for this project of predicting service requests with Big Data on Spark.
To model and predict service requests, there are many suitable models, including regression, decision tree, and time series models. For this exercise, we will use both regression and time series modeling, as time is a significant component of our data; we will then use evaluation to determine which one of them, or which combination, is best. However, as regression has already been utilized many times in the previous chapters and time series modeling may still be new to some of our readers, we will spend more time describing and discussing the time series modeling methods.
For our clients on this project (some branches of city government and civic organizations), the only concern is whether the number of service requests will exceed certain levels, because problems will follow if it does. For this problem, decision tree and random forest are the right methods. However, as an exercise for learning, our focus here will still be on regression and time series modeling, because decision tree and random forest have been covered many times in the previous chapters. From this discussion of method selection, you will understand that we often need to employ several modeling methods in order to meet clients' needs as well as to achieve the best results.
As always, once we finalize our decision for analytical methods or models, we need to prepare the related dependent variable and also prepare for coding.
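For the linear regression and time series models, the dependent variable is simply the daily request count, which can be derived from the raw timestamped records. A minimal plain-Scala sketch of this aggregation (the date strings are illustrative; in practice this step runs on Spark):

```scala
object TargetPrep {
  // Each raw 311 record carries a creation date; the dependent variable
  // is the number of requests per day, ordered by date
  def dailyCounts(createdDates: Seq[String]): Seq[(String, Int)] =
    createdDates
      .groupBy(identity)
      .map { case (day, recs) => (day, recs.size) }
      .toSeq
      .sortBy(_._1)

  def main(args: Array[String]): Unit = {
    val dates = Seq("2013-05-01", "2013-05-01", "2013-05-02")
    println(dailyCounts(dates)) // two requests on May 1, one on May 2
  }
}
```

For the logistic regression variant, the same counts can then be thresholded into a binary label, for example, 1 when a day's count exceeds a service level of interest and 0 otherwise.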
By now, you know that regression is among the most commonly used methods of prediction, and it has been utilized for various projects throughout this book.
As we discussed, there are two kinds of regression modeling that are suitable for various kinds of predictions: one is linear regression and the other is logistic regression. For this project, linear regression can be used when we take daily service request volume as our target variable, while logistic regression can be used if we want to predict whether or not a certain type of service is requested at a certain location during a certain time period.
For your convenience, in MLlib we have the following code to be used for linear regression:

```scala
import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val numIterations = 90
val model = LinearRegressionWithSGD.train(TrainingData, numIterations)
```
For logistic regression, we can use the following code (note that in MLlib, the setNumClasses method belongs to LogisticRegressionWithLBFGS rather than LogisticRegressionWithSGD):

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val model = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)
  .run(TrainingData)
```
Our data for this project is of a time series nature. Generally speaking, a time series is a sequence of data points measured at successive, equally spaced points in time; that is, the time interval between any two consecutive data points is the same. For example, when parking service requests are counted on a daily basis, we have data with the following pattern:
| Day 1 | Day 2 | Day 3 | Day 4 | Day 5 | Day 6 | Day 7 | Day 8 | …… |
|---|---|---|---|---|---|---|---|---|
| 20 requests | 31 requests | 19 requests | 35 requests | 22 requests | 39 requests | 13 requests | 28 requests | …… |
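With a daily series like this in hand, a natural first check before fitting any time series model is the sample autocorrelation of the counts (at lag 1, or at lag 7 to look for weekly patterns). A minimal plain-Scala sketch, applied to the illustrative counts above:

```scala
object AcfSketch {
  // Sample autocorrelation of a series at a given lag:
  // sum of lagged products of deviations over the total sum of squares
  def acf(xs: Seq[Double], lag: Int): Double = {
    val mean = xs.sum / xs.size
    val dev = xs.map(_ - mean)
    val num = dev.drop(lag).zip(dev.dropRight(lag)).map { case (a, b) => a * b }.sum
    val den = dev.map(d => d * d).sum
    num / den
  }

  def main(args: Array[String]): Unit = {
    val counts = Seq(20.0, 31.0, 19.0, 35.0, 22.0, 39.0, 13.0, 28.0)
    // The alternating low/high pattern shows up as a negative lag-1 autocorrelation
    println(acf(counts, 1))
  }
}
```

Structure like this in the autocorrelations is exactly what the AR and MA terms of the models discussed next are designed to capture.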
There are many models specially created to model time series data, such as the ARIMA model, for which algorithms are readily available in R or SPSS.
There are also many introductory materials available on using R for time series modeling; for example, see http://www.stats.uwo.ca/faculty/aim/tsar/tsar.pdf or http://www.statoek.wiso.uni-goettingen.de/veranstaltungen/zeitreihen/sommer03/ts_r_intro.pdf.
For our time series data of daily service requests, such as the SFO data from 2008 to 2014, we plan to use two models: the autoregressive moving average (ARMA) and the autoregressive integrated moving average (ARIMA) models. Here, the ARMA model provides a parsimonious description of a (weakly) stationary stochastic process in terms of two polynomials: one for the autoregression and one for the moving average. The ARIMA model is a generalization of the ARMA model, obtained by applying an initial differencing step to remove non-stationarity.
Both the ARMA and ARIMA models can provide a good forecast of future service requests. Whether the ARMA or ARIMA model is better will depend on our model evaluation using RMSE.
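As a reminder of how this evaluation works, the root mean squared error (RMSE) compares forecasts against actual counts; a minimal plain-Scala sketch with illustrative numbers:

```scala
object RmseSketch {
  // Root mean squared error between actual and predicted values
  def rmse(actual: Seq[Double], predicted: Seq[Double]): Double = {
    require(actual.size == predicted.size, "series must be the same length")
    val mse = actual.zip(predicted)
      .map { case (a, p) => (a - p) * (a - p) }
      .sum / actual.size
    math.sqrt(mse)
  }

  def main(args: Array[String]): Unit = {
    val actual    = Seq(20.0, 31.0, 19.0, 35.0)
    val predicted = Seq(22.0, 30.0, 21.0, 33.0)
    println(rmse(actual, predicted)) // about 1.80 for these illustrative numbers
  }
}
```

Whichever of the two models produces the lower RMSE on held-out data is the one we would prefer for forecasting.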
R has many packages for time series modeling, such as the timeSeries or ts package.
When estimating the ARIMA model, we need to use the arima function with code such as the following:

```r
fit1 <- arima(data1, order = c(1, 0, 1))
```

Here, we used c(1, 0, 1) to specify the order (p, d, q) of the ARIMA model, that is, one autoregressive term, no differencing, and one moving average term.
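The integration step that distinguishes ARIMA from ARMA is just differencing of the series. A minimal plain-Scala sketch of first-order differencing, which is what a d value of 1 would apply before the ARMA fit:

```scala
object DiffSketch {
  // First-order differencing: x'(t) = x(t) - x(t - 1),
  // used to remove trend and help make a series stationary
  def diff(xs: Seq[Double]): Seq[Double] =
    xs.drop(1).zip(xs.dropRight(1)).map { case (cur, prev) => cur - prev }

  def main(args: Array[String]): Unit = {
    // A series with a linear trend differences to a constant series
    val trending = Seq(10.0, 13.0, 16.0, 19.0)
    println(diff(trending)) // List(3.0, 3.0, 3.0)
  }
}
```

Setting d = 0, as in the R call above, skips this step entirely and reduces the ARIMA model to a plain ARMA model.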
As for MLlib, algorithms for time series modeling are still in development. However, some libraries are being developed to facilitate time series modeling on Spark, such as the spark-ts library developed by Cloudera. This library allows users to preprocess data, build some simple models, and evaluate them, with the code developed in Scala. However, as it is still in development, it remains far behind what R can provide.
For an example of using the spark-ts library for time series data modeling, go to http://blog.cloudera.com/blog/2015/12/spark-ts-a-new-library-for-analyzing-time-series-data-with-apache-spark/.
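Based on that Cloudera blog post, fitting and forecasting with spark-ts looks roughly like the following (a sketch, not a definitive implementation; it assumes the spark-ts dependency is on the classpath, and the daily counts shown are illustrative values):

```scala
import org.apache.spark.mllib.linalg.Vectors
import com.cloudera.sparkts.models.ARIMA

// Daily service request counts as a dense vector (illustrative values)
val dailyCounts: Array[Double] = Array(20, 31, 19, 35, 22, 39, 13, 28)
val ts = Vectors.dense(dailyCounts)

// Fit an ARIMA(1, 0, 1) model, that is, an ARMA(1, 1) model with no differencing
val arimaModel = ARIMA.fitModel(1, 0, 1, ts)

// Forecast the next 7 days of service requests
val forecast = arimaModel.forecast(ts, 7)
```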