Chapter 11. Modeling Open Data on Spark

Following what we did in Chapter 10, Learning Telco Data on Spark, in this chapter we will extend our Apache Spark machine learning to a project of learning from open data. In Chapter 9, City Analytics on Spark, we already applied machine learning to open data, where we built models to predict service requests. Here, we will move up to a new level: we will explore machine learning approaches for turning more open data into useful insights, and build models to score school districts or schools on academic achievement, technology, and other dimensions. After that, we will build predictive models to explain what impacts the ranking and scoring of these districts.

To follow the structure established in earlier chapters, we will first review machine learning methods and the related computing for this real-life project of learning from open data, and then set up Apache Spark computing. With our real-life examples, we will further illustrate our step-by-step machine learning process with big data. Beyond this, we will demonstrate the benefits of the dynamic approach taken in the last chapter, Chapter 10, Learning Telco Data on Spark, which allows us to generate results quickly and then adjust course to go deeper into machine learning and generate even more insights. In other words, since you should by now have a much better knowledge of Spark computing and the related tools, including R and SPSS, we will jump around the 4Es as needed, and we will not limit ourselves to one project, one model, or one fixed process. For this chapter especially, we will work as needed to discover insights, score districts, and then build predictive models of the newly developed scores so that we can solve clients' problems better.

Here, we aim to illustrate our technologies and processes using these real-life projects of learning from open data. However, what is described in this chapter is not limited to district scoring and ranking; it can easily be applied to other scoring and ranking projects, such as scoring and ranking corporations or countries, and more generally to many kinds of machine learning on many kinds of open datasets. In this chapter, we will cover the following topics:

  • Spark for learning from open data
  • Methods for scoring and ranking
  • Data and feature preparation
  • Model estimation
  • Model evaluation
  • Results explanation
  • Model deployment

Spark for learning from open data

In this section, we will describe our real-life use case of learning from open data, and then describe how to prepare Apache Spark computing for our real-life projects.

The use case

As discussed in Chapter 9, City Analytics on Spark, in the United States and worldwide, more and more governments at various levels have made their collected data openly available to the public. As open data analytics has expanded, many governmental and social organizations have used these open datasets to improve services for citizens, with many good results recorded, such as in https://www.socrata.com/video/added-value-open-datas-internal-use-case/. Applying data analytics to cities has a huge impact, as more than half of us now live in urban centers, and this share rises every year.

In particular, using big data to measure communities is favored by researchers and practitioners, as we can see at http://files.meetup.com/11744342/CITY_RANKING_Oct7.pdf. Many cities have policy initiatives to measure communities, or even smaller units such as streets, with good results and data available for public use, such as that from Los Angeles at http://lahub.maps.arcgis.com/apps/MapJournal/index.html?appid=7279dc87ea9e416d9f90bf844505a54a. Using available open data and computing tools just to create some measurements and rankings may be easy. However, creating an accurate and objective ranking of certain properties of communities is not an easy task. Here, we are asked to use available open data, in combination with other datasets such as census data and social media data, to improve rankings of communities, with a focus on school districts or schools.

At the same time, we are also asked to explore the available open data and model it with the machine learning tools available on Spark. In other words, besides developing a good score to measure and rank communities, we are asked to develop special insights from our dynamic machine learning. Once the ranking is ready, we are even asked to explore the rankings with further machine learning, which makes this project truly dynamic, aided by the ease and speed of Spark computing.

However, everything starts from data, as we found out in Chapter 9, City Analytics on Spark. The datasets are not as good as we expected, and they have the following issues for us to deal with:

  • Data quality is not as good as expected. For example, there are a lot of missing cases.
  • Data accuracy is another issue to deal with.
  • Data exists in different silos, which need to be merged together.

Therefore, we still need to perform a large amount of data cleaning and feature preparation. Fortunately, we already have a good process that runs from data to equation, estimation, evaluation, and explanation.
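
As a first illustration of that kind of preparation, the following is a minimal sketch of merging two data silos and dropping incomplete rows with Spark DataFrames. The schools and census DataFrames, the district_id key, and the column names are hypothetical toy examples, not the actual project data:

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

// Toy stand-ins for two open data silos keyed by a shared district identifier
val schools = Seq(("D01", 850), ("D02", 910)).toDF("district_id", "school_score")
val census = Seq(("D01", 62000), ("D03", 48000)).toDF("district_id", "median_income")

// Merge the silos on the shared key, then drop rows with missing values
val merged = schools.join(census, "district_id")
val cleaned = merged.na.drop()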

For this work of learning from open data, as we took a dynamic approach, the research team became interested in educational data and gradually turned its focus to ranking school districts with open data.

With regard to this subject, we found some open data about schools at https://www.ed-data.k12.ca.us/Pages/Home.aspx.

The state government of California also has some open data at http://data.ca.gov/category/by-data-format/data-files/.

Spark computing

As discussed earlier, for example in the Spark computing section of Chapter 8, Learning Analytics on Spark, you may choose one of the following approaches for this kind of project:

  • Spark on the Databricks platform
  • Spark on IBM Data Scientist Workbench
  • SPSS on Spark
  • Apache Spark with MLlib alone

You have already learned in detail how to utilize them one by one in the previous chapters, mainly from Chapter 3, A Holistic View on Spark, to Chapter 7, Recommendations on Spark.

Any one of the preceding four approaches should work well for our projects of learning from open data. Specifically, you may also take the code developed in this chapter, put it into a separate notebook, and then run the notebook with one of the preceding approaches.

As an exercise, and also because it best fits our large amount of open data and our goal of fast ranking, we will mainly work with the fourth approach, which is to utilize Apache Spark with MLlib. However, we will also use R programming a lot for better visualization and reporting, so we will utilize our first approach, Spark on the Databricks platform, as well. At the same time, to take advantage of some good PCA algorithms in SPSS and to easily develop related workflows, we will use SPSS on Spark to practice a special dynamic approach of utilizing Apache Spark. Finally, to meet the need of creating many data-cleaning workflows, we will also use the Data Scientist Workbench platform, with which we can use OpenRefine.

Let's briefly review the approaches mentioned previously to get fully prepared.

As we discussed in the Spark computing for machine learning section of Chapter 1, Spark for Machine Learning, Apache Spark has a unified platform that consists of the Spark core engine and four libraries, which are Spark SQL, Spark Streaming, MLlib, and GraphX:

[Figure: the Apache Spark unified platform, with the core engine, Spark SQL, Spark Streaming, MLlib, and GraphX]

As MLlib is Apache Spark's built-in machine learning library, it is relatively easy to set up and scale for our project of learning from open data.
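
If you take the fourth approach, Apache Spark with MLlib alone, a minimal sketch of setting up a Spark context looks like the following; the application name and master URL are placeholders to adapt to your own cluster, and in notebook environments such as Databricks or Data Scientist Workbench the sc context is already provided:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder application name and master URL for a standalone MLlib setup
val conf = new SparkConf()
  .setAppName("LearningOpenData")
  .setMaster("local[*]")
val sc = new SparkContext(conf)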

To work within the Databricks environment, we need to perform the following steps to set up clusters:

  1. First, you need to go to the main menu and click on Clusters. A window will open up for you to enter a name for the cluster; you can then select a Spark version and specify the number of workers:
    [Screenshot: creating a new cluster in Databricks]
  2. Once clusters have been created, we can go to the main menu, click on the down arrow on the right-hand side of Tables, and then choose Create Tables to import our open datasets that are cleaned and prepared, as shown in the following screenshot:
    [Screenshot: importing datasets through the Tables menu in Databricks]

To utilize IBM Data Scientist Workbench, we need to go to https://datascientistworkbench.com/:

[Screenshot: the IBM Data Scientist Workbench home page]

As shown in the preceding screenshot, Data Scientist Workbench has Apache Spark installed and also has a data cleaning system, OpenRefine, integrated so that our data preparation work can be made easier and more organized:

[Screenshot: OpenRefine integrated within Data Scientist Workbench]

For this project, we will use Data Scientist Workbench mainly for data cleaning, and also a little for R notebook creation and Apache Spark implementation. For this setup, the Apache Spark techniques described in the previous chapters apply.

With regard to SPSS on Spark, we will use IBM SPSS Modeler 17.1 and IBM SPSS Analytics Server 2.1, which have good integration with Apache Spark.

Methods for scoring and ranking

In the previous section, we described our use case of learning from open data, with a focus on using open data to score and rank communities, and also prepared our Spark computing platform with R notebooks, SPSS workflows, and MLlib code. As the next step of our machine learning per our 4E framework, we need to map our use case to machine learning methods, that is, to select our analytical methods or predictive models (equations) for this project of scoring and ranking with open data on Spark.

To turn data into insights, we will need to explore many methods, for which our dynamic approach should work well. Developing scores and rankings is not a difficult task with our analytical tools and fast computing. However, obtaining objective and accurate rankings is not easy. One approach is to ensemble many rankings together, which, per past research, can improve results dramatically. Visit http://www.researchmethods.org/Ranking-indicators and http://www.researchmethods.org/InnovativeAnalysisSociety for more information.

Therefore, for this project, we will take a dynamic approach: we will explore our open data with methods including cluster analysis and principal component analysis, use this knowledge to build a few rankings and scores, ensemble the results to improve those rankings and scores, and finally develop models to explain the impact of various features on them. Because we are taking a dynamic approach, we will jump between these stages to achieve optimal results. As always, once we finalize our choice of analytical methods or models, we will need to prepare the related dependent variables and also prepare for coding.

Cluster analysis

Both Spark MLlib and R have algorithms available for cluster analysis. In Spark MLlib, we can use k-means as follows (the input path below is a placeholder for our cleaned feature file):

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Parse the prepared feature file into an RDD of dense vectors
// (the path is a placeholder for our cleaned open data)
val data = sc.textFile("data/school_features.txt")
val parsedData = data.map(s => Vectors.dense(s.split(',').map(_.toDouble))).cache()

// Cluster the data into two classes using KMeans
val numClusters = 2
val numIterations = 20
val clusters = KMeans.train(parsedData, numClusters, numIterations)

// Evaluate clustering by computing Within Set Sum of Squared Errors
val WSSSE = clusters.computeCost(parsedData)
println("Within Set Sum of Squared Errors = " + WSSSE)

In R, we can use the kmeans function. Here is an example:

# K-means cluster analysis: a five-cluster solution
fit <- kmeans(schooldata, 5)
# Get the cluster means for each cluster
aggregate(schooldata, by = list(fit$cluster), FUN = mean)

For more about cluster analysis with Spark MLlib, go to http://spark.apache.org/docs/latest/mllib-clustering.html.

Principal component analysis

Both Spark MLlib and R have algorithms available for principal component analysis (PCA). In Spark MLlib, we compute the components on a RowMatrix built from our feature vectors:

import org.apache.spark.mllib.linalg.Matrix
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Build a RowMatrix from the RDD of feature vectors (parsedData from the clustering example)
val mat: RowMatrix = new RowMatrix(parsedData)

// Compute the top 10 principal components; they are stored in a local dense matrix
val pc: Matrix = mat.computePrincipalComponents(10)

// Project the rows to the linear space spanned by the top 10 principal components
val projected: RowMatrix = mat.multiply(pc)

In R, we can use the prcomp function from the stats package.

For more on PCA with Spark MLlib, go to http://spark.apache.org/docs/latest/mllib-dimensionality-reduction.html.
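
Because our goal is scoring, one simple way to use the PCA output is to treat each row's projection onto the first principal component as a raw score and rank by it. The following is a minimal sketch, assuming projected was computed as above and that each row corresponds to one district in the original input order:

// Take the first coordinate of each projected row as a raw score
val rawScores = projected.rows.map(v => v(0))

// Pair each score with its row index, sort by score, and assign ranks starting at 1
val ranked = rawScores.zipWithIndex()
  .sortBy(_._1, ascending = false)
  .zipWithIndex()
  .map { case ((score, districtIndex), rank) => (districtIndex, rank + 1, score) }

In practice, we would also check the sign and the explained variance of the first component before using it as a score.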

Besides cluster analysis and PCA, we will also use regression modeling and decision tree modeling to help us understand how communities fall into one category or rank rather than another; a short decision tree sketch follows the regression examples below.

Regression models

By now, you know that regression is among the most commonly used models for prediction, and we have utilized it for various projects throughout this book.

As we discussed, there are two kinds of regression modeling suitable for different kinds of predictions: linear regression and logistic regression. For this project, linear regression can be used when we take a continuous district score as our target variable, while logistic regression can be used if we want to predict whether or not a district falls into a certain rank category.

For your convenience, in MLlib, the following code can be used for linear regression (here, trainingData is assumed to be an RDD of LabeledPoint rows prepared from our features):

import org.apache.spark.mllib.regression.LinearRegressionWithSGD

val numIterations = 90
val model = LinearRegressionWithSGD.train(trainingData, numIterations) // trainingData: RDD[LabeledPoint]

For logistic regression, we can use the following code (note that the setNumClasses setter belongs to LogisticRegressionWithLBFGS; LogisticRegressionWithSGD supports only binary labels and has no such setter):

import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS

val lrModel = new LogisticRegressionWithLBFGS()
  .setNumClasses(2)     // binary outcome
  .run(trainingData)    // fit on the same labeled training data

In R, as we did earlier, we will use the lm function for linear regression modeling and the glm function for logistic regression modeling.
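
As mentioned at the start of this section, decision trees can also help explain how districts fall into one rank category rather than another. The following is a minimal MLlib sketch, assuming the same trainingData of labeled points with a binary label (for example, top half versus bottom half of the ranking):

import org.apache.spark.mllib.tree.DecisionTree

// Classification tree on the labeled training data; labels must be 0.0 or 1.0 here
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()   // treat all features as continuous
val treeModel = DecisionTree.trainClassifier(trainingData, numClasses,
  categoricalFeaturesInfo, "gini", 5, 32)       // impurity, maxDepth, maxBins

println(treeModel.toDebugString)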

Score ensembling

Once the scores have been developed, one of the easiest ways of ensembling them is to construct a linear combination with a specific weight for each score. The weights can be developed from subject knowledge or with machine learning, as sketched below.
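
For example, the following is a minimal sketch of such a weighted combination. The district IDs, score values, and weights are purely illustrative; in a real project the inputs would be our cluster-based, PCA-based, and regression-based scores rescaled to a comparable range:

// Hypothetical per-district scores keyed by district ID (tiny illustrative values)
val scoreA = sc.parallelize(Seq(("D01", 80.0), ("D02", 65.0)))
val scoreB = sc.parallelize(Seq(("D01", 72.0), ("D02", 70.0)))
val scoreC = sc.parallelize(Seq(("D01", 90.0), ("D02", 60.0)))

// Illustrative weights; in practice they come from subject knowledge or a learned model
val (wA, wB, wC) = (0.5, 0.3, 0.2)

// Weighted linear combination of the three scores for each district
val ensembled = scoreA.join(scoreB).join(scoreC)
  .mapValues { case ((a, b), c) => wA * a + wB * b + wC * c }

ensembled.collect().foreach(println)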

Besides the preceding approach, there are also a few R packages available for score ensembling.
