Data and feature preparation

Anyone who has worked with open data knows that a great deal of time must be spent cleaning datasets, with much of the work devoted to data accuracy and data incompleteness.

Another main task is to merge all the datasets together, as we have separate open datasets for crime, education, resource usage, request demand, and transportation. We also have datasets from other sources, including the census.

In the Feature extraction section of Chapter 2, Data Preparation for Spark ML, we reviewed a few methods for feature extraction and discussed their implementation on Apache Spark. All the techniques discussed there can be applied to our data here.

Besides data merging, we also need to spend a lot of time on feature development, as we need good features to build the models that will produce insights for this project.

Therefore, for this project, we need to conduct data merging first, followed by feature development and selection, utilizing the techniques discussed in Chapter 2, Data Preparation for Spark ML, and in Chapter 3, A Holistic View on Spark.

Data cleaning

To obtain datasets of good quality, a lot of cleaning work needs to be completed, especially to take care of data accuracy issues and missing values.

Given the scale of the cleaning required, we adopted a flexible approach, using several tools to clean the datasets before combining them for our machine learning. Specifically, we used OpenRefine, as discussed in Chapter 5, Risk Scoring on Spark. OpenRefine, formerly Google Refine, is an open source application for data cleaning.

Some of our team members have also used OpenRefine directly. For more information on using OpenRefine, go to http://openrefine.org/.

To use OpenRefine on Data Scientist WorkBench, first go to https://datascientistworkbench.com/.

After logging in, we will see the Data Scientist Workbench home page.

Then, we click on the OpenRefine button in the top-right corner of the page.

From here, we can import datasets from our computer or from a URL.

Then, we can create an OpenRefine project to clean and prepare the data. After that, we can export the prepared data or send it to a notebook by drag and drop.

For this project, we used OpenRefine especially for identity matching (reconciliation), duplicate removal, and a small amount of dataset merging.

Besides using OpenRefine, some of our team members cleaned sample data by hand, and then programmed those cleaning procedures for distributed computing, especially to correct certain kinds of data mistakes.
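The book does not show that cleaning code; as a minimal sketch of what such programmed procedures might look like, the following Python function (with made-up column names `zip_code` and `score`) applies rules that could be discovered on a sample and then run over a larger dataset:

```python
import pandas as pd

def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply cleaning rules discovered on sample data to a full table."""
    df = df.drop_duplicates()
    # Normalize zip codes: keep digits only, pad to five characters
    df["zip_code"] = (
        df["zip_code"].astype(str).str.replace(r"\D", "", regex=True).str.zfill(5)
    )
    # Treat the sentinel value -1 as missing, then fill with the column median
    df["score"] = df["score"].replace(-1.0, float("nan"))
    df["score"] = df["score"].fillna(df["score"].median())
    return df

raw = pd.DataFrame(
    {"zip_code": ["94103", "94103", "9410", "60614-"],
     "score": [88.0, 88.0, -1.0, 92.0]}
)
cleaned = clean(raw)
```

In a distributed setting, the same per-table function could be applied partition by partition, for example through Spark.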

Data merging

In the Joining data section of Chapter 2, Data Preparation for Spark ML, we described methods to join data together with Spark SQL and other tools. All the techniques described there, as well as the identity matching and data cleaning techniques mentioned earlier, will be used here.

For this data merging task, the main focus is to merge data on location, per zip code and also per school district. That is, first we need to perform identity analytics to ensure that we have good IDs for matching.

Then, we merge datasets.
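As an illustrative sketch only (the dataset and column names here are invented, and in practice this join would run through Spark SQL as described in Chapter 2), once identity analytics has standardized the zip codes, the merge itself is a plain join:

```python
import pandas as pd

# Hypothetical per-zip-code slices of two source datasets
education = pd.DataFrame(
    {"zip_code": ["94103", "60614"], "graduation_ratio": [0.91, 0.87]}
)
crime = pd.DataFrame(
    {"zip_code": ["94103", "60614", "10001"], "crime_rate": [3.2, 2.7, 4.1]}
)

# With standardized IDs, a left join keeps every education record,
# even where a matching crime record is missing
merged = education.merge(crime, on="zip_code", how="left")
```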

After that, we reorganize datasets into a format suitable for the methods we selected in the previous section.

For information about how to reorganize datasets, you may refer to the Data reorganizing section of Chapter 2, Data Preparation for Spark ML.

Specifically, we start with the school data available at https://www.ed-data.k12.ca.us/Pages/Home.aspx.

Then, we merge a few datasets, such as weather data, census data, and city educational datasets, into it.

After that, we reorganize all the data to obtain features per school district and per academic term.
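As a small sketch of this reorganization step (the district, term, and score column names are assumptions, not the project's actual schema), aggregating school-level rows into one row per school district and academic term might look like:

```python
import pandas as pd

# Hypothetical school-level records
schools = pd.DataFrame(
    {"district": ["D1", "D1", "D2", "D2"],
     "term": ["2014-fall", "2014-fall", "2014-fall", "2015-spring"],
     "exam_score": [80.0, 90.0, 70.0, 75.0]}
)

# One row per (district, term), averaging across the schools in each district
per_district_term = (
    schools.groupby(["district", "term"], as_index=False)["exam_score"].mean()
)
```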

Feature development

As an exercise, we have also used some social media data and worked to develop features from it.

One easy social media feature is the social influence score of the school's principal, though this is probably not very useful. Obtaining social influence scores for all the students or all the teachers, however, is difficult.

As for the web data, we obtained log data for each school's website. Using methods similar to those in Chapter 4, Fraud Detection on Spark, we extracted features from the web log data. Specifically, to parse the logs and make sense of them, we applied some subject knowledge: our team first worked manually with sample data, and then used the patterns discovered to develop code in R that parses the logs and turns the extracted information into features. These features include the number of clicks, the time between clicks, the type of clicks, and others, which were used to construct interaction features for the schools.
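The project's parsing code was written in R; purely as an illustration, and with a made-up log format (timestamp, school ID, click type), the same kinds of click features could be computed as follows:

```python
from datetime import datetime

# Made-up simplified log lines: timestamp, school ID, click type
log_lines = [
    "2015-03-01T10:00:00 school42 page_view",
    "2015-03-01T10:00:30 school42 download",
    "2015-03-01T10:02:30 school42 page_view",
]

def click_features(lines):
    """Derive click count, distinct click types, and mean seconds between clicks."""
    times, types = [], []
    for line in lines:
        ts, _school, click_type = line.split()
        times.append(datetime.fromisoformat(ts))
        types.append(click_type)
    gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]
    return {
        "n_clicks": len(times),
        "n_click_types": len(set(types)),
        "mean_gap_s": sum(gaps) / len(gaps) if gaps else 0.0,
    }

features = click_features(log_lines)
```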

Feature selection

After the work mentioned in the preceding section, we have more than 100 features ready to be used.

For feature selection in this project, we could follow the approach of Chapter 8, Learning Analytics on Spark, which was to utilize PCA together with subject knowledge to group features, and then apply machine learning for the final feature selection. However, as an exercise, we will not repeat that approach but will try something different: we will let the machine learning algorithms pick out the features that are most useful for prediction.

In MLlib, we can use the ChiSqSelector algorithm as follows:

import org.apache.spark.mllib.feature.ChiSqSelector

// Create a ChiSqSelector that will select the top 25 features
val selector = new ChiSqSelector(25)
// Fit the ChiSqSelector model on the labeled training data (an RDD[LabeledPoint])
val transformer = selector.fit(trainingData)
// Apply the model to reduce each feature vector to the selected features
val filteredData = trainingData.map(lp =>
  lp.copy(features = transformer.transform(lp.features)))

In R, we can use some packages to make the computation easy. Among the available packages, caret is one of the most commonly used.
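For readers working in Python, scikit-learn offers an analogous chi-squared selector; this toy sketch (with invented data) keeps the 2 of 3 features most associated with the label:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data: three non-negative features, one binary label
X = np.array([[1, 9, 0],
              [2, 8, 1],
              [8, 1, 0],
              [9, 2, 1]])
y = np.array([0, 0, 1, 1])

# Keep the two features with the highest chi-squared score against the label
selector = SelectKBest(chi2, k=2)
X_selected = selector.fit_transform(X, y)
```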

After this, we end up with a large dataset containing the following sample features:

  • School name
  • School ID
  • Graduation ratio
  • Dropout ratio
  • Average score from state exam 1
  • Average score from state exam 2
  • Social media participation score
  • Web interactions
  • Parent participation
  • Outdoor activities
  • Mobility
  • Technology usage
  • College connection

Besides this, we have also obtained a dataset at the school district level; since each district has more than one school, school averages were calculated for each district.

So, besides the preceding features, we also have school district data for the following features:

  • Economics
  • Crime
  • Business

We have all the data from 2000 to 2015.
