Data and feature preparation

In the Feature extraction section of Chapter 2, Data Preparation for Spark ML, we reviewed a few methods of feature extraction and discussed their implementation in Apache Spark. All the techniques discussed there can be applied to our data here.

Besides feature development, this project also requires considerable effort to merge various datasets together in order to obtain more features.

Therefore, for this project, we need to carry out feature development first, then data merging, and finally feature selection, drawing on the techniques discussed in Chapter 2, Data Preparation for Spark ML, and Chapter 3, A Holistic View on Spark.

Data merging

To obtain features for prediction, we need to add some external datasets, including weather data from the National Weather Service Forecast Office, events and calendar data from the Open Data portal, and socio-economic data for each zip code block from census data sources.

In the Joining data section of Chapter 2, Data Preparation for Spark ML, we described methods for joining data with Spark SQL and other tools. All of those techniques, together with the identity matching and data cleaning techniques described in the same chapter, can be used here.

This data merging task has two main parts: merging data by date (per day) and merging data by location (per zip code). First, we will reorganize all the 311 request data into one dataset with features per day, to obtain the number of requests per day and other daily features. Second, we will reorganize the same 311 request data into another dataset with features per location (here, per zip code), to obtain features such as the number of service requests per zip code. To learn how to reorganize datasets, readers may refer to the Data reorganizing section in Chapter 2, Data Preparation for Spark ML.
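
The two reorganizations described above can be sketched with Spark DataFrames as follows. This is only a minimal illustration, assuming Spark 2.x or later; the column names (created_date, incident_zip, complaint_type) and the three sample rows are hypothetical stand-ins for the real 311 data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, count, lit, to_date}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("reorganize-311")
  .getOrCreate()
import spark.implicits._

// Tiny stand-in for the 311 requests table (hypothetical columns and rows)
val requests = Seq(
  ("2015-01-01 08:15:00", "10001", "Noise"),
  ("2015-01-01 17:40:00", "10002", "Heating"),
  ("2015-01-02 09:05:00", "10001", "Noise")
).toDF("created_date", "incident_zip", "complaint_type")

// Dataset 1: one row per day, with the number of requests on that day
val requestsPerDay = requests
  .withColumn("request_date", to_date(col("created_date")))
  .groupBy("request_date")
  .agg(count(lit(1)).as("num_requests"))

// Dataset 2: one row per zip code, with the number of requests in that zip code
val requestsPerZip = requests
  .groupBy("incident_zip")
  .agg(count(lit(1)).as("num_requests"))
```

Other per-day or per-zip features (for example, counts per complaint type) could be added as further expressions inside the same agg calls.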

Once we have created these two datasets, we will merge the first with weather, events, and calendar data, and the second with census data.
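
As a rough sketch of these two merges, the per-day dataset can be left-joined on the date key and the per-zip dataset on the zip code key. All table names, column names, and values below are made up for illustration and are not taken from the real weather, calendar, or census sources.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("merge-311")
  .getOrCreate()
import spark.implicits._

// Hypothetical per-day request counts and daily external data
val requestsPerDay = Seq(("2015-01-01", 2L), ("2015-01-02", 1L))
  .toDF("request_date", "num_requests")
val weather = Seq(("2015-01-01", 30.0, 40.0), ("2015-01-02", 28.0, 34.0))
  .toDF("request_date", "temp_min", "temp_max")
val calendar = Seq(("2015-01-01", true), ("2015-01-02", false))
  .toDF("request_date", "is_holiday")

// Left joins keep every day that has requests, even if external data is missing
val dailyFeatures = requestsPerDay
  .join(weather, Seq("request_date"), "left")
  .join(calendar, Seq("request_date"), "left")

// Hypothetical per-zip request counts and census data (toy income values)
val requestsPerZip = Seq(("10001", 2L), ("10002", 1L), ("10003", 4L))
  .toDF("zip_code", "num_requests")
val census = Seq(("10001", 50000.0), ("10002", 60000.0))
  .toDF("zip_code", "median_income")

// Zip codes without a census record get null census features
val zipFeatures = requestsPerZip.join(census, Seq("zip_code"), "left")
```

A left join is used here as one reasonable choice so that no request data is dropped; an inner join would instead keep only the days or zip codes present in every source.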

After merging with events data and calendar data, we will obtain new features such as whether a day is a holiday, whether there are special events, whether it is a weekday or a weekend, and others.

After merging with weather data, we will obtain new features such as whether the day was rainy or snowy, the average temperature, the temperature range of the day, and other variables.

On the location side, we will work at the zip code level, so that after merging with census data we will obtain new features about employment, income level, race, and others.
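
To illustrate how such calendar features might be derived, the following sketch flags holidays and weekends for a few dates. It assumes Spark 2.3 or later (for the dayofweek function), and the holidays table with its column names is a hypothetical stand-in for the Open Data calendar source.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, dayofweek, to_date}

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("calendar-features")
  .getOrCreate()
import spark.implicits._

// Days for which we have request counts (sample dates only)
val days = Seq("2015-01-01", "2015-01-03", "2015-01-05")
  .toDF("request_date")
  .withColumn("request_date", to_date(col("request_date")))

// Hypothetical holiday calendar
val holidays = Seq(("2015-01-01", "New Year's Day"))
  .toDF("holiday_date", "holiday_name")
  .withColumn("holiday_date", to_date(col("holiday_date")))

val calendarFeatures = days
  .join(holidays, days("request_date") === holidays("holiday_date"), "left")
  // A day is a holiday if it matched a row in the holiday calendar
  .withColumn("is_holiday", col("holiday_name").isNotNull)
  // dayofweek returns 1 for Sunday through 7 for Saturday
  .withColumn("is_weekend", dayofweek(col("request_date")).isin(1, 7))
  .select("request_date", "is_holiday", "is_weekend")
```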
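
A minimal sketch of deriving these weather features is shown next. The column names and values are invented for illustration, and the average temperature is approximated here as the midpoint of the daily minimum and maximum, which is an assumption rather than the definition used by the weather source.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("weather-features")
  .getOrCreate()
import spark.implicits._

// Hypothetical daily observations: min/max temperature, rainfall, snowfall
val weather = Seq(
  ("2015-01-01", 30.0, 40.0, 0.0, 0.0),
  ("2015-01-02", 28.0, 34.0, 0.2, 1.5)
).toDF("obs_date", "temp_min", "temp_max", "rain_in", "snow_in")

val weatherFeatures = weather
  // Indicator features: any measurable rain or snow on that day
  .withColumn("is_rainy", col("rain_in") > 0)
  .withColumn("is_snowy", col("snow_in") > 0)
  // Midpoint of min and max as a simple stand-in for the daily average
  .withColumn("temp_avg", (col("temp_min") + col("temp_max")) / 2)
  // Temperature range of the day
  .withColumn("temp_range", col("temp_max") - col("temp_min"))
```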

Feature selection

Taking the New York City 311 data as an example, we have more than 50 features in the data, including the time at which requests were made, the locations for services, the government agencies to which the requests were directed, the types of services requested, the processing time for requests, and the results of these requests.

After merging the location-related and time-related datasets as described in the previous section, we will have more than 100 features ready to be used.

For feature selection in this project, we could follow the approach used in Chapter 8, Learning Analytics on Spark, which is to utilize principal component analysis (PCA) and subject knowledge to group features and then apply machine learning for final feature selection. However, as an exercise, we will not repeat what was learned but will try something different. That is, we will let the machine learning algorithm pick the features most useful for prediction.

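In MLlib, we can use the ChiSqSelector algorithm as follows:

import org.apache.spark.mllib.feature.ChiSqSelector
// TrainingData is expected to be an RDD[LabeledPoint] with categorical features
// Create ChiSqSelector that will select top 25 of 400 features
val selector = new ChiSqSelector(25)
// Create ChiSqSelector model (selecting features)
val transformer = selector.fit(TrainingData)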
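
Once fitted, the selector model can be applied to each data point to keep only the selected features. The following self-contained sketch shows this on a toy dataset; the four labeled points are made up so that the first feature matches the label exactly and the second is unrelated to it, and the names trainingData and filteredData are illustrative only. Since ChiSqSelector operates on categorical features, continuous features would need to be discretized first.

```scala
import org.apache.spark.mllib.feature.ChiSqSelector
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("chisq-selector-sketch")
  .getOrCreate()
val sc = spark.sparkContext

// Toy data: feature 0 equals the label, feature 1 is independent of it
val trainingData = sc.parallelize(Seq(
  LabeledPoint(0.0, Vectors.dense(0.0, 0.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 1.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(1.0, 1.0))
))

// Keep only the single most predictive feature
val selector = new ChiSqSelector(1)
val model = selector.fit(trainingData)

// Apply the fitted model to every point, dropping the unselected features
val filteredData = trainingData.map { lp =>
  LabeledPoint(lp.label, model.transform(lp.features))
}

// Indices of the features that were kept
println(model.selectedFeatures.mkString(", "))
```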

In R, several packages make this computation easier. Among them, caret is one of the most commonly used.
