Data and feature preparation

In the Feature extraction section of Chapter 2, Data Preparation for Spark ML, we reviewed a few methods for feature extraction and discussed their implementation in Apache Spark. All the techniques discussed there can be applied to this risk scoring project.

For this project, as mentioned earlier, the main concern is to organize everything into workflows for repeatability and, where possible, automation. We therefore adopt OpenRefine for data and feature preparation, using it within the DataScientistWorkbench environment, where it is already integrated.

OpenRefine

OpenRefine, formerly Google Refine, is an open source application for data cleaning.

To use OpenRefine, please go to: https://datascientistworkbench.com/

After logging in, click the OpenRefine button in the upper-right corner of the screen.

Here, you can import datasets from your computer or from a URL address.

Then you can create an OpenRefine project for data cleaning and preparation. After that, you can export the prepared data, or send the data to a notebook by drag and drop.

For this project, we used OpenRefine specifically for identity matching (reconciliation), deleting duplicates, and merging datasets.
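To make these three operations concrete outside the OpenRefine interface, here is a minimal Python sketch using only the standard library. It is not how OpenRefine implements them: the function names (`normalize`, `reconcile`, `drop_duplicates`, `merge`), the 0.85 similarity cutoff, and the sample records are all assumptions made for illustration.

```python
import difflib


def normalize(name):
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in name.lower())
    return " ".join(cleaned.split())


def reconcile(name, known_names, cutoff=0.85):
    """Return the known name most similar to `name`, or None if none is close enough.

    The cutoff is an illustrative assumption, not an OpenRefine setting.
    """
    by_key = {normalize(known): known for known in known_names}
    matches = difflib.get_close_matches(normalize(name), list(by_key), n=1, cutoff=cutoff)
    return by_key[matches[0]] if matches else None


def drop_duplicates(rows, key):
    """Keep the first row for each normalized value of `key`."""
    seen = set()
    unique = []
    for row in rows:
        k = normalize(row[key])
        if k not in seen:
            seen.add(k)
            unique.append(row)
    return unique


def merge(left, right, key):
    """Inner-join two lists of dicts on the normalized value of `key`.

    Where both sides define a field, the value from `left` wins.
    """
    right_by_key = {normalize(r[key]): r for r in right}
    merged = []
    for row in left:
        match = right_by_key.get(normalize(row[key]))
        if match is not None:
            merged.append({**match, **row})
    return merged


# Hypothetical sample data, invented for this sketch.
customers = [
    {"name": "Acme Corp.", "region": "East"},
    {"name": "acme corp", "region": "East"},
    {"name": "Globex Inc", "region": "West"},
]
scores = [
    {"name": "ACME Corp", "risk": 0.2},
    {"name": "Globex, Inc.", "risk": 0.7},
]

unique_customers = drop_duplicates(customers, "name")
merged_rows = merge(unique_customers, scores, "name")
```

In this sketch, normalizing before comparing is what lets "Acme Corp." and "acme corp" count as the same identity. OpenRefine offers its own clustering and reconciliation features for this purpose, and they should be preferred when working inside the tool.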
