Data and feature preparation

In the Feature extraction section of Chapter 2, Data Preparation for Spark ML, we reviewed several methods for feature extraction and discussed their implementation in Apache Spark. All of the techniques discussed there can be applied to this risk scoring project.

As mentioned earlier, the main concern for this project is to organize everything into workflows for repeatability and, where possible, automation. We will therefore use OpenRefine for data and feature preparation, working within the DataScientistWorkbench environment, where it is already integrated.

OpenRefine

OpenRefine, formerly Google Refine, is an open source application for data cleaning.

To use OpenRefine, please go to: https://datascientistworkbench.com/

After logging in, you will see the workbench's main screen.

Then click the OpenRefine button in the upper-right corner of the screen to open OpenRefine.

Here, you can import datasets from your computer or from a URL address.

Then you can create an OpenRefine project for data cleaning and preparation. After that, you can export the prepared data, or send the data to a notebook by drag and drop.

For this project, we used OpenRefine specifically for identity matching (reconciliation), removing duplicates, and merging datasets.
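To make these three operations concrete, here is a minimal sketch in plain Python of what they accomplish on a small dataset. It is not OpenRefine's own algorithm or API: the function names (normalize_name, deduplicate, merge), the field names, and the sample records are all hypothetical and chosen only for illustration. Matching here relies on a crude normalized key, whereas OpenRefine offers more capable reconciliation and clustering methods.

```python
import re
from collections import OrderedDict


def normalize_name(name):
    """Build a crude matching key: lowercase, drop punctuation, collapse spaces."""
    cleaned = re.sub(r"[^a-z0-9 ]", "", name.lower())
    return " ".join(cleaned.split())


def deduplicate(records, key_field):
    """Keep the first record seen for each normalized key (insertion order kept)."""
    unique = OrderedDict()
    for record in records:
        key = normalize_name(record[key_field])
        unique.setdefault(key, record)
    return unique


def merge(left, right, key_field):
    """Left-join two record lists on the normalized key, after deduplicating both."""
    right_by_key = deduplicate(right, key_field)
    merged = []
    for key, record in deduplicate(left, key_field).items():
        combined = dict(record)
        # Copy over every field from the matching right-hand record,
        # except the key field itself, so the left spelling is preserved.
        for field, value in right_by_key.get(key, {}).items():
            if field != key_field:
                combined[field] = value
        merged.append(combined)
    return merged


# Hypothetical sample data: two spellings of the same applicant, plus one other.
applicants = [
    {"name": "Acme Corp.", "income": 120},
    {"name": "ACME  corp", "income": 120},
    {"name": "Beta LLC", "income": 80},
]
credit = [
    {"name": "acme corp", "late_payments": 2},
]

merged = merge(applicants, credit, "name")
for row in merged:
    print(row)
```

In this sketch the two Acme spellings collapse into one record, which then picks up the late_payments field from the second dataset, while the Beta record passes through unchanged because it has no match. Keeping the first record per key is a simplifying assumption; a real workflow would need a rule for resolving conflicting values.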
