Chapter 2. Cleaning and Validating Data

In this chapter, we will cover the following recipes:

  • Cleaning data with regular expressions
  • Maintaining consistency with synonym maps
  • Identifying and removing duplicate data
  • Regularizing numbers
  • Calculating relative values
  • Parsing dates and times
  • Lazily processing very large data sets
  • Sampling from very large data sets
  • Fixing spelling errors
  • Parsing custom data formats
  • Validating data with Valip

Introduction

You probably won't spend as much time getting the data as you will getting it into shape. Raw data is often inconsistent, duplicated, or full of holes. Addresses might be missing, years and dates might be formatted in a thousand different ways, or names might be entered into the wrong fields. You'll have to fix these issues before the data is usable.

This is often an iterative, interactive process. If it's a very large dataset, I might create a sample to work with at this stage. Generally, I start by examining the data files. Once I find a problem, I try to code a solution, which I run on the dataset. After each change, I archive the data, either in a ZIP file or, if the data files are small enough, in a version control system. A version control system is a good option because I can track the code that transforms the data alongside the data itself, and I can also include comments about what I'm doing. Then I look at the data again, and the process repeats. Once I've moved on to analyzing the entire collection of data, I might find more issues, or I might need to change the data somehow to make it easier to analyze, and I'm back in the data cleansing loop once more.

Clojure is an excellent tool for this kind of work, because a REPL is a great environment to explore data and fix it interactively. Also, because many of its sequence functions are lazy by default, Clojure makes it easy to work with a lot of data.
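
To make the laziness point concrete, here's a minimal sketch of processing a file lazily, so only the lines we actually consume are read into memory. The file name data.csv is a placeholder, and the cleaning steps are just illustrative:

    (require '[clojure.java.io :as io]
             '[clojure.string :as str])

    (with-open [rdr (io/reader "data.csv")]
      (->> (line-seq rdr)        ; lazy sequence of lines
           (map str/trim)        ; clean each line as it is read
           (remove str/blank?)   ; drop empty lines
           (take 10)             ; only the first 10 lines are realized
           (doall)))             ; force realization before the reader closes

Because line-seq is lazy, a file far larger than available memory can be processed this way; the doall at the end is only needed here because with-open closes the reader when the block exits.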

This chapter will highlight a few of the many features Clojure offers for cleaning data. Initially, we'll take a look at regular expressions and some other basic tools. Then, we'll move on to normalizing specific kinds of values. The next few recipes will turn our attention to handling very large data sets. After that, we'll look at some more sophisticated ways to fix data, writing a simple spell checker and a custom parser. Finally, the last recipe will introduce you to a Clojure library with a good DSL for writing tests that validate your data.
