Chapter 2. Cleaning and Validating Data

In this chapter, we will cover the following recipes:

  • Cleaning data with regular expressions
  • Maintaining consistency with synonym maps
  • Identifying and removing duplicate data
  • Regularizing numbers
  • Calculating relative values
  • Parsing dates and times
  • Lazily processing very large data sets
  • Sampling from very large data sets
  • Fixing spelling errors
  • Parsing custom data formats
  • Validating data with Valip

Introduction

You probably won't spend as much time getting the data as you will getting it into shape. Raw data is often inconsistent, duplicated, or full of holes. Addresses might be missing, years and dates might be formatted in a thousand different ways, or names might be entered into the wrong fields. You'll have to fix these issues before the data is usable.

This is often an iterative, interactive process. If it's a very large dataset, I might create a sample to work with at this stage. Generally, I start by examining the data files. Once I find a problem, I try to code a solution, which I run on the dataset. After each change, I archive the data, either in a ZIP file or, if the data files are small enough, in a version control system. A version control system is a good option because I can track the code that transforms the data alongside the data itself, and I can also include comments about what I'm doing. Then I look at the data again, and the process repeats. Once I've moved on to analyzing the entire collection of data, I might find more issues, or I might need to change the data somehow to make it easier to analyze, and I'm back in the data cleansing loop once more.

Clojure is an excellent tool for this kind of work, because a REPL is a great environment to explore data and fix it interactively. Also, because many of its sequence functions are lazy by default, Clojure makes it easy to work with a lot of data.
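
To make the laziness point concrete, here's a minimal sketch of processing a file lazily, so only the lines we actually consume are read into memory. The file name data.csv is a placeholder, and the cleaning steps are just illustrative:

    (require '[clojure.java.io :as io]
             '[clojure.string :as str])

    (with-open [rdr (io/reader "data.csv")]
      (->> (line-seq rdr)        ; lazy sequence of lines
           (map str/trim)        ; clean each line as it is read
           (remove str/blank?)   ; drop empty lines
           (take 10)             ; only the first 10 lines are realized
           (doall)))             ; force realization before the reader closes

Because line-seq is lazy, a file far larger than available memory can be processed this way; the doall at the end is only needed here because with-open closes the reader when the block exits.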

This chapter will highlight a few of the many features Clojure offers for cleaning data. Initially, we'll take a look at regular expressions and some other basic tools. Then, we'll move on to normalizing specific kinds of values. The next few recipes will turn our attention to handling very large data sets. After that, we'll look at some more sophisticated ways to fix data, writing a simple spell checker and a custom parser. Finally, the last recipe will introduce you to a Clojure library with a good DSL for writing tests that validate your data.
