Chapter 7. Tidying Up Your Data

Data analysis typically flows through a processing pipeline that starts with retrieving data from one or more sources. When this data arrives, it is often in a raw form that is difficult to use for analysis. This can happen for many reasons: values were never recorded, data was lost, or the data is simply in a different format from the one you require.

Therefore, one of the most common things you will do with pandas is tidying your data, which is the process of preparing raw data for analysis. This chapter focuses on showing you how to use various features of pandas to get raw data into a tidy form.

In this chapter, you will learn:

The concept of tidy data
How pandas represents unknown values
How to find NaN values in data
How to filter (drop) data
What pandas does with unknown values in calculations
How to find, filter and fix unknown values
How to identify and remove duplicate data
How to transform values using replace, map, and apply

What is tidying your data?

Tidy data is a term coined in a well-known data science paper, "Tidy Data" by Hadley Wickham. I highly recommend that you read it; it can be downloaded at http://vita.had.co.nz/papers/tidy-data.pdf. The paper covers many details of the process that he calls tidying data, the result of which is tidy data: data that is ready for analysis.

This chapter will introduce and briefly demonstrate many of the capabilities of pandas for tidying data. We will not get into all of the details of the paper, but as an opening to what we will cover, I would like to briefly summarize why you need to tidy data and what the characteristics of tidy data are, so that you know when you have completed the task and are ready to move on to analysis.

Tidying of data is required for many reasons including these:

The names of the variables are different from what you require
There is missing data
Values are not in the units that you require
The period of sampling of records is not what you need
Variables are categorical and you need quantitative values
There is noise in the data
Information is of an incorrect type
Data is organized around incorrect axes
Data is at the wrong level of normalization
Data is duplicated

This is quite a list, and it is very likely that I have missed a few points. In working with data, I have seen all of these issues at one time or another, and often many of them at once. Fixing them can be very difficult in programming languages such as Java or C#, and they often cause exceptions at the worst times (such as in production in a high-volume trading system).
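To make a few of the problems from the list concrete, here is a minimal sketch of how they might look and be handled in pandas. The DataFrame, its column names, and its values are invented purely for illustration; the techniques themselves are covered properly later in the chapter.

```python
import pandas as pd

# Hypothetical raw data (names and values are assumptions for illustration).
# Problems present: an awkward column name, numbers stored as strings,
# a missing value, a duplicated record, and units we do not want.
raw = pd.DataFrame({
    "Temp F": ["68.0", "72.5", None, "72.5"],
    "city": ["Boston", "Denver", "Austin", "Denver"],
})

tidy = (
    raw.rename(columns={"Temp F": "temp_f"})                 # fix the variable name
       .drop_duplicates()                                    # remove the duplicated record
       .assign(temp_f=lambda d: pd.to_numeric(d["temp_f"]))  # strings to floats
       .dropna(subset=["temp_f"])                            # drop the row with missing data
       .assign(temp_c=lambda d: (d["temp_f"] - 32) * 5 / 9)  # convert to the units we need
       .reset_index(drop=True)
)

print(tidy)
```

Dropping the row with the missing temperature is only one possible choice here; depending on the analysis, filling or interpolating the value may be more appropriate.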

Turning from the problems that need to be addressed, data that can be considered good, tidy, and ready for analysis has several characteristics:

Each variable is in one column
Each observation of the variable is in a different row
There should be one table for each kind of observational unit
If there are multiple tables, they should be relatable to one another
Qualitative and categorical variables have mappings to values useful for analysis
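As a rough illustration of these characteristics, the following sketch reshapes a small table into tidy form. All table names, column names, values, and the size-to-code mapping are assumptions made up for the example, not data from this chapter.

```python
import pandas as pd

# Hypothetical "wide" table: the year variable is hidden in the column headers.
wide = pd.DataFrame({
    "city": ["Boston", "Denver"],
    "2014": [10, 20],
    "2015": [11, 21],
})

# Each variable in one column, each observation in its own row.
long = wide.melt(id_vars="city", var_name="year", value_name="sales")
long["year"] = long["year"].astype(int)

# A second, separate table describing the cities themselves.
sizes = pd.DataFrame({"city": ["Boston", "Denver"], "size": ["large", "small"]})

# Map the categorical variable to values useful for analysis
# (the codes 0 and 1 are an arbitrary choice for this sketch).
sizes["size_code"] = sizes["size"].map({"small": 0, "large": 1})

# The two tables are relatable through the shared "city" key.
combined = long.merge(sizes, on="city")

print(combined)
```

After reshaping, every row of long is a single observation (one city in one year), and the shared key column is what lets the two tables be joined when needed.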

Fortunately, pandas has been designed to make dealing with these issues as painless as possible, and you will learn how to address most of them in the remainder of this chapter.
