Chapter 7. Tidying Up Your Data

Data analysis typically flows through a processing pipeline that starts with retrieving data from one or more sources. When it arrives, this data is often in a raw form that is difficult to use for analysis. This can happen for many reasons: the data was never recorded, it was lost, or it is simply in a different format from the one you require.

Therefore, one of the most common things you will do with pandas is tidy your data, which is the process of preparing raw data for analysis. This chapter shows you how to use various features of pandas to get raw data into a tidy form.

In this chapter, you will learn:

  • The concept of tidy data
  • How pandas represents unknown values
  • How to find NaN values in data
  • How to filter (drop) data
  • What pandas does with unknown values in calculations
  • How to find, filter and fix unknown values
  • How to identify and remove duplicate data
  • How to transform values using replace, map, and apply

What is tidying your data?

Tidy data is a term coined in a well-known data science paper, "Tidy Data" by Hadley Wickham. I highly recommend that you read it; it can be downloaded at http://vita.had.co.nz/papers/tidy-data.pdf. The paper covers many details of the process that he calls tidying data, the result of which is tidy data: data that is ready for analysis.

This chapter will introduce and briefly demonstrate many of the capabilities of pandas for tidying data. We will not get into all of the details of the paper, but as an opening to what we will cover, I would like to briefly summarize why you need to tidy data and what the characteristics of tidy data are, so that you know when you have completed the task and are ready to move on to analysis.

Tidying of data is required for many reasons including these:

  • The names of the variables are different from what you require
  • There is missing data
  • Values are not in the units that you require
  • The period of sampling of records is not what you need
  • Variables are categorical and you need quantitative values
  • There is noise in the data
  • Information is of an incorrect type
  • Data is organized around incorrect axes
  • Data is at the wrong level of normalization
  • Data is duplicated

This is quite a list, and it is very likely that I have missed a few points. In working with data, I have seen all of these issues at one time or another, and often many of them at once. Fixing them can be very difficult in general-purpose programming languages such as Java or C#, and they often cause exceptions at the worst times (such as in production in a high-volume trading system).
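
To make a few of the problems listed above concrete, here is a minimal sketch of how they might show up in a small DataFrame and how pandas can detect and correct them. The data, the column names (Temp F, city, temp_f), and the order of the steps are assumptions made up for illustration only; they are not taken from a real dataset.

```python
import pandas as pd

# Hypothetical raw data with several of the problems listed above:
# an awkward column name, numbers stored as strings, a missing value,
# and a duplicated row.
raw = pd.DataFrame({
    "Temp F": ["70.1", "68.4", None, "68.4"],
    "city": ["Austin", "Boston", "Boston", "Boston"],
})

# Detect: count missing values per column and fully duplicated rows.
missing_per_column = raw.isna().sum()
duplicate_rows = raw.duplicated().sum()

# Fix: rename the variable, drop the duplicated row, and convert the
# string values to numbers (the missing value becomes NaN).
tidy = (
    raw.rename(columns={"Temp F": "temp_f"})
       .drop_duplicates()
       .assign(temp_f=lambda df: pd.to_numeric(df["temp_f"]))
)

print(missing_per_column)
print(duplicate_rows)
print(tidy)
```

Each of these operations is only one of several ways to handle the corresponding problem, and the right choice depends on your data.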

Moving on from the problems that need to be addressed, data that is good, tidy, and ready for analysis has several characteristics:

  • Each variable is in one column
  • Each observation of the variable is in a different row
  • There should be one table for each kind of observational unit
  • If there are multiple tables, they should be relatable to one another
  • Qualitative and categorical variables have mappings to values useful for analysis

Fortunately, pandas has been designed to make dealing with all of these issues as painless as possible, and you will learn how to address most of them in the remainder of this chapter.
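
As a small illustration of the first two characteristics of tidy data (each variable in one column, each observation in its own row), the following sketch reshapes a "wide" table into a "long" one using melt. The table, its values, and the names city, year, and sales are hypothetical and chosen only for this example.

```python
import pandas as pd

# Hypothetical "wide" table: the year variable is spread across
# column headers instead of living in its own column.
wide = pd.DataFrame({
    "city": ["Austin", "Boston"],
    "2019": [10, 20],
    "2020": [11, 21],
})

# melt turns the year columns into rows, so that each row is a single
# observation: one city, one year, one sales figure.
tidy_long = wide.melt(id_vars="city", var_name="year", value_name="sales")

# The years came from column headers, so they are strings;
# convert them to integers.
tidy_long["year"] = tidy_long["year"].astype(int)

print(tidy_long)
```

In the reshaped table, city, year, and sales are each a single column, which makes operations such as grouping by year straightforward.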
