Chapter 10. Advanced Data Management

When we discuss data analysis, we usually think of the operations performed on data that yield new insights about whatever phenomena the data reflect. Before such operations can begin, however, we usually need to clean up the data we start with and wrangle it into an analyzable form. Unfortunately, such wrangling typically occupies at least as much time as (if not more than) the actual analysis in most real-world projects. Data management is therefore one of the most useful skills in data analysis, and while it receives ample coverage in books on database programming, it gets little coverage in most texts on R.

Data wrangling refers to activities that make data more usable by changing their form but not their meaning. It may involve reformatting data, mapping data from one data model to another, or converting data into more consumable forms. Such activities make it easier to submit data to a database or repository, load data into analysis software, publish data on the Internet, compare datasets, or otherwise make data more accessible, usable, and shareable in different settings.

In this chapter, we will discuss the following topics:

  • Cleaning up datasets
  • Pattern matching
  • Floating point operations and numerical data types
  • Memory management
  • Missing data and multiple imputation

We will focus on data types, data structures, and messy data in this chapter. While this is usually not considered the exciting part of data analysis, it is where most data analysts will likely spend the majority of their time, unless they are fortunate enough to have extremely well-curated datasets to work with.

Cleaning datasets in R

The first step in any data analysis is preparing the data for the analysis. The rest of this chapter will mostly deal with this topic, but here we will review some basic considerations and R techniques. The most important part of any data analysis is to know the dataset and to have some idea of how each of the variables in the dataset was created.

For a basic overview, we will use the pumpkin dataset, which is short and artificial. Have a look at the data it contains:

> pumpkins <- read.csv('messy_pumpkins.txt', stringsAsFactors = FALSE)
> pumpkins
      weight      location
1        2.3        europe
2      2.4kg       Europee
3     3.1 kg           USA
4 2700 grams United States
5         24          U.S.

Tip

When loading data frames, R's default behavior is to treat strings as categorical factors rather than as literal strings. This is usually the desired behavior for a dataset with consistently denoted factors, but it is a problem if the same factors have been denoted with different strings. If we wish to treat the strings as strings, we can pass the stringsAsFactors = FALSE argument to the read.csv function.
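As a quick illustration (a minimal sketch, assuming messy_pumpkins.txt is in the working directory), reading the file with and without this argument changes the class of the text columns:

# Read the same file twice: once coercing strings to factors, once keeping
# them as character vectors (both flags set explicitly, so the result does
# not depend on the R version's default)
as_factors <- read.csv('messy_pumpkins.txt', stringsAsFactors = TRUE)
as_strings <- read.csv('messy_pumpkins.txt', stringsAsFactors = FALSE)

class(as_factors$location)   # "factor"
class(as_strings$location)   # "character"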

As can be seen from the previous data, the weights are written in different ways and in different units. The locations are inconsistent, and there is a misspelling. This is hopefully messier than most datasets that you will have to work with, but these are the kinds of problems frequently encountered in large datasets, especially when good efforts are not made to ensure high-quality data entry up front. We can take a first pass at the location column with a few basic string operations, as sketched below.
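One possible first pass at the location column (just a sketch, not necessarily how we will handle it later in the chapter) is to lower-case the entries and collapse the obvious variants:

# lower-case everything so capitalization differences disappear
loc <- tolower(pumpkins$location)
# "europe" and the misspelled "europee" both start with "e"
loc[grepl("^e", loc)] <- "europe"
# collapse the three ways the United States has been written
loc[loc %in% c("usa", "united states", "u.s.")] <- "us"
pumpkins$location <- loc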

Notably, the 24 seen in the fifth row is very different from the rest of the values and has no units attached. Do we assume it is kilograms and that someone left out the decimal point when entering the data? Do we ignore it as completely unreliable? This is not a statistical question but a substantive one. Here we will assume that a decimal point was accidentally omitted and that the value is in kilograms.
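One way to standardize the weight column (a minimal sketch, assuming the corrections just discussed: grams are converted to kilograms, and the unit-less 24 is treated as 2.4 kg with a missing decimal point) looks like this:

# flag the rows recorded in grams before stripping the unit text
grams <- grepl("gram", pumpkins$weight)
# remove everything that is not a digit or a decimal point
weight_kg <- as.numeric(gsub("[^0-9.]", "", pumpkins$weight))
# convert grams to kilograms
weight_kg[grams] <- weight_kg[grams] / 1000
# assume the implausibly large value is missing a decimal point
weight_kg[weight_kg > 20] <- weight_kg[weight_kg > 20] / 10
pumpkins$weight <- weight_kg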
