Chapter 8. Polishing Data

When working with data, you will often find that it is not perfect or clean: it may contain missing values, outliers, and similar anomalies. Handling and cleaning such imperfect, so-called dirty data is part of every data scientist's daily life; what's more, it can take up to 80 percent of the time we actually spend dealing with the data!

Dataset errors are often due to inadequate data acquisition methods, but instead of repeating and tweaking the data collection process, it is usually better (in terms of saving money, time, and other resources), or simply unavoidable, to polish the data with a few simple functions and algorithms. In this chapter, we will cover:

  • Different use cases of the na.rm argument of various functions
  • The na.action and related functions to get rid of missing data
  • Several packages that offer a user-friendly way of data imputation
  • The outliers package with several statistical tests for extreme values
  • How to implement Lund's outlier test on our own as a brain teaser
  • Referring to some robust methods

The types and origins of missing data

First, we have to take a quick look at the possible sources of missing data to identify why and how we usually end up with missing values. There are quite a few different reasons for data loss, which are usually categorized into three types.

For example, the main cause of missing data might be a malfunctioning device or human error when entering the data. Missing Completely at Random (MCAR) means that every value in the dataset has the same probability of being missing, so no systematic error or distortion is to be expected due to the missing data, nor can we explain the pattern of missing values. This is the best situation we can hope for if we have NA (meaning: no answer, not applicable, or not available) values in our dataset.
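
Before classifying the other types, it may help to see how R represents and handles such values in practice. The following minimal sketch (with made-up numbers) shows that NA propagates through most computations unless we explicitly tell a function to ignore missing values via the na.rm argument, whose use cases are covered later in this chapter:

    x <- c(185, 163, NA, 172)
    mean(x)                # NA: the mean of partly unknown data is unknown
    mean(x, na.rm = TRUE)  # 173.3333: the missing value is simply ignored
    is.na(x)               # FALSE FALSE TRUE FALSE: flags the missing element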

A much more frequent, and unfortunately worse, type of missing data is Missing at Random (MAR). In the case of MAR, the pattern of missing values is known or can at least be identified, although it has nothing to do with the actual missing values. For example, one might think of a population where males are more reclusive or lazier than females, and thus prefer not to answer all the questions in a survey, regardless of what is actually being asked. So it's not that males withhold their salary because they earn more or less than females; they simply tend to skip a few questions in the questionnaire at random.

Note

This classification and typology of missing data was first proposed by Donald B. Rubin in 1976 in Inference and Missing Data, published in Biometrika 63(3): 581–592. It was later reviewed and extended in a book written jointly with Roderick J. A. Little: Statistical Analysis with Missing Data (2002), Wiley, which is well worth reading for further details.

The worst scenario is Missing Not at Random (MNAR), where data is missing for a specific reason that is highly related to the actual question; this classifies the missing values as nonignorable non-responses.

This happens pretty often in surveys with sensitive questions, or due to design flaws in the research preparation. In such cases, data is missing because of some latent process going on in the background, often the very thing we wanted to learn more about through the research, which can turn out to be a rather cumbersome situation.

So how can we resolve these problems? Sometimes it's relatively easy. For example, if we have a lot of observations, MCAR is not a real problem at all due to the law of large numbers, as the probability of a value being missing is the same for each observation. Basically, we have two options to deal with unknown or missing data, both demonstrated in the short sketch after this list:

  • Removing missing values and/or observations
  • Replacing missing values with some estimates
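
As a minimal sketch of both options, assuming a small made-up data frame named df, we could either drop the incomplete rows or fill the gaps with a simple estimate such as the column mean (more convenient imputation methods are covered later in this chapter):

    df <- data.frame(age    = c(25, 31, NA, 40),
                     income = c(2000, 2400, 2800, NA))

    # Option 1: remove all observations with at least one missing value
    na.omit(df)
    df[complete.cases(df), ]  # the same result via a logical row index

    # Option 2: replace missing values with an estimate, here the column mean
    df$age[is.na(df$age)] <- mean(df$age, na.rm = TRUE)

Of course, mean imputation is only one of the simplest possible estimates; the imputation packages mentioned above offer far more refined replacements.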