Analysis with unsanitized data

Very often, there will be errors or mistakes in data that can severely complicate analyses—especially with public data or data outside of your organization. For example, say there is a stray comma or punctuation mark in a column that was supposed to be numeric. If we aren't careful, R will read this column as character, and subsequent analysis may, in the best case scenario, fail; it is also possible, however, that our analysis will silently chug along, and return an unexpected result. This will happen, for example, if we try to perform linear regression using the punctuation-containing-but-otherwise-numeric column as a predictor, which will compel R to convert it into a factor thinking that it is a categorical variable.
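
To see how easily this happens, here is a minimal sketch in base R. The inline CSV text and its column names are made up for illustration, and the stray exclamation mark stands in for any stray punctuation.

```r
# Hypothetical CSV with a stray "!" in the otherwise-numeric 'weight' column
csv_text <- "id,weight
1,3.2
2,4.1!
3,5.0"

df <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# The whole column is read as character, not numeric
weight_class <- class(df$weight)
print(weight_class)

# Coercing back to numeric exposes the offending row as NA
weight_numeric <- suppressWarnings(as.numeric(df$weight))
bad_rows <- which(is.na(weight_numeric))
print(bad_rows)
```

A single bad cell is enough to change the type of the entire column, which is why the problem is easy to miss when only the first few rows are inspected.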

In the worst-case scenario, an analysis with unsanitized data may not error out or return nonsensical results, but return results that look plausible but are actually incorrect. For example, it is common (for some reason) to encode missing data with 999 instead of NA; performing a regression analysis with 999 in a numeric column can severely adulterate our linear models, but often not enough to cause clearly inappropriate results. This mistake may then go undetected indefinitely.
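
The following sketch, which uses made-up data, illustrates the point. A single 999 among 200 otherwise perfectly linear observations barely moves the slope but shifts the intercept, so the fitted model still looks reasonable at a glance.

```r
# Made-up data: y depends on x exactly (y = 2x + 1)
x <- 1:200
y <- 2 * x + 1

# One missing value was encoded as 999 instead of NA
y_coded <- y
y_coded[100] <- 999

clean_coef <- unname(coef(lm(y ~ x)))
coded_coef <- unname(coef(lm(y_coded ~ x)))
print(rbind(clean = clean_coef, coded = coded_coef))

# Recoding the sentinel value as NA lets lm() drop that row instead
y_fixed <- replace(y_coded, y_coded == 999, NA)
fixed_coef <- unname(coef(lm(y_fixed ~ x)))
print(fixed_coef)
```

Nothing here raises an error or a warning, so it is up to us to notice that the coefficients are off.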

Some problems like these could, rather easily, be detected in small datasets by visually auditing the data. Often, however, mistakes like these are notoriously easy to miss. Further, visual inspection is an untenable solution for datasets with thousands of rows and hundreds of columns. Any sustainable solution must off-load this auditing process to R. But how do we describe aberrant behavior to R so that it can catch mistakes on its own?

The package assertr seeks to do this by introducing a number of data checking verbs. Using assertr grammar, these verbs (functions) can be combined with subjects (data) in different ways to express a rich vocabulary of data validation tasks.

More prosaically, assertr provides a suite of functions designed to verify the assumptions about data early in the analysis process, before any time is wasted computing on bad data. The idea is to provide as much information as you can about how you expect the data to look upfront so that any deviation from this expectation can be dealt with immediately.

Given that the assertr grammar is designed to be able to describe a bouquet of error-checking routines, rather than list all the functions and functionalities that the package provides, it would be more helpful to visit particular use cases.

Two things before we start. First, make sure assertr is installed (for example, with install.packages("assertr")). Second, bear in mind that all data verification verbs in assertr take the data frame to check as their first argument, and either (a) return the same data frame if the check passes, or (b) produce a fatal error. Since the verbs return a copy of the data frame if the check passes, the main idiom in assertr involves reassigning the returned data frame after it passes the check.

a_dataset <- CHECKING_VERB(a_dataset, ....)
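
To make the idiom concrete, here is a rough sketch of what such a verb does, written in base R. The function check_no_negative is hypothetical and is not part of assertr; it only mimics the contract of returning the data on success and stopping on failure.

```r
# Hypothetical checking verb (not part of assertr): returns the data frame
# unchanged if the check passes, and stops with an error otherwise
check_no_negative <- function(data, column) {
  values <- data[[column]]
  bad <- which(!is.na(values) & values < 0)
  if (length(bad) > 0) {
    stop(sprintf("Column '%s' has %d negative value(s), e.g. at index %d",
                 column, length(bad), bad[1]))
  }
  data
}

# The reassignment idiom: the data only comes back if the check passes
cars_checked <- check_no_negative(mtcars, "mpg")

# A failing check raises an error; it is caught here only so that the
# example keeps running
broken <- mtcars
broken$mpg[2] <- -1
outcome <- tryCatch(check_no_negative(broken, "mpg"),
                    error = function(e) conditionMessage(e))
print(outcome)
```

Because a failed check stops execution, any code after the reassignment can assume that the data passed.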

Checking for out-of-bounds data

It's common for the numeric values in a column to have a natural constraint on the values they should hold. For example, if a column represents a percentage of something, we might want to check that all the values in that column are between 0 and 1 (or 0 and 100). In assertr, we typically use the within_bounds function in conjunction with the assert verb to ensure that this is the case. For example, suppose we add a column to mtcars that expresses each car's weight as a proportion of the heaviest car's weight; we can then check that every value lies between 0 and 1:

library(assertr)
mtcars.copy <- mtcars

mtcars.copy$Percent.Max.Wt <- round(mtcars.copy$wt / 
                                    max(mtcars.copy$wt),
                                    2)

mtcars.copy <- assert(mtcars.copy, within_bounds(0,1),
                     Percent.Max.Wt)

within_bounds is actually a function that takes the lower and upper bounds and returns a predicate, a function that returns TRUE or FALSE. The assert function then applies this predicate to every element of the column specified in the third argument. If there are more than three arguments, assert will assume there are more columns to check.
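
The "function that returns a predicate" pattern can be sketched in base R as follows. The name make_bounds_predicate is a hypothetical stand-in, not assertr's implementation, and this simplified version treats NA as a failure, which may not match assertr's default behavior.

```r
# Hypothetical stand-in for within_bounds: takes the bounds and returns
# a predicate (a function that returns TRUE or FALSE)
make_bounds_predicate <- function(lower, upper) {
  function(x) {
    !is.na(x) & x >= lower & x <= upper
  }
}

is_proportion <- make_bounds_predicate(0, 1)

# Apply the predicate to each element, as assert does for a column
results <- vapply(c(0.25, 1, 2, 999), is_proportion, logical(1))
print(results)
```

The bounds are captured when the predicate is created, so the same predicate can be reused across as many columns as needed.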

Using within_bounds, we can also catch the situation where missing values are encoded as 999, as long as the upper bound given to within_bounds is less than this value.

within_bounds can take other information such as whether the bounds should be inclusive or exclusive, or whether it should ignore the NA values. To see the options for this, and all the other functions in assertr, use the help function on them.

Let's see an example of what it looks like when the assert function fails:

mtcars.copy$Percent.Max.Wt[c(10,15)] <- 2
mtcars.copy <- assert(mtcars.copy, within_bounds(0,1),
                      Percent.Max.Wt)
------------------------------------------------------------
Error: 
Vector 'Percent.Max.Wt' violates assertion 'within_bounds' 2 times (e.g. [2] at index 10)

We get an informative error message that tells us how many times the assertion was violated, and the index and value of the first offending datum.

With assert, we have the option of checking a condition on multiple columns at the same time. For example, none of the measurements in iris can possibly be negative. Here's how we might make sure our dataset is compliant:

iris <- assert(iris, within_bounds(0, Inf),
               Sepal.Length, Sepal.Width,
               Petal.Length, Petal.Width)

# or simply "-Species" because that
# will include all columns *except* Species
iris <- assert(iris, within_bounds(0, Inf),
               -Species)

On occasion, we will want to check elements for adherence to a more complicated pattern. For example, let's say we had a column that we knew was either between -10 and -20, or 10 and 20. We can check for this by using the more flexible verify verb, which takes a logical expression as its second argument; if any of the results in the logical expression is FALSE, verify will cause an error.

vec <- runif(10, min=10, max=20)
# randomly turn some elements negative
vec <- vec * sample(c(1, -1), 10,
                    replace=TRUE)

example <- data.frame(weird=vec)

example <- verify(example, ((weird < 20 & weird > 10) |
                              (weird < -10 & weird > -20)))

# or

example <- verify(example, abs(weird) < 20 & abs(weird) > 10)
# passes

example$weird[4] <- 0
example <- verify(example, abs(weird) < 20 & abs(weird) > 10)
# fails
-------------------------------------
Error in verify(example, abs(weird) < 20 & abs(weird) > 10) : 
  verification failed! (1 failure)

Checking the data type of a column

By default, most of the data import functions in R will attempt to guess the data type of each column at the import phase. This is usually nice, because it saves us from tedious work. However, it can backfire when there are, for example, stray punctuation marks in what are supposed to be numeric columns. To check for this, we can use the assert function with the is.numeric base function:

iris <- assert(iris, is.numeric, -Species)

We can use the is.character and is.logical functions with assert, too.

An alternative method that will disallow the import of unexpected data types is to specify the data type that each column should be at the data import phase with the colClasses optional argument:

iris <- read.csv("PATH_TO_IRIS_DATA.csv",
                 colClasses=c("numeric", "numeric",
                              "numeric", "numeric",
                              "character"))

This solution comes with the added benefit of speeding up the data import process, since R doesn't have to waste time guessing each column's data type.
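
As a small illustration of how colClasses refuses bad input, the sketch below reads a made-up two-column CSV from a string instead of a file. The stray "!" is an assumed example of dirty data.

```r
# Hypothetical CSV with a stray "!" in the numeric column
csv_text <- "Sepal.Length,Species
5.1,setosa
4.9!,setosa"

# With colClasses, the import stops with an error instead of silently
# producing a character column; the error is caught here for inspection
import_result <- tryCatch(
  read.csv(text = csv_text, colClasses = c("numeric", "character")),
  error = function(e) e
)

is_error <- inherits(import_result, "error")
print(is_error)
```

The failure happens at import time, before any analysis code has a chance to run on the malformed column.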

Checking for unexpected categories

Another data integrity problem that is, unfortunately, very common is the mislabeling of categorical variables. Two types of mislabeling can occur: an observation's class is mis-entered, mis-recorded, or mistaken for that of another class; or the observation's class is labeled in a way that is not consistent with the rest of the labels. To see an example of what we can do to combat the former case, read assertr's vignette. The latter case covers instances where, for example, the species of iris is misspelled (such as "versicolour" or "verginica"), or where the pattern established by the majority of class names is ignored ("iris setosa", "i. setosa", "SETOSA").

Either way, these misspecifications are a great bane to data analysts for several reasons. For example, an analysis that is predicated upon a two-class categorical variable (for example, logistic regression) will now have to contend with more than two categories. Unexpected categories can also haunt you when you produce statistics grouped by the values of a categorical variable; if the categories were extracted from the main data manually (with subset, for example, as opposed to with by, tapply, or aggregate), you'll be missing potentially crucial observations.

If you know what categories you are expecting from the start, you can use the in_set function, in concert with assert, to confirm that all the categories of a particular column are squarely contained within a predetermined set.

# passes
iris <- assert(iris, in_set("setosa", "versicolor",
                            "virginica"), Species)

# mess up the data
iris.copy <- iris
# We have to make the 'Species' column not
# a factor
iris.copy$Species <- as.vector(iris$Species)
iris.copy$Species[4:9] <- "SETOSA"
iris.copy$Species[135] <- "verginica"
iris.copy$Species[95] <- "i. versicolor"

# fails
iris.copy <- assert(iris.copy, in_set("setosa", "versicolor",
                                      "virginica"), Species)
-------------------------------------------
Error: 
Vector 'Species' violates assertion 'in_set' 8 times (e.g. [SETOSA] at index 4)

If you don't know the categories that you should be expecting, a priori, the following incantation, which will tell you how many rows each category contains, may help you identify the categories that are either rare or misspecified:

by(iris.copy, iris.copy$Species, nrow)
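
An equivalent tally can be produced with base R's table function. The sketch below rebuilds the mislabeled species vector so that it runs on its own; the cutoff of 10 observations for "rare" is an arbitrary assumption chosen for this example.

```r
# Rebuild the mislabeled species labels from the built-in iris data
species <- as.character(iris$Species)
species[4:9] <- "SETOSA"
species[135] <- "verginica"
species[95] <- "i. versicolor"

# Count the rows per category, rarest first
counts <- sort(table(species))
print(counts)

# Flag categories with suspiciously few observations
# (the threshold of 10 is arbitrary)
rare <- names(counts)[counts < 10]
print(rare)
```

Rare categories are not necessarily wrong, but they are a good place to start looking.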

Checking for outliers, entry errors, or unlikely data points

Automatic outlier detection (sometimes known as anomaly detection) is something that a lot of analysts scoff at and view as a pipe dream. Though the creation of a routine that automagically detects all erroneous data points with 100 percent specificity and precision is impossible, unmistakably mis-entered data points and flagrant outliers are not hard to detect even with very simple methods. In my experience, there are a lot of errors of this type.

One simple way to detect the presence of a major outlier is to confirm that every data point is within some number n of standard deviations of the mean of the group. assertr has a function, within_n_sds, that is used in conjunction with the insist verb to do just this; if we wanted to check that every numeric value in iris is within five standard deviations of its respective column's mean, we could express that as follows:

iris <- insist(iris, within_n_sds(5), -Species)

An issue with using standard deviations away from the mean (z-scores) for detecting outliers is that both the mean and standard deviation are influenced heavily by outliers; this means that the very thing we are trying to detect is obstructing our ability to find it.
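
A small made-up example shows this masking effect. The value 1000 is an obvious outlier, yet it inflates the mean and standard deviation so much that its own z-score stays below a common cutoff of 3.

```r
# Made-up measurements with one flagrant outlier
x <- c(10, 11, 9, 10, 12, 10, 11, 1000)

# z-scores computed from the contaminated mean and standard deviation
z <- (x - mean(x)) / sd(x)
max_abs_z <- max(abs(z))
print(round(max_abs_z, 2))

# Nothing exceeds the cutoff of 3, so the outlier goes unflagged
flagged <- which(abs(z) > 3)
print(length(flagged))
```

With only eight observations, no single point can lie more than (8 - 1) / sqrt(8), or about 2.47, sample standard deviations from the mean, so a cutoff of 3 can never fire here.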

There are more robust measures of central tendency and dispersion than the mean and standard deviation: the median and the median absolute deviation. The median absolute deviation is the median of the absolute differences between each element of a vector and the vector's median.
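
The definition can be checked by hand on a small made-up vector. Note that base R's mad function multiplies the result by a scaling constant of 1.4826 by default, so the sketch passes constant = 1 to get the unscaled value.

```r
# Made-up vector with one extreme value
x <- c(1, 2, 3, 4, 100)

# Median of the absolute differences from the median
manual_mad <- median(abs(x - median(x)))

# Base R's mad() with the scaling constant turned off
builtin_mad <- mad(x, constant = 1)
print(c(manual = manual_mad, builtin = builtin_mad))

# Distance of each point from the median, in units of the unscaled MAD
robust_score <- abs(x - median(x)) / manual_mad
print(robust_score)
```

Unlike the mean and standard deviation, the median and the median absolute deviation are barely affected by the extreme value, so that value stands out clearly.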

assertr has a sister to within_n_sds, within_n_mads, that checks every element of a vector to make sure it is within n median absolute deviations away from its column's median.

iris <- insist(iris, within_n_mads(4), -Species)
iris$Petal.Length[5] <- 15
iris <- insist(iris, within_n_mads(4), -Species)
---------------------------------------------
Error: 
Vector 'Petal.Length' violates assertion 'within_n_mads' 1 time (value [15] at index 5)

In my experience, within_n_mads can be an effective guard against illegitimate univariate outliers if n is chosen carefully.

The examples here have focused on outlier identification in the univariate case, that is, across one dimension at a time. Sometimes an observation is truly anomalous, but this isn't evident from the spread of each dimension individually. assertr has support for this type of multivariate outlier analysis, but a full discussion of it would require a background outside the scope of this text.

Chaining assertions

The assertr package aims to make the checking of assumptions so effortless that the user never feels the need to hold back any implicit assumption. Therefore, it's expected that the user will run multiple checks on one data frame.

The usage examples that we've seen so far are really only appropriate for one or two checks. For example, a usage pattern such as the following is clearly unworkable:

iris <- CHECKING_CONSTRUCT4(CHECKING_CONSTRUCT3(CHECKING_CONSTRUCT2(CHECKING_CONSTRUCT1(this, ...), ...), ...), ...)

To combat this visual cacophony, assertr provides direct support for chaining multiple assertions by using the "piping" construct from the magrittr package.

magrittr's pipe operator, %>%, works as follows: it takes the item on the left-hand side of the pipe and inserts it (by default) into the position of the first argument of the function on the right-hand side. The following are some examples of simple magrittr usage patterns:

library(magrittr)
4 %>% sqrt              # 2
iris %>% head(n=3)      # the first 3 rows of iris
iris <- iris %>% assert(within_bounds(0, Inf), -Species)

Since the return value of a passed assertr check is the validated data frame, you can use the magrittr pipe operator to tack on more checks in a way that lends itself to easier human understanding. For example:

iris <- iris %>%
  assert(is.numeric, -Species) %>%
  assert(within_bounds(0, Inf), -Species) %>%
  assert(in_set("setosa", "versicolor", "virginica"), Species) %>%
  insist(within_n_mads(4), -Species)

# or, equivalently

CHECKS <- . %>%
  assert(is.numeric, -Species) %>%
  assert(within_bounds(0, Inf), -Species) %>%
  assert(in_set("setosa", "versicolor", "virginica"), Species) %>%
  insist(within_n_mads(4), -Species)

iris <- iris %>% CHECKS
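
The idea that each check hands the validated data frame to the next one does not depend on magrittr. As a rough sketch of the same pattern, the following base R code uses Reduce in place of the pipe, with two hypothetical check functions standing in for assertr verbs.

```r
# Hypothetical checks (not assertr functions): each returns the data
# unchanged on success and stops with an error on failure
check_numeric_cols <- function(data) {
  numeric_cols <- setdiff(names(data), "Species")
  stopifnot(all(vapply(data[numeric_cols], is.numeric, logical(1))))
  data
}

check_non_negative <- function(data) {
  numeric_cols <- setdiff(names(data), "Species")
  stopifnot(all(unlist(data[numeric_cols]) >= 0, na.rm = TRUE))
  data
}

# Run the checks in order, feeding each result into the next check
checks <- list(check_numeric_cols, check_non_negative)
validated <- Reduce(function(data, check) check(data), checks, iris)
```

The pipe version reads more naturally, but both rely on the same contract: a check that passes returns its input unchanged.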

When chaining assertions, I like to put the most integral and general one right at the top. I also like to put the assertions most likely to be violated right at the top so that execution is terminated before any more checks are run.

There are many other capabilities built into assertr, including multivariate outlier checking. For more information about these, read the package's vignette (vignette("assertr")).

On the magrittr side, besides the forward-pipe operator, this package sports some other very helpful pipe operators. Additionally, magrittr allows the substitution at the right side of the pipe operator to occur at locations other than the first argument. For more information about the wonderful magrittr package, read its vignette.
