Very often, there will be errors or mistakes in data that can severely complicate analyses, especially with public data or data from outside your organization. For example, say there is a stray comma or other punctuation mark in a column that was supposed to be numeric. If we aren't careful, R will read this column as character, and subsequent analysis may, in the best-case scenario, fail; it is also possible, however, that our analysis will silently chug along and return an unexpected result. This will happen, for example, if we try to perform linear regression using the punctuation-containing-but-otherwise-numeric column as a predictor; R will convert it into a factor, thinking that it is a categorical variable.
In the worst-case scenario, an analysis of unsanitized data may neither error out nor return obviously nonsensical results, but instead return results that look plausible yet are actually incorrect. For example, it is common (for some reason) to encode missing data as 999 instead of NA; performing a regression analysis with 999s in a numeric column can severely adulterate our linear models, but often not enough to cause clearly inappropriate results. This mistake may then go undetected indefinitely.
Some problems like these could, rather easily, be detected in small datasets by visually auditing the data. Often, however, mistakes like these are notoriously easy to miss. Further, visual inspection is an untenable solution for datasets with thousands of rows and hundreds of columns. Any sustainable solution must off-load this auditing process to R. But how do we describe aberrant behavior to R so that it can catch mistakes on its own?
The assertr package seeks to do this by introducing a number of data-checking verbs. Using assertr grammar, these verbs (functions) can be combined with subjects (data) in different ways to express a rich vocabulary of data validation tasks.
More prosaically, assertr provides a suite of functions designed to verify assumptions about data early in the analysis process, before any time is wasted computing on bad data. The idea is to provide as much information as you can about how you expect the data to look up front, so that any deviation from this expectation can be dealt with immediately.
Given that the assertr grammar is designed to be able to describe a bouquet of error-checking routines, rather than list all the functions and functionality that the package provides, it will be more helpful to visit particular use cases.
Two things before we start. First, make sure you install assertr. Second, bear in mind that all data verification verbs in assertr take a data frame to check as their first argument, and either (a) return the same data frame if the check passes, or (b) produce a fatal error. Since the verbs return a copy of the chosen data frame if the check passes, the main idiom in assertr involves reassignment of the returned data frame after it passes the check:
a_dataset <- CHECKING_VERB(a_dataset, ....)
It's common for a numeric column to have a natural constraint on the values it should hold. For example, if a column represents a percent of something, we might want to check whether all the values in that column are between 0 and 1 (or 0 and 100). In assertr, we typically use the within_bounds function in conjunction with the assert verb to ensure that this is the case. For example, we could add a column to mtcars that represents each car's weight as a percent of the heaviest car's weight, and then check it:
library(assertr)
mtcars.copy <- mtcars
mtcars.copy$Percent.Max.Wt <- round(mtcars.copy$wt /
                                    max(mtcars.copy$wt), 2)
mtcars.copy <- assert(mtcars.copy, within_bounds(0,1), Percent.Max.Wt)
within_bounds is actually a function that takes the lower and upper bounds and returns a predicate: a function that returns TRUE or FALSE. The assert function then applies this predicate to every element of the column specified in the third argument. If there are more than three arguments, assert will assume there are more columns to check.
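The closure mechanics here are worth internalizing. Below is a minimal base-R stand-in for within_bounds (a simplified sketch, not assertr's actual implementation):

```r
# A function that takes bounds and *returns* a predicate function
make_bounds_check <- function(lower, upper) {
  function(x) x >= lower & x <= upper
}

in_unit_interval <- make_bounds_check(0, 1)
in_unit_interval(0.5)   # TRUE
in_unit_interval(2)     # FALSE
```

assert simply maps a predicate like this over every element of the chosen columns, and stops with an informative error if any element yields FALSE.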
Using within_bounds, we can also avoid the situation where NA values are encoded as 999, as long as the second argument to within_bounds (the upper bound) is less than this value.
within_bounds can take other information, such as whether the bounds should be inclusive or exclusive, or whether NA values should be ignored. To see the options for this function, and all the other functions in assertr, use the help function on them.
Let's see an example of what it looks like when the assert function fails:
mtcars.copy$Percent.Max.Wt[c(10,15)] <- 2
mtcars.copy <- assert(mtcars.copy, within_bounds(0,1), Percent.Max.Wt)
------------------------------------------------------------
Error: Vector 'Percent.Max.Wt' violates assertion 'within_bounds' 2 times (e.g. [2] at index 10)
We get an informative error message that tells us how many times the assertion was violated, and the index and value of the first offending datum.
With assert, we have the option of checking a condition on multiple columns at the same time. For example, none of the measurements in iris can possibly be negative. Here's how we might make sure our dataset is compliant:
iris <- assert(iris, within_bounds(0, Inf),
               Sepal.Length, Sepal.Width,
               Petal.Length, Petal.Width)

# or simply "-Species" because that
# will include all columns *except* Species
iris <- assert(iris, within_bounds(0, Inf), -Species)
On occasion, we will want to check elements for adherence to a more complicated pattern. For example, let's say we had a column that we knew was either between -10 and -20, or 10 and 20. We can check for this by using the more flexible verify verb, which takes a logical expression as its second argument; if any of the results of the logical expression are FALSE, verify will produce an error.
vec <- runif(10, min=10, max=20)
# randomly turn some elements negative
vec <- vec * sample(c(1, -1), 10, replace=TRUE)
example <- data.frame(weird=vec)
example <- verify(example, ((weird < 20 & weird > 10) |
                            (weird < -10 & weird > -20)))
# or
example <- verify(example, abs(weird) < 20 & abs(weird) > 10)   # passes

example$weird[4] <- 0
example <- verify(example, abs(weird) < 20 & abs(weird) > 10)   # fails
-------------------------------------
Error in verify(example, abs(weird) < 20 & abs(weird) > 10) :
  verification failed! (1 failure)
By default, most of the data import functions in R will attempt to guess the data type of each column at the import phase. This is usually nice, because it saves us from tedious work. However, it can backfire when there are, for example, stray punctuation marks in what are supposed to be numeric columns. To verify this, we can use the assert function with the is.numeric base function:
iris <- assert(iris, is.numeric, -Species)
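To see the underlying failure mode that this check guards against, here is a small base-R demonstration with hypothetical inline data:

```r
# One entry, "3,215", contains a stray comma, so the whole column
# is read as character (a factor in R versions before 4.0)
raw <- "weight\n2.62\n3,215\n2.32\n"
df  <- read.csv(text = raw)

class(df$weight)        # not "numeric"
as.numeric(df$weight)   # coerces; the bad entry becomes NA (with a warning)
```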
We can use the is.character and is.logical functions with assert, too.
An alternative method, which will disallow the import of unexpected data types altogether, is to specify the data type of each column at the data import phase with the colClasses optional argument:
iris <- read.csv("PATH_TO_IRIS_DATA.csv",
                 colClasses=c("numeric", "numeric", "numeric",
                              "numeric", "character"))
This solution comes with the added benefit of speeding up the data import process, since R doesn't have to waste time guessing each column's data type.
Another data integrity impropriety that is, unfortunately, very common is the mislabeling of categorical variables. There are two types of category mislabeling that can occur: an observation's class is mis-entered, mis-recorded, or mistaken for that of another class, or the observation's class is labeled in a way that is not consistent with the rest of the labels. To see an example of what we can do to combat the former case, read assertr's vignette. The latter case covers instances where, for example, the species of an iris could be misspelled (such as "versicolour" or "verginica"), or where the pattern established by the majority of class names is ignored ("iris setosa", "i. setosa", "SETOSA"). Either way, these misspecifications prove to be a great bane to data analysts for several reasons. For example, an analysis that is predicated upon a two-class categorical variable (logistic regression, for example) will now have to contend with more than two categories. Yet another way in which unexpected categories can haunt you is in producing statistics grouped by the different values of a categorical variable; if the categories were extracted from the main data manually (with subset, for example, as opposed to with by, tapply, or aggregate), you'll be missing potentially crucial observations.
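A tiny base-R illustration of that last point, using a hypothetical data frame:

```r
# One observation is labeled inconsistently ("SETOSA")
flowers <- data.frame(species      = c("setosa", "SETOSA", "setosa"),
                      petal.length = c(1.4, 1.3, 1.5))

# Manual subsetting silently drops the mislabeled row
nrow(subset(flowers, species == "setosa"))        # 2, not 3

# Grouped functions at least expose the stray category
tapply(flowers$petal.length, flowers$species, mean)
```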
If you know what categories you are expecting from the start, you can use the in_set function, in concert with assert, to confirm that all the categories of a particular column are squarely contained within a predetermined set.
# passes
iris <- assert(iris, in_set("setosa", "versicolor", "virginica"),
               Species)

# mess up the data
iris.copy <- iris
# We have to make the 'Species' column not
# a factor
iris.copy$Species <- as.vector(iris$Species)
iris.copy$Species[4:9] <- "SETOSA"
iris.copy$Species[135] <- "verginica"
iris.copy$Species[95] <- "i. versicolor"

# fails
iris.copy <- assert(iris.copy,
                    in_set("setosa", "versicolor", "virginica"),
                    Species)
-------------------------------------------
Error: Vector 'Species' violates assertion 'in_set' 8 times (e.g. [SETOSA] at index 4)
If you don't know, a priori, the categories that you should be expecting, the following incantation, which will tell you how many rows each category contains, may help you identify categories that are either rare or misspecified:
by(iris.copy, iris.copy$Species, nrow)
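Base R's table function serves the same purpose in one line; here it is applied to a hypothetical vector of species labels:

```r
# Rare or misspelled categories stand out as classes with tiny counts
species <- c(rep("setosa", 48), "SETOSA", "verginica",
             rep("versicolor", 50), rep("virginica", 49))
table(species)
```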
Automatic outlier detection (sometimes known as anomaly detection) is something that a lot of analysts scoff at and view as a pipe dream. Though the creation of a routine that automagically detects all erroneous data points with 100 percent specificity and precision is impossible, unmistakably mis-entered data points and flagrant outliers are not hard to detect even with very simple methods. In my experience, there are a lot of errors of this type.
One simple way to detect the presence of a major outlier is to confirm that every data point is within some number n of standard deviations away from the mean of its group. assertr has a function, within_n_sds, that (in conjunction with the insist verb) does just this; if we wanted to check that every numeric value in iris is within five standard deviations of its respective column's mean, we could express it thusly:
iris <- insist(iris, within_n_sds(5), -Species)
An issue with using standard deviations away from the mean (z-scores) for detecting outliers is that both the mean and standard deviation are influenced heavily by outliers; this means that the very thing we are trying to detect is obstructing our ability to find it.
There is a more robust pair of measures of central tendency and dispersion than the mean and standard deviation: the median and the median absolute deviation. The median absolute deviation is the median of the absolute deviations of a vector's elements from the vector's median.
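To make the definition concrete, here is the computation by hand on a toy vector. Note that base R's mad function scales the result by 1.4826 by default (for consistency with the standard deviation under normality), so we set constant=1 to get the raw MAD:

```r
x   <- c(2, 3, 5, 7, 100)        # 100 is an obvious outlier
med <- median(x)                 # 5

median(abs(x - med))             # median of (3, 2, 0, 2, 95) = 2
mad(x, constant = 1)             # also 2

sd(x)                            # roughly 43 -- dragged up by the outlier
```

Because the MAD barely moves when a single wild value is added, a MAD-based check retains its power to flag that value, whereas a z-score-based check loses much of it.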
assertr has a sister function to within_n_sds, within_n_mads, that checks every element of a vector to make sure it is within n median absolute deviations of its column's median.
iris <- insist(iris, within_n_mads(4), -Species)

iris$Petal.Length[5] <- 15
iris <- insist(iris, within_n_mads(4), -Species)
---------------------------------------------
Error: Vector 'Petal.Length' violates assertion 'within_n_mads' 1 time (value [15] at index 5)
In my experience, within_n_mads can be an effective guard against illegitimate univariate outliers if n is chosen carefully.
The examples here have focused on outlier identification in the univariate case: across one dimension at a time. Often, there are times when an observation is truly anomalous, but this wouldn't be evident by looking at the spread of each dimension individually. assertr has support for this type of multivariate outlier analysis, but a full discussion of it would require background outside the scope of this text.
assertr aims to make the checking of assumptions so effortless that the user never feels the need to hold back any implicit assumption. Therefore, it's expected that the user will run multiple checks on one data frame.
The usage examples that we've seen so far are really only appropriate for one or two checks. For example, a usage pattern such as the following is clearly unworkable:
iris <- CHECKING_CONSTRUCT4(CHECKING_CONSTRUCT3(CHECKING_CONSTRUCT2(CHECKING_CONSTRUCT1(this, ...), ...), ...), ...)
To combat this visual cacophony, assertr provides direct support for chaining multiple assertions using the "piping" construct from the magrittr package.
magrittr's pipe operator, %>%, works as follows: it takes the value on the left-hand side of the pipe and inserts it (by default) as the first argument of the function on the right-hand side. The following are some examples of simple magrittr usage patterns:
library(magrittr)

4 %>% sqrt                # 2
iris %>% head(n=3)        # the first 3 rows of iris

iris <- iris %>% assert(within_bounds(0, Inf), -Species)
Since the return value of a passed assertr check is the validated data frame, you can use the magrittr pipe operator to tack on more checks in a way that lends itself to easier human understanding. For example:
iris <- iris %>%
  assert(is.numeric, -Species) %>%
  assert(within_bounds(0, Inf), -Species) %>%
  assert(in_set("setosa", "versicolor", "virginica"), Species) %>%
  insist(within_n_mads(4), -Species)

# or, equivalently

CHECKS <- . %>%
  assert(is.numeric, -Species) %>%
  assert(within_bounds(0, Inf), -Species) %>%
  assert(in_set("setosa", "versicolor", "virginica"), Species) %>%
  insist(within_n_mads(4), -Species)

iris <- iris %>% CHECKS
When chaining assertions, I like to put the most integral and general ones first. I also like to put the assertions most likely to be violated near the top, so that execution is terminated before any more checks are run.
There are many other capabilities built into assertr's multivariate outlier checking. For more information about these, read the package's vignette (vignette("assertr")).
On the magrittr side, besides the forward-pipe operator, the package sports some other very helpful pipe operators. Additionally, magrittr allows the substitution on the right-hand side of the pipe operator to occur at locations other than the first argument. For more information about the wonderful magrittr package, read its vignette.