R offers several built-in functions and packages for checking the quality of data. The most commonly used is the summary function in base R:
# Packages used: psych, pastecs, dataMaid, daff
# install.packages(c("psych", "pastecs", "dataMaid", "daff"))
state <- data.frame(state.x77)
state$State <- row.names(state)
state
summary(state)
The output of the preceding code is as follows:
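Beyond summary, a few base R one-liners cover the most common quality checks. The following is a minimal sketch, not part of the original example: the variable names (na_counts, n_dup, col_types, out_of_range) are illustrative, and the range rule assumes that Illiteracy is recorded as a percentage.

```r
state <- data.frame(state.x77)
state$State <- row.names(state)

# Number of missing values in each column
na_counts <- colSums(is.na(state))

# Number of fully duplicated rows
n_dup <- sum(duplicated(state))

# Class of each column, to spot columns read in with the wrong type
col_types <- sapply(state, class)

# Rows violating an assumed range rule (Illiteracy taken to be a percentage)
out_of_range <- state[state$Illiteracy < 0 | state$Illiteracy > 100, ]

na_counts
n_dup
col_types
nrow(out_of_range)
```

A non-zero count from any of these checks points to the rows or columns worth inspecting before further analysis.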
library(psych)
describe(state)
The output of the preceding code is as follows:
You can also use describeBy (named describe.by in older versions of psych, where it has since been deprecated) to get summary information on a per-group basis, as shown:
describeBy(state, state$State)
The following is the output:
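Grouping by State puts each state in its own one-row group, so the per-group statistics are not very informative. The following is a minimal sketch with a coarser grouping. It assumes the built-in state.region factor as the grouping variable (not used in the original example) and calls describeBy, the current psych name for this function; state_num and by_region are illustrative names.

```r
library(psych)

# Numeric columns only, so every statistic is meaningful
state_num <- data.frame(state.x77)

# state.region is a built-in factor with one entry per state,
# in the same order as the rows of state.x77
by_region <- describeBy(state_num, group = state.region)

# One summary table per region
by_region
```

Each element of the result holds the same statistics that describe reports, computed over the states in one region.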
Or, for a comprehensive statistical description, you can use stat.desc from pastecs, as shown:
library(pastecs)
stat.desc(state)
The output of the preceding code is as follows:
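stat.desc can also report distribution-shape statistics. The following is a minimal sketch, assuming only the numeric columns are of interest: setting norm = TRUE adds skewness, kurtosis, and Shapiro-Wilk normality test rows to the default output. The names state_num and desc_stats are illustrative.

```r
library(pastecs)

# Numeric columns only; stat.desc returns NA for non-numeric columns
state_num <- data.frame(state.x77)

# norm = TRUE appends skewness, kurtosis, and normality-test rows
desc_stats <- stat.desc(state_num, norm = TRUE)

# Round for readability
round(desc_stats, 2)
```

The result has one column per variable and one row per statistic, so a single statistic can be pulled out by row name, for example desc_stats["skewness", ].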
Among other utilities, a more recent package called dataMaid makes it easy to generate a high-level overview of all of the data contained in a dataset using a one-line command, as follows:
library(dataMaid)
makeDataReport(state)
The output of the preceding code is as follows:
We often need to find the differences between two versions of a dataset after some information changes. This can be done iteratively by inspecting individual columns, but the daff package can render the changes visually, much like the diffs you may have seen on sites such as GitHub:
library(daff)
state <- data.frame(state.x77)
state2 <- state
identical(state, state2)
state2$Population <- state2$Population + 1
diff_data(state, state2)
The output of the preceding code is as follows:
diff_info <- diff_data(state, state2)
render_diff(diff_info)
The output of the preceding code is as follows:
You can also patch the data using patch_data and merge datasets using merge_data. More information can be found on the developer's website.
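As a rough illustration of patching, the following sketch computes a diff between two small tables and applies it to the first one with patch_data. The five-row subset and the names original, modified, patch, and patched are assumptions made for this example and are not part of the original text.

```r
library(daff)

# Small illustrative table: the first five states, with State as a column
original <- head(data.frame(state.x77), 5)
original$State <- row.names(original)
row.names(original) <- NULL

# A modified copy with a single changed value
modified <- original
modified$Population[1] <- modified$Population[1] + 1

# Compute the diff, then apply it to the original as a patch
patch <- diff_data(original, modified)
patched <- patch_data(original, patch)

# Compare the patched column with the modified one
all.equal(patched$Population, modified$Population)
```

merge_data works along similar lines, but takes a common parent table and two modified versions of it.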