As we discussed in this chapter's preface, there are countless ways that a dataset may be messy. There are many other messy situations and solutions that we couldn't discuss at length here. In order that you, dear reader, are not left in the dark regarding custodial solutions, here are some other remedies which you may find helpful along your analytics journey:
Though OpenRefine (formerly Google Refine) doesn't have anything to do with R per se, it is a sophisticated tool for working with and for cleaning up messy data. Among its numerous, sophisticated capabilities is the capacity to auto-detect misspelled or mispecified categories and fix them at the click of a button.
Suppose you find that there are commas separating every third digit of the numbers in a numeric column. How would you remove them? Or suppose you needed to strip a currency symbol from values in columns that hold monetary values so that you can compute with them as numbers. These, and vastly more complicated text transformations, can be performed using regular expressions (a formal grammar for specifying the search patterns in text) and associate R functions like grep
and sub
. Any time spent learning regular expressions will pay enormous dividends over your career as an analyst, and there are many great, free tutorials available on the web for this purpose.
There are a few different ways in which you can represent the same tabular dataset. In one form—called long, narrow, stacked, or entity-attribute-value model—each row contains an observation ID, a variable name, and the value of that variable. For example:
member attribute value 1 Ringo Starr birthyear 1940 2 Paul McCartney birthyear 1942 3 George Harrison birthyear 1943 4 John Lennon birthyear 1940 5 Ringo Starr instrument Drums 6 Paul McCartney instrument Bass 7 George Harrison instrument Guitar 8 John Lennon instrument Guitar
In another form (called wide or unstacked), each of the observation's variables are stored in each column:
member birthyear instrument 1 George Harrison 1943 Guitar 2 John Lennon 1940 Guitar 3 Paul McCartney 1942 Bass 4 Ringo Starr 1940 Drums
If you ever need to convert between these representations, (which is a somewhat common operation, in practice) tidyr
is your tool for the job.
3.144.87.206