Other messiness

As we discussed in this chapter's preface, there are countless ways that a dataset may be messy. There are many other messy situations and solutions that we couldn't discuss at length here. In order that you, dear reader, are not left in the dark regarding custodial solutions, here are some other remedies which you may find helpful along your analytics journey:

OpenRefine

Though OpenRefine (formerly Google Refine) doesn't have anything to do with R per se, it is a sophisticated tool for working with and for cleaning up messy data. Among its numerous, sophisticated capabilities is the capacity to auto-detect misspelled or mispecified categories and fix them at the click of a button.

Regular expressions

Suppose you find that there are commas separating every third digit of the numbers in a numeric column. How would you remove them? Or suppose you needed to strip a currency symbol from values in columns that hold monetary values so that you can compute with them as numbers. These, and vastly more complicated text transformations, can be performed using regular expressions (a formal grammar for specifying the search patterns in text) and associate R functions like grep and sub. Any time spent learning regular expressions will pay enormous dividends over your career as an analyst, and there are many great, free tutorials available on the web for this purpose.

tidyr

There are a few different ways in which you can represent the same tabular dataset. In one form—called long, narrow, stacked, or entity-attribute-value model—each row contains an observation ID, a variable name, and the value of that variable. For example:

           member  attribute  value
1     Ringo Starr  birthyear   1940
2  Paul McCartney  birthyear   1942
3 George Harrison  birthyear   1943
4     John Lennon  birthyear   1940
5     Ringo Starr instrument  Drums
6  Paul McCartney instrument   Bass
7 George Harrison instrument Guitar
8     John Lennon instrument Guitar

In another form (called wide or unstacked), each of the observation's variables are stored in each column:

           member birthyear instrument
1 George Harrison      1943     Guitar
2     John Lennon      1940     Guitar
3  Paul McCartney      1942       Bass
4     Ringo Starr      1940      Drums

If you ever need to convert between these representations, (which is a somewhat common operation, in practice) tidyr is your tool for the job.

tidyr
tidyr
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.143.203.146