© Matt Wiley, Joshua F. Wiley 2020
M. Wiley, J. F. Wiley, Beginning R 4, https://doi.org/10.1007/978-1-4842-6053-1_3

3. Data Input and Output

Matt Wiley1  and Joshua F. Wiley2
(1)
Victoria College, Victoria, TX, USA
(2)
Turner Institute for Brain and Mental Health, Monash University, Melbourne, Australia
 

In general, when performing a statistical analysis, the majority of time ends up being spent dealing with data. This involves successfully inputting data as well as organizing or cleaning it. For now, our goal is to recognize various types of data files and use that knowledge to import and export data to and from R.

By the end of this chapter, you should be able to
  • Recognize common data file types by suffix.

  • Apply R packages and functions to import data into R.

  • Apply R packages and functions to export data out of R.

3.1 Setup

In order to practice inputting and outputting the various types of files you may use, we need to do some setup. To continue practicing creating and using projects, we will start a new project for this chapter. Inside that project, we will also create some folders and download some data files from this book’s GitHub site.

If necessary, review the steps in Chapter 1 to create a new project. After starting RStudio, on the upper-left menu ribbon, select File and click New Project. Choose New Directory, then New Project, set the Directory name to 03DataIO, and select Create Project. To create an R script file, on the top ribbon, right under the word File, click the small icon with a plus sign on top of a blank bit of paper, and select the R Script menu option. Click the floppy disk–shaped save icon, name this file DataIO.R, and click Save.

In the lower-right pane, you should see your project’s two files, and right under the Files tab, click the button titled New Folder. In the New Folder pop-up, type data and select OK. In the lower-right pane, click your new data folder. Repeat the folder creation process, making a new folder titled ch03.

Next, you need to download some files from this book’s site. They should live on GitHub at http://github.com/apress/beginning-r-4. However, the permanent link can also be found on the website http://mattwiley.org/publications.html by clicking Code for Beginning R. Navigate to this chapter’s folder and download the Texas counties data [3] as well as the American Community Survey [4] data into the ch03 folder you just created. In all, you should have six files in your project’s subfolders.

With that set up, you are ready to practice importing and exporting data.

If you need any refresher or quick tips on where to find things in RStudio as you go through the chapter, look at this short “cheat sheet” which has visuals and labels for most of RStudio’s features: https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf.

3.2 Input

When inputting files into R, the usual way data arrive is in some sort of table format where there are named columns or variables and numbered rows or observations. Each column thus represents a variable, and each row represents a specific measurement of those variables.

The usual challenges to inputting these types of “mostly decent” data tend to be successfully reading the type of file storing the data and adapting to any unique aspects of different data files (such as the first few rows being of a different format than expected). An example of such a foible might be that a data gatherer/creator may include some notes or comments in the first few lines of an Excel file. It is only after these notes that the actual columns and rows of data start.

Note

When we say the type of file, we mean things like an Excel spreadsheet, a comma-separated text file, or a dataset from another proprietary piece of software. Almost always, the type of a file is indicated by its extension. The file extension is the last few characters of the file name after the period. For example, an Excel file might be called “data.xlsx”. In this case, the file name is “data” and the extension is “.xlsx”; the extension, the “.xlsx” part, is what tells us that it is an Excel file. Other common extensions are “.csv”, which means a text file with cells separated by commas, and “.rds”, which is R’s own format for storing data. All of these file types can be told apart by looking at the extension.
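
If you ever want R itself to report a file’s extension, the tools package that ships with base R has a file_ext() function. The following is a small sketch; the file names here are invented purely for illustration:
# file_ext() returns everything after the final period in a file name
tools::file_ext("data.xlsx")
## [1] "xlsx"
tools::file_ext("Counties_in_Texas.csv")
## [1] "csv"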

Thus, our first task is to identify the correct function (and package) to read in our data as well as determine whether we need to skip some rows. The majority of functions written to input data have at least three arguments in common: the location of the data file, the number of rows to skip, and whether the data have a header row with column names. It is worth noting that most data input functions default to assuming the first row is a column name header and that no rows should be skipped.

Most input functions include many options – and in general most of those do not need to be used. In fact, the easiest way to take care of some messiness in a file is often to open it yourself by hand, fix it manually, and then import into R. On the other hand, if you routinely get a file (perhaps some sort of daily update from some company or third party) and that file is predictable in the way it does not work nicely, it may be worth exploring additional function options to code an automatic read.

Manual Entry

While we usually focus on input as something that involves some external file to read in, it is possible to input data by hand in R. This can be useful in several scenarios, ranging from setting up a few variables to more extensive data entry. In R, the assignment operator is <-. These two symbols together, the less-than character and the -, are both found on the right-hand side of your keyboard, near M and P, respectively. It helps to give our data a name that makes sense; for example, if using the year 2020 as a variable, we may call it academicYear.

Note

To help code be pronounceable, we often make data variables nouns. For the same reason, functions are often verbs (e.g., plot()).

To read the following line of code out loud, we would say “the variable academicYear is assigned the number 2020”:
academicYear <- 2020
Notice that after running the preceding line, in your top-right Environment pane, there is now a value for academicYear that shows the value 2020. If we need to use the academic year in our code, we could now easily just type academicYear and recover our result:
academicYear
## [1] 2020
Revisiting a familiar data set, mtcars, let us explore how we might use assignment in another way. Recall that the wt variable holds the weight values of the cars. Should we wish to use those values, as we saw before, we could specifically call out that single column by assigning it to a variable name. Again, after assignment, we are allowed to use our variable directly:
mtcarsWeight <- mtcars$wt
mtcarsWeight
##  [1]  2.620 2.875 2.320 3.215 3.440 3.460 3.570 3.190 3.150 3.440 3.440
## [12]  4.070 3.730 3.780 5.250 5.424 5.345 2.200 1.615 1.835 2.465 3.520
## [23]  3.435 3.840 3.845 1.935 2.140 1.513 3.170 2.770 3.570 2.780
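
Manual entry is not limited to single numbers. Should you need a handful of values, the c() function combines them into a single vector. Here is a brief sketch; the exam scores are invented for illustration:
# c() combines individual values into one vector (invented example scores)
examScores <- c(88, 92, 79, 95)
examScores
## [1] 88 92 79 95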

While manual assignment is simple, it would be very tedious to type all the data we would ever use directly into R. We move on to using some of the packages we installed in Chapter 2.

CSV: .csv

Perhaps the most common data type is a table of information stored in the comma-separated values format, which has the file extension .csv. While R can read such files natively, the data.table package has a function named fread() that reads in CSV files very fast indeed. This function has many possible inputs or arguments, although we discuss only three now. This function’s first input or argument is file = "". Because an RStudio project defaults to the project as the working directory (which we showed how to set in Chapter 1), we need only give the folder file path data/ch03/ and the file name Counties_in_Texas.csv to this argument. The second argument, header, defaults to an automatic feature which usually does a good job of guessing if there is a header. Should you know your file does or does not have a header, then you can manually set this argument to either TRUE or FALSE. The third argument we discuss is skip. It takes an integer (i.e., a whole number, without any decimal places, like 0, 1, 2, 3) that is the number of rows you wish to skip. The default is zero; therefore, like header, you can often safely ignore this argument. You cannot skip part of a row, so something like skip = 1.25 or skip = 3.5 is not valid.

While packages are normally placed at the beginning of a set of code, in this case, we include the library() call to the data.table package here, to help remind us that without that library call, fread() will not work. We meet the assignment operator again, as we assign our data to a variable named csvData:
library(data.table)
## data.table 1.13.0 using 6 threads (see ?getDTthreads). Latest news: r-datatable.com
csvData <- fread(file = "data/ch03/Counties_in_Texas.csv",
                  header = TRUE,
                  skip = 0)

Notice that after running the preceding line of code, while there is no output to the Console, our Environment pane definitely has 254 observations of 10 variables. You may click the arrow icon just left of csvData, to see the types of data just read. For example, CountyName is of type chr which stands for character data, while CountyNumber is of type int for integer or whole number. Look at https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf for more ways to interact with data through RStudio rather than just through code.

With 254 observations or rows, our data might be a bit long to visually inspect in full. A newer way to explore data can be seen in your top-right Environment pane. Next to each data set there are a dot with an arrow in it on the left side and a square table grid icon on the far right. Clicking the left-side dot and arrow lets you see what types of data are in each data set. In addition to showing the column names, it will also show the types of data (e.g., integer or Date or numeric). Another option is to click the square table grid on the far right. This will open the data set for viewing in a new tab. Closing that tab should get you back to your code area.

Often, it is enough to view just the first few rows. The head() function allows us to inspect the first six rows by default:
head(csvData)
##    CountyName FIPSNumber CountyNumber PublicHealthRegion
## 1:   Anderson          1            1                  4
## 2:    Andrews          3            2                  9
## 3:   Angelina          5            3                  5
## 4:    Aransas          7            4                 11
## 5:     Archer          9            5                  2
## 6:  Armstrong         11            6                  1
##    HealthServiceRegion MetropolitanStatisticalArea_MSA
## 1:                4/5N                              --
## 2:                9/10                              --
## 3:                4/5N                              --
## 4:                  11                  Corpus Christi
## 5:                 2/3                   Wichita Falls
## 6:                   1                        Amarillo
##    MetropolitanDivisions MetroArea NCHSUrbanRuralClassification_2006
## 1:                    -- Non-Metro                      Micropolitan
## 2:                    -- Non-Metro                      Micropolitan
## 3:                    -- Non-Metro                      Micropolitan
## 4:                    --     Metro                      Medium Metro
## 5:                    --     Metro                       Small Metro
## 6:                    --     Metro                       Small Metro
##    NCHSUrbanRuralClassification_2013
## 1:                      Micropolitan
## 2:                      Micropolitan
## 3:                      Micropolitan
## 4:                      Medium Metro
## 5:                       Small Metro
## 6:                      Medium Metro

While it may not quite seem like it yet, you are now also doing statistics. Exploratory data analysis (EDA) is the first step in most analytics. It may not be fancy, but head() is an example of such exploration.
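
Before moving on, it is worth seeing the skip argument earn its keep. Imagine a file whose first two lines are free-text notes from the data creator, with the real header only on the third line. The following sketch manufactures such a file; the file name and its contents are invented for illustration only:
# writeLines() creates a small, made-up file with two note lines above the header
writeLines(c("Notes from the data creator",
             "Counts were collected in 2020",
             "id,score",
             "1,10",
             "2,12"),
           "data/ch03/messyExample.csv")
# skip = 2 jumps past the two note lines; the third line holds the column names
messyData <- fread(file = "data/ch03/messyExample.csv",
                   skip = 2,
                   header = TRUE)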

Excel: .xlsx or .xls

Microsoft’s Excel format is another common data format. The read_excel() function from the readxl package inputs such files and again has some useful default options. In addition to the usual three features of file path, column names, and skip, we discuss a fourth argument, sheet. Excel files often have more than one sheet. The default is the first sheet, although you may enter either a number giving your sheet’s position or a quoted text string naming the sheet (Excel often defaults to “Sheet1” for the first sheet).

For our data set, the variable excelData is assigned the output of the read_excel() function for the .xlsx version of the counties in Texas data. While we set sheet = 1 to capture our sheet by sheet order, we could just as well have used sheet = "Counties_in_Texas". To show this in our code itself, we use a comment, which is created with the pound sign or hashtag character (#). Comments in our code are for humans to read and are not run by R:
library(readxl)
excelData <- read_excel(path = "data/ch03/Counties_in_Texas.xlsx",
                        sheet = 1, #or sheet = "Counties_in_Texas"
                        col_names = TRUE,
                        skip = 0
                        )
This is the same data set as was read from the CSV file. Or is it? Using the head() function on our new data set, compare the FIPSNumber columns and notice they look slightly different. A closer inspection (such as via the arrow drop-down icon in the Environment pane) shows that FIPSNumber is of type int in csvData while it is of type chr in excelData:
head(excelData)
## # A tibble: 6 x 10
##   CountyName FIPSNumber CountyNumber PublicHealthReg˜ HealthServiceRe˜
##   <chr>      <chr>             <dbl>            <dbl> <chr>
## 1 Anderson   001                   1                4 4/5N
## 2 Andrews    003                   2                9 9/10
## 3 Angelina   005                   3                5 4/5N
## 4 Aransas    007                   4               11 11
## 5 Archer     009                   5                2 2/3
## 6 Armstrong  011                   6                1 1
## # ... with 5 more variables: MetropolitanStatisticalArea_MSA <chr>,
## #   MetropolitanDivisions <chr>, MetroArea <chr>,
## #   NCHSUrbanRuralClassification_2006 <chr>,
## #   NCHSUrbanRuralClassification_2013 <chr>

When using functions written by different package authors, some default settings, and some choices, are often not the same. When reading data into R, it is generally wise to inspect your data and understand a little more about just what has been read. Another difference you may notice is that our Excel data mentions the word tibble. We save an in-depth discussion of data structure choices for our Advanced R: Data Programming and the Cloud [23] book. While it is good to notice this difference, we shall ignore it for now.
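
One quick, code-based way to catch such differences is to ask each column for its type directly. The following is a small sketch using the base R functions class() and str():
# class() reports the storage type of a single column
class(csvData$FIPSNumber)
## [1] "integer"
class(excelData$FIPSNumber)
## [1] "character"
# str() prints a compact overview of every column's type in one go
str(excelData)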

RDS: .rds

We started off showing how to read in data sets from CSV or Excel files, because those are probably the most common way of storing and sharing data sets. They are popular in part because almost everyone on any computer can open and see them, without needing extra software they do not already have.

R also has its own data format, named R Data “Storage”, which has the extension .rds. The ability to read and write .rds files is built into R, so we do not need any extra packages. Additionally, because these files come from R, there is only one argument of interest, the file = argument. R automatically knows how to read these files in: which values are variables, the type of each variable, what the variable names versus data are, any notes, and so on:
rData <- readRDS(file = "data/ch03/Counties_in_Texas.rds")
It would get repetitive to see the same data too many times. Thus, we introduce the tail() function which shows the last six rows of our data set:
tail(rData)
##    CountyName FIPSNumber CountyNumber PublicHealthRegion
## 1:       Wise        497          249                  3
## 2:       Wood        499          250                  4
## 3:     Yoakum        501          251                  1
## 4:      Young        503          252                  2
## 5:     Zapata        505          253                 11
## 6:     Zavala        507          254                  8
##    HealthServiceRegion MetropolitanStatisticalArea_MSA
## 1:                 2/3     Dallas-Fort Worth-Arlington
## 2:                4/5N                              --
## 3:                   1                              --
## 4:                 2/3                              --
## 5:                  11                              --
## 6:                   8                              --
##      MetropolitanDivisions MetroArea
## 1: Fort Worth-Arlington MD     Metro
## 2:                      -- Non-Metro
## 3:                      -- Non-Metro
## 4:                      -- Non-Metro
## 5:                      -- Non-Metro
## 6:                      -- Non-Metro
##    NCHSUrbanRuralClassification_2006
## 1:                Large Fringe Metro
## 2:                           Noncore
## 3:                           Noncore
## 4:                           Noncore
## 5:                           Noncore
## 6:                           Noncore
##    NCHSUrbanRuralClassification_2013
## 1:                Large Fringe Metro
## 2:                          Non-core
## 3:                          Non-core
## 4:                          Non-core
## 5:                      Micropolitan
## 6:                          Non-core

Other Proprietary Formats

There are several other pieces of statistical software that are often used to store and analyze data. If you have a colleague who uses one of these programs, you may get a data file that uses that software’s data format. We are going to look at three of these very briefly, so that if you ever come across such data, you know how to get them into R.

SPSS: .sav

The last three formats we discuss all use the haven package. Keep in mind there is no need to run library(haven) more than once per R session. Thus, we run it just here. By now, the usual arguments are familiar. The IBM Corporation owns the data software known as SPSS. Files from this software can have the extension .sav. The function call from haven to read this format is read_spss():
library(haven)
spssData <- read_spss(file = "data/ch03/Counties_in_Texas.sav",
                      skip = 0)

Stata: .dta

Stata is a statistical data software with a file extension of .dta. The haven package also has a read_stata() function to read in this type of file:
stataData <- read_stata("data/ch03/Counties_in_Texas.dta",
                        skip = 0)

SAS: .sas7bdat

Our last statistical software data format is SAS, which has a rather long extension of .sas7bdat. We use a file from the US Census’s American Community Survey [4] for variety. This particular file has over 38 thousand observations. Data can get very large these days. All the same, note that R reads this file in fast enough. Data must generally get quite large indeed before modern memory has any difficulty:
sasData <- read_sas("data/ch03/ACS_1yr_Seq_Table_Number_Lookup.sas7bdat",
                    skip = 0)
head(sasData)
## # A tibble: 6 x 9
##   filed tblid  seq   order position cells total title    subject_area
##   <chr> <chr>  <chr> <dbl>    <dbl> <chr> <dbl> <chr>    <chr>
## 1 ACSSF B00001 0001     NA        7 "1 CE˜   NA UNWEIGH˜ "Unweighted˜
## 2 ACSSF B00001 0001     NA       NA ""       NA Univers˜ ""
## 3 ACSSF B00001 0001      1       NA ""       NA Total    ""
## 4 ACSSF B00002 0001     NA        8 "1 CE˜    2 UNWEIGH˜ "Unweighted˜
## 5 ACSSF B00002 0001     NA       NA ""       NA Univers˜ ""
## 6 ACSSF B00002 0001      1       NA ""       NA Total    ""
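
If you want to confirm just how large a data set is without opening a viewer, the base R functions nrow() and dim() report the number of rows and the rows-by-columns dimensions. A quick sketch (the exact counts depend on the file you downloaded):
# nrow() returns the number of observations; dim() returns rows, then columns
nrow(sasData)
dim(sasData)
# both should report well over 38 thousand rows for this particular file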

3.3 Output

Getting data into R is usually the first stage of an analysis project. Writing the data out of R to share with stakeholders or collaborators is usually one of the last stages. In an R project, it is often a good idea to separate data we read in from the data we write out. We already created a folder named data for the files to input. Now, we create a new folder, using the same New Folder icon in the lower-right pane, named output.

Any data in R can be written out. It does not matter where those data come from, only that they are in our global environment. Thus, options for writing out in the following examples include mtcars [1], penguins [11], and csvData, excelData, rData, sasData, spssData, or stataData. These are all part of our current instance of R. Because the mtcars and penguins data sets are built into packages, they do not show up in the Environment tab, but they can still be written. In other words, we could write the sasData to a CSV file, and all would be well. While it is not an “official” feature of R, this can be used to convert between file types. To keep the function calls simple, we will use the R data, rData.
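
To see exactly which objects live in your global environment, and are therefore ready to be written out, the base R function ls() lists them by name. A brief sketch:
# ls() lists objects currently in the global environment; mtcars and penguins
# will not appear because they live inside packages, yet they can still be written
ls()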

CSV

For universal portability, a CSV file format is most often the answer; most software suites can read in CSV files by default. Keep in mind even R has a base function that reads in CSV files; it is only our choice that leads us to use the data.table option. Similarly, most software systems can readily save or write or export CSV files. Again, R has this built-in feature as well, although we choose to use the data.table function fwrite(). We use two arguments here. The first is set to the data we wish to write out. The second is the file path and name of where our data will be written:
fwrite(x = rData,
       file = "output/ch3_txCounties.csv")

Excel

Microsoft’s Excel format is another widely usable format, and the writexl package provides a function, write_xlsx(), to achieve this. Two additional features of this function are an option to suppress the column names (the default, as shown, is TRUE, meaning names are written) and an option to skip centering and bolding the column names (again, the default, as shown, is TRUE). For office environments where colleagues use Excel, these defaults are usually a very human-friendly way to share data out to decision makers. The default bolding of the column names in particular helps reduce confusion about the meaning of shared data:
library(writexl)
write_xlsx(x = rData,
           path = "output/ch3_txCounties.xlsx",
           col_names = TRUE,
           format_headers = TRUE)

RDS

Our last export format is R’s default format. Sometimes, objects are simply saved for later use. If you intend to reload a data set for later work, the saveRDS() function is a very space-efficient choice. On our system, the RDS file is less than half the size of the CSV file:
saveRDS(object = rData,
        file = "output/ch3_txCounties.RDS")
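
If you are curious about the size savings on your own system, the base R function file.size() reports file sizes in bytes. The exact numbers will vary from system to system, so treat this as a sketch:
# file.size() returns the size of each file in bytes; on our system, the .RDS
# file is less than half the size of the .csv file, though your numbers may differ
file.size("output/ch3_txCounties.csv")
file.size("output/ch3_txCounties.RDS")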

3.4 Summary

This chapter discussed ways to read and write data in and out of R. Using the functions and packages summarized in Table 3-1, you will be able to access and share data from many sources of information.
Table 3-1 Chapter Summary

Idea            What It Means
<-              Assignment operator.
fread()         CSV file read-in function from the data.table package.
head()          Displays the first six rows of a data set.
read_excel()    Excel XLSX or XLS file read-in function from the readxl package.
readRDS()       RDS (native R format) file read-in function.
tail()          Displays the last six rows of a data set.
read_spss()     SPSS SAV file read-in function from the haven package.
read_stata()    Stata DTA file read-in function from the haven package.
read_sas()      SAS SAS7BDAT file read-in function from the haven package.
fwrite()        CSV file write-out function from the data.table package.
write_xlsx()    Excel XLSX file write-out function from the writexl package.
saveRDS()       RDS (native R format) file write-out function.

3.5 Practice for Mastery

Check your progress and grow through practice by working through some exercises. Comprehension checks ask critical thinking questions that may be best answered with a written or verbal response. Part of the art of statistics is successfully communicating results to your stakeholders or audience. Sometimes that audience is highly technical, and other times it is not technical at all. Exercises are more direct applications of the concepts explored in the chapter.

Comprehension Checks

  1. Ask around! Question your classmates, colleagues, or friends about the types of data they use. What file formats are common? What is the column and row organization of their data?

Exercises

  1. Using the techniques learned in this chapter, write the penguins data set to a .csv file.

  2. Using the techniques learned in this chapter, write the mtcars data set to a .xlsx file.

  3. There are many datasets available to the public (although not all are free to use in a book). Find a dataset from your company, your local government, or some other group that interests you. Download the file, and use one of the functions you learned in this chapter to read the data into R.