In general, when performing a statistical analysis, the majority of time spent ends up being dealing with data. This involves successfully inputting data as well as organizing or cleaning data. For now, our goal is to recognize various types of data files and use those types to import and export data to and from R.
Recognize common data file types by suffix.
Apply R packages and functions to import data into R.
Apply R packages and functions to export data out of R.
3.1 Setup
In order to practice inputting and outputting the various types of files you may use, we need to do some setup. In order to continue practicing creating and using projects, we will start a new project for this chapter. Inside that project, we also will create some folders and download some data files from this book’s GitHub site.
If necessary, review the steps in Chapter 1 to create a new project. After starting RStudio, on the upper-left menu ribbon, select File and click New Project. Choose New Directory ➤ New Project, with Directory name 03DataIO, and select Create Project. To create an R script file, on the top ribbon, right under the word File, click the small icon with a plus sign on top of a blank bit of paper, and select the R Script menu option. Click the floppy disk–shaped save icon, name this file DataIO.R, and click Save.
In the lower-right pane, you should see your project’s two files, and right under the Files tab, click the button titled New Folder. In the New Folder pop-up, type data and select OK. In the lower-right pane, click your new data folder. Repeat the folder creation process, making a new folder titled ch03.
Next, you need to download some files from this book’s site. They should live on GitHub at http://github.com/apress/beginning-r-4. However, the permanent link can also be found on the website http://mattwiley.org/publications.html by clicking Code for Beginning R. Navigate to this chapter’s folder and download the Texas counties data [3] as well as the American Community Survey [4] data into the ch03 folder you just created. In all, you should have six files in your project’s subfolders.
With that set up, you are ready to practice importing and exporting data.
If you need any refresher or quick tips on where to find things in RStudio as you go through the chapter, look at this short “cheat sheet” which has visuals and labels for most of RStudio’s features: https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf.
3.2 Input
When inputting files to R , the usual way data arrive is in some sort of table format where there are named columns or variables and numbered rows or observations. Each column thus represents a category, and each row represents a specific measurement of those variables.
The usual challenges to inputting these types of “mostly decent” data tend to be successfully reading the type of file storing the data and adapting to any unique aspects of different data files (such as the first few rows being of a different format than expected). An example of such a foible might be that a data gatherer/creator may include some notes or comments in the first few lines of an Excel file. It is only after these notes that the actual columns and rows of data start.
When we say the type of file, we mean things like an Excel spreadsheet or a comma-separated text file or a dataset from another proprietary piece of software. Almost always, the type of a file is indicated by the extension. The file extension are the last few characters of the file name after the period. For example, an Excel file might be called “data.xlsx”. In this case, the file name is “data” and the extension is “.xlsx”; and the extension, the “.xlsx” part, is what tells us that it is an Excel file. Other common extensions are “.csv” which means a text file with cells separated by commas and “.rds” which is R’s own format for storing data. All of these can be told by looking at the file extension.
Thus, our first task is to identify the correct function (and package) to read in our data as well as determine if we need to skip some rows. The majority of functions written to input data have at least three variables as options that include the location of the data file, the number of rows to skip, and whether the data has a header with column names. It is worth noting most data input functions generally default to assuming the first row is a column name header and no rows should be skipped.
Most input functions include many options – and in general most of those do not need to be used. In fact, the easiest way to take care of some messiness in a file is often to open it yourself by hand, fix it manually, and then import into R. On the other hand, if you routinely get a file (perhaps some sort of daily update from some company or third party) and that file is predictable in the way it does not work nicely, it may be worth exploring additional function options to code an automatic read.
Manual Entry
While we usually focus on input as something that involves some external file to read in, it is possible to input data by hand in R. This can be useful for several scenarios, including setting up some variables all the way to more extensive data input. In R, the assignment operator is <-. These two symbols together, the less than character and the -, are both found on the right-hand side of your keyboard, near M and P, respectively. It helps to give our data a name that makes sense; for example, if using the year 2020 as a variable, we may call it academicYear.
To help code be pronounceable, we often make data variables nouns. For the same reason, functions are often verbs (e.g., plot()).
While manual assignment is simple, it would be very tedious to type in all the data we would ever use directly into R. We move on to using some the packages we installed in Chapter 2.
CSV: .csv
Perhaps the most common data type is a table of information stored in the comma-separated values format which has the file extension .csv. While R can read such files natively, the data.table package has a function named fread() that reads in CSV files very fast indeed. This function has many inputs or arguments that are possible, although we only discuss three now. This function’s first input or argument is file = "". Because an RStudio project defaults to the project as the working directory (which we showed how to set in Chapter 1), we need only give the folder file path data/ch03/ and the file name Counties_in_Texas.csv to this argument. The second argument, header, defaults to an automatic feature which usually does a good job of guessing if there is a header. Should you know your file does or does not have a header, then you can manually set this argument to either TRUE or FALSE. The third argument we discuss is skip. It takes an integer (i.e., whole number, without any decimal places, like 0, 1, 2, 3) input that is the number of rows you wish to skip. The default is zero; therefore, like header, you can often safely ignore this argument. You cannot skip part of a row, so something like skip = 1.25 or skip = 3.5 is not valid.
Notice that after running the preceding line of code, while there is no output to the Console, our Environment pane definitely has 254 observations of 10 variables. You may click the arrow icon just left of csvData, to see the types of data just read. For example, CountyName is of type chr which stands for character data, while CountyNumber is of type int for integer or whole number. Look at https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf for more ways to interact with data through RStudio rather than just through code.
With 254 observations or rows, our data might be a bit long to visually inspect all rows. A newer way to explore data can be seen in your top-right Environment pane. There are both a dot with an arrow in it on the left side of each data set and a square table grid on the right side. Clicking the left-side dot and arrow will allow one to see what types of data are in each data set. In addition to showing the column names, it will also show the types of data (e.g., integer or Date or numeric). Another option is to click the square table grid on the far right. This will open the data set for viewing in a new tab. Closing that tab should get you back to your code area.
While it may not quite seem like it yet, you are now also doing statistics. Exploratory data analysis (EDA) is the first step in most analytics. It may not be fancy, but head() is an example of such exploration.
Excel: .xlsx or .xls
Microsoft’s Excel format is another common data format. The read_excel() function from the readxl package inputs such files and again has some useful default options. In addition to the usual three features of file path, column names, and skip, we discuss a fourth argument of sheet. Excel files often have more than one sheet. The default is to the first sheet, although you may enter either a number showing which order your sheet is in or quote text string which names the sheet (Excel often defaults to “Sheet1” for the first sheet).
When using functions written by different package authors, some default settings, and some choices, are often not the same. When reading data into R, it is generally a wise step to inspect your data and understand a little more about just what has been read. Another difference you may notice is that our Excel data mentions the word tibble. We save an in-depth discussion of data structure choices for our Advanced R: Data Programming and the Cloud [23] book. While it is good to notice this difference, we shall ignore it for now.
RDS: .rds
We started off showing how to read in data sets from CSV or Excel files, because those are probably the most common way of storing and sharing data sets. They are popular in part because almost everyone on any computer can open and see them, without needing extra software they do not already have.
Other Proprietary Formats
There are several other pieces of statistical software that are often used to store and analyze data. If you have a colleague who uses one of these programs, you may get a data file that uses that software’s data format. We are going to look at three of these very briefly, so that if you ever come across such data, you know how to get them into R.
SPSS: .sav
Stata: .dta
SAS: .sas7bdat
3.3 Output
Getting data into R is usually the first stage of an analysis project. Writing the data out of R to share with stakeholders or collaborators is usually one of the last stages. In an R project, it is often a good idea to separate data we read in from the data we write out. We already created a folder named data for the files to input. Now, we create a new folder, using the same New Folder icon in the lower-right pane, named output.
Any data that is in R is able to be written out. It does not matter where those data come from, only that they are in our global environment. Thus, options for writing out in the following examples include mtcars [1], penguins [11], and csvData, excelData, rData, sasData, spssData, or stataData. These are all part of our current instance of R. Because the mtcars and penguins data sets are built into packages, they do not show up in the Environment tab, but can still be written. In other words, we could write the sasData to a CSV file, and all would be well. While it is not an “official” feature of R, it can be used to convert between file types. To avoid having weird data calls, we will use the R data.
CSV
Excel
RDS
3.4 Summary
Chapter Summary
Idea | What It Means |
---|---|
<- | Assignment operator. |
fread() | CSV file read-in function from the data.table package. |
head() | Displays the first six rows of a data set. |
read_excel() | Excel XLSX or XLS file read-in function from the readxl package. |
readRDS() | RDS (native R format) file read-in function. |
tail() | Displays the last six rows of a data set. |
read_spss() | SPSS SAV file read-in function from the haven package. |
read_stata() | Stata DTA file read-in function from the haven package. |
read_sas() | SAS SAS7BDAT file read-in function from the haven package. |
fwrite() | CSV file write-out function from the data.table package. |
write_xlsx() | Excel XLSX file write-out function from the writexl package. |
saveRDS() | RDS (native R format) file write-out function. |
3.5 Practice for Mastery
Check your progress and grow through practice by working through some exercises. Comprehension checks ask critical thinking questions that may be best answered with a written or verbal response. Part of the art of statistics is successfully communicating results to your stakeholders or audience. Sometimes that audience is highly technical and other times very much not technical. Exercises are more direct applications of the concepts explored in the chapter.
Comprehension Checks
- 1.
Ask around! Question your classmates, colleagues, or friends about the types of data they use. What file formats are common? What is the column and row organization of their data?
Exercises
- 1.
Using the techniques learned in this chapter, write the penguins data set to a .csv file.
- 2.
Using the techniques learned in this chapter, write the mtcars data set to a .xlsx file.
- 3.
There are many datasets available to the public (although not all are free to use in a book). Find a dataset from your company, your local government, or some other group that interests you. Download the file, and use one of the functions you learned in this chapter to read the data into R.