Using R for statistical analysis

For this example, we will use the datasets package, which has a wide variety of sample datasets. In RStudio, use the Packages tab to verify that the datasets package is in the list and has a checkmark next to it. The Packages tab is in the lower-right pane of the RStudio interface.

Import the NOAA 15-minute Colorado precipitation dataset that we have been working with by clicking the Import Dataset button and selecting From CSV.... Navigate to the file, review the data column preview and the code preview, and then click the Import button.

Note the code that was generated for you, which loads the data. You can also use this as a template to load in data files without going through the GUI menu:

#Bring in the code library for reading in text files
library(readr)

#Load the 15 minute precipitation dataset
NOAA15minPrecipColorado <- read_csv("~/Downloads/NOAA15minPrecipColorado.csv")

#Show the dataset in an RStudio window
View(NOAA15minPrecipColorado)

The code loads the data file and then calls the View() function to display it in a table window. You will see the data in the upper-left pane. In the bottom-left pane (the console), review any warnings or errors reported while the dataset was loaded and parsed. The console output also shows how R interpreted each data column. You will need to decide whether the data errors are tolerable for the analysis you are doing. If not, some additional formatting and parsing will be needed.
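The same checks can be made from code rather than by reading the console. The following is a minimal sketch, assuming the readr functions spec() and problems() and a small made-up CSV file. The file contents, the toy_path variable, and the STATION column are invented for illustration; only the QGAG and QPCP column names come from the NOAA dataset:

```r
#Bring in the code library for reading in text files
library(readr)

#Write a tiny made-up CSV to a temporary file (values are invented)
toy_path <- tempfile(fileext = ".csv")
writeLines(c("STATION,QGAG,QPCP",
             "A,0.10,0.10",
             "B,abc,0.20",
             "C,0.30,0.00"), toy_path)

#Ask for numeric QGAG and QPCP columns so unparseable cells get reported
toy <- read_csv(toy_path,
                col_types = cols(STATION = col_character(),
                                 QGAG = col_double(),
                                 QPCP = col_double()))

#Show the column types readr used for this file
spec(toy)

#List each cell that could not be parsed as the requested type
parse_issues <- problems(toy)
print(parse_issues)

#Cells that failed to parse are stored as NA; count them per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```

On the real dataset, you would call spec() and problems() on NOAA15minPrecipColorado instead of the toy data frame.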

Run the following code to generate summary statistics on each column in the dataset:

#Run summary statistics on the data frame the file was loaded into
summary(NOAA15minPrecipColorado)

The resulting summary will look like the following:

Note the extreme values for Min. and Max. that we have already discovered in QGAG and QPCP. We decided they are probably indicator flags and not actual measurements. We can remove those values with some data manipulation. If you do not already have the R package dplyr, you can install it either by running the following code or by using the RStudio GUI. The dplyr package is very useful for data manipulation and is used frequently for analytics with R:

#Install the package for dplyr on your laptop
install.packages("dplyr")
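If you rerun a script often, you may prefer to install a package only when it is missing. The following is one possible sketch; install_if_missing is a hypothetical helper name, not part of R or dplyr, and it relies only on the base functions requireNamespace() and install.packages():

```r
#Hypothetical helper: install a package only if it is not already available
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  #Report (invisibly) whether the package can now be loaded
  invisible(requireNamespace(pkg, quietly = TRUE))
}

#Usage: left commented out so that the sketch has no side effects
#install_if_missing("dplyr")
```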

After dplyr is installed, run the following code to filter out data rows that have the extreme values for either QGAG or QPCP. The code will also rerun the summary statistics:

#Bring in the code library for dplyr
library(dplyr)

#Keep only records where QPCP and QGAG are both greater than 0 and less than 900
NOAAfiltered <- filter(NOAA15minPrecipColorado, QPCP > 0, QPCP < 900, QGAG > 0, QGAG < 900)

#Run summary statistics on the filtered copy of the dataset
summary(NOAAfiltered)

The summary should now look like the following. Note the averages, medians, and quartiles are now more representative of real-world precipitation values:
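To see why the filter changes the statistics so much, the effect can be reproduced on a handful of made-up values. This is only an illustrative sketch: the toy data frame and its numbers are invented, including the stand-in flag values 999.99 and -9999, and it applies the same filter conditions as the code above:

```r
#Bring in the code library for dplyr
library(dplyr)

#Made-up readings; 999.99 and -9999 stand in for indicator flags
toy <- data.frame(
  QGAG = c(0.10, 0.20, 999.99, 0.30, -9999),
  QPCP = c(0.10, 0.10, 0.20, 999.99, 0.10)
)

#Summary statistics before filtering are dominated by the flag values
summary(toy)

#Apply the same conditions used on the full dataset
toy_filtered <- filter(toy, QPCP > 0, QPCP < 900, QGAG > 0, QGAG < 900)

#Summary statistics after filtering reflect only the plausible readings
summary(toy_filtered)
```

Only rows where both columns pass the conditions are kept, so a flag in either column removes the whole row.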

In practice, when you have missing or invalid values, you need to carefully review the dataset and decide whether it is better to remove the entire data record or to replace the value with something else. When you filter out a row, you also lose its other fields, which may hold valid values. Your decision will depend on the type of analysis you are doing. Think carefully.
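As one alternative to removing rows, the suspect values can be replaced with NA so that the rest of each record is kept. The following base R sketch assumes the same cutoffs as the filter above; the toy data frame, the STATION column, and the in_range helper are hypothetical and used only for illustration:

```r
#Made-up records; 999.99 and -9999 stand in for indicator flags
toy <- data.frame(
  STATION = c("A", "B", "C"),
  QGAG = c(0.10, 999.99, 0.30),
  QPCP = c(0.10, 0.20, -9999)
)

#Hypothetical helper mirroring the filter condition used earlier
in_range <- function(x) x > 0 & x < 900

#Replace out-of-range values with NA instead of dropping the row
toy_na <- toy
toy_na$QGAG[!in_range(toy_na$QGAG)] <- NA
toy_na$QPCP[!in_range(toy_na$QPCP)] <- NA

#All three records remain; summary() reports the NA count per column
summary(toy_na)

#Statistics must then skip the NA values explicitly
mean(toy_na$QGAG, na.rm = TRUE)
```

With this approach, station B keeps its valid QPCP reading even though its QGAG value was flagged.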
