Chapter 3. Working with Tables

Working with tables is central to programming in R, both with regards to spatial analysis (for example, working with attribute tables of geometries) and more generally. In this chapter, we will learn how to work with tables on their own, while in the subsequent chapters, we will see the ways that spatial data analysis involves dealing with tables. At the same time, two central subjects, which we will have to be familiar with for the subsequent chapters, will be introduced. These are working with contributed packages in R and controlling code execution.

As a central example, we will work with real-world data (monthly climatic records for Spain, which were downloaded from the NOAA archive) so that we can witness several very common cleaning and reshaping procedures of tables.

In this chapter, we'll cover the following topics:

  • Working with data.frame objects to represent tables in R
  • Controlling code execution through conditional statements and loops
  • Automated calculations on tables and vectors using the apply functions
  • Installing and using contributed packages in R
  • Reshaping tables into different forms
  • Joining tables

Using the data.frame class to represent tabular data

In this section, you will learn how tables are represented in R and how you can work with tabular objects. In particular, you will learn two common ways to create table objects (from vectors or by reading a file from the disk). Afterwards, you will learn how to examine, subset, and make calculations with tables.

Creating a table from separate vectors

The data.frame class is the basic class to represent tabular data in R. A data.frame object is essentially a collection of vectors, all with the same length. However, the vectors do not have to be of the same type. They may also include one-dimensional objects that are not strictly vectors, such as Date or factor objects (see the previous chapter). Therefore, data.frame objects are particularly suitable to represent data with different variables in columns and different cases in rows. Thus, variables may be of different types; for example, a table storing climatic data may have one character variable to store meteorological station names, another Date variable to represent measurement dates, and a third numeric variable to represent the measured values such as rainfall amounts or temperatures.

One way to create a data.frame object is to combine several vectors that are already present in the R environment. This can be achieved with the data.frame function with the arguments being the names of the vector objects we would like to combine. Let's take a look at the following examples:

> num = 1:4
> lower = c("a","b","c","d")
> upper = c("A","B","C","D")
> df = data.frame(num, lower, upper)
> df
  num lower upper
1   1     a     A
2   2     b     B
3   3     c     C
4   4     d     D

Here, we created a data.frame object named df by combining the vectors num, lower, and upper. The previously independent vectors now comprise columns in df. As we can see, the names of the columns appear on the first line of the printed output of a data.frame object. These are the names of the original vectors, num, lower, and upper. Rows have names as well; these are automatically assigned with the characters 1, 2, 3, and 4 (as it appears to the left of the first column in the printed output).

Note

We can also create the data.frame object in a single step by performing the vector assignments within the data.frame function call itself:

> df = data.frame(
+ num = 1:4,
+ lower = c("a","b","c","d"),
+ upper = c("A","B","C","D"))

Note that in this example, vector types were different (num is numeric, whereas lower and upper are characters). However, the vectors had the same length; otherwise, an error would have occurred since all columns of data.frame must have the same length.

Note

An important parameter of the data.frame function (and several other functions such as read.csv, which will be introduced in the next section) is stringsAsFactors. The stringsAsFactors parameter controls whether character columns are automatically converted to factors (the default value is TRUE). Within the context of this book, we would usually like to keep the character vectors as characters for greater flexibility (we can always make the conversion to factors ourselves when necessary with the factor function; see the previous chapter for more information). Therefore, a function call preserving character columns will be as follows:

> df = data.frame(num, lower, upper, 
+ stringsAsFactors = FALSE)
> df
  num lower upper
1   1     a     A
2   2     b     B
3   3     c     C
4   4     d     D

The way our table is printed on screen is identical when compared to the previous example. However, using methods, which will be introduced later, we will be able to see that columns 2 and 3 now consist of character vectors rather than factors.

We can add rows or columns to an existing data.frame object using the rbind (row bind) and cbind (column bind) functions, respectively. For example, we can add a fifth row to our df table using rbind as follows:

> row5 = c(5,"e","E")
> rbind(df, row5)
  num lower upper
1   1     a     A
2   2     b     B
3   3     c     C
4   4     d     D
5   5     e     E

Alternatively, we could add a fourth column using cbind as follows:

> word = c("One","Two","Three","Four")
> cbind(df, word, stringsAsFactors = FALSE)
  num lower upper  word
1   1     a     A   One
2   2     b     B   Two
3   3     c     C Three
4   4     d     D  Four

In the previous example, we had to specify, once again, that we do not want the character vector, word, to be converted into a factor vector.

Creating a table from a CSV file

Another common method to create a data.frame object is to read tabular data from the disk. For example, we can read a CSV file using the read.csv function (which was briefly mentioned earlier). The first parameter of this function, and the one with no defaults, is a file indicating the path to the CSV file. For example, the following expression reads the contents of the 343452.csv file and assigns it to a data.frame object called dat (remember that directories should be separated with \ or /):

> dat = read.csv("C:\Data\343452.csv")

The 343452.csv file contains monthly records of precipitation, minimum temperature, and maximum temperature from Spain for a period of 30 years. It was downloaded from the NOAA climatic archive and provided as is. Since we will use data from this file in several of our examples, in this and the upcoming chapters, let's examine its contents. Because the table is very large, to see what it looks like, we can print only the first several rows with the head function, as follows (similarly, with the tail function, we can print the several last rows):

> head(dat)
            STATION STATION_NAME ELEVATION LATITUDE LONGITUDE
1 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
2 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
3 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
4 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
5 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
6 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
      DATE TPCP MMXT MMNT
1 19840101  514   56   -5
2 19840201    0   98   28
3 19840301  687   72    2
4 19840401  136  122   35
5 19840501  214  118   30
6 19840601   65  196  105

The column's contents are as follows:

  • STATION: This is the meteorological station identification code
  • STATION_NAME: This is the meteorological station name
  • ELEVATION: This is the elevation of the station above sea level (meters)
  • LATITUDE: This is the latitude of the station (decimal degrees)
  • LONGITUDE: This is the longitude of the station (decimal degrees)
  • DATE: This is the date of measurement
  • TPCP: This is the total monthly precipitation (0.1 mm units)
  • MMXT: This is the mean monthly maximum temperature (0.1 degree Celsius units)
  • MMNT: This is the mean monthly minimum temperature (0.1 degree Celsius units)

The complete documentation for the CSV file is provided along with the data from NOAA (and can also be downloaded from the book's website).

According to what you learned in the previous chapter, the dates are recorded in the %Y%m%d format (in Date objects terminology). However, since the data is monthly, the day component is not informative (we can see that all days are coded as 01). The missing values of measurements are marked as -9999, a commonly encountered convention.

Examining the structure of a data.frame object

We can get the number of rows and columns in our data.frame object using the nrow and ncol functions, respectively. For example, our small table df has four rows and three columns, while dat (containing the monthly climatic data) has 28,536 rows and nine columns:

> nrow(df)
[1] 4
> ncol(df)
[1] 3
> nrow(dat)
[1] 28536
> ncol(dat)
[1] 9

We can, if the table is not too long, print the table's contents and see how many columns (or rows) are there, according to the row names. However, it is generally advisable to get the properties of an object using functions (such as ncol), rather than typing a specific number manually (such as 9). This way, our code is going to be transferable to an analysis of any object and not just the specific object we are currently working on.

We can get the lengths of both row and column dimensions using the dim function. If our argument is a data.frame object (we will see later that the dim function works with other classes as well; such a function is called a generic function in R terminology), a vector of length 2 is returned with the first element being the number of rows and the second being the number of columns, as follows:

> dim(dat)
[1] 28536     9

We can also get the names of the rows and columns (getting column names is often more useful) as a character vector using the functions rownames and colnames, as shown in the following example:

> colnames(dat)
[1] "STATION"      "STATION_NAME" "ELEVATION"    "LATITUDE"    
[5] "LONGITUDE"    "DATE"         "TPCP"         "MMXT"        
[9] "MMNT"

Assignment into column names can be made to replace the existing names with new ones. For example, to change the name of the third column from ELEVATION to Elev, we can use the colnames(dat)[3]="Elev" expression. Similarly, we can convert all column names of the data.frame object from uppercase to lowercase using the tolower function so that it will be easier to type:

> colnames(dat) = tolower(colnames(dat))
> colnames(dat)
[1] "station"      "station_name" "elevation"    "latitude"    
[5] "longitude"    "date"         "tpcp"         "mmxt"        
[9] "mmnt"

It is frequently useful to examine the structure of a given object using the str function. This function (which is also generic) prints the structure of its argument showing the data types of its components and the relations between them. In the case of a data.frame object, a list of the column names and types is printed, along with the table dimensions, and the first several values (or all values, if the table is very short). For example, the output for the small table df shows that we have a table with three columns (the variables) and four rows (the observations). It also shows that the first column is numeric and the last two are characters. Here is how the output looks like:

> str(df)
'data.frame':   4 obs. of  3 variables:
 $ num  : int  1 2 3 4
 $ lower: chr  "a" "b" "c" "d"
 $ upper: chr  "A" "B" "C" "D"

Subsetting data.frame objects

There are two principal ways to create a subset of a data.frame object. The first involves accessing separate columns, using the column names, with the $ operator. The second involves providing the two vectors of indices, names or logical values, with the [ operator.

Using the $ operator, we can gain access to separate columns in a data.frame object. To do this, we simply insert the name of the data.frame to the left of the $ operator and the name of the required column to the right, as follows:

> df$num
[1] 1 2 3 4
> df$lower
[1] "a" "b" "c" "d"
> df$upper
[1] "A" "B" "C" "D"

Since the columns of a data.frame object are basically vectors, we can employ all the previously presented vector methods in columns of a data.frame object the same way we would in independent vectors. For example, we can replace the -9999 values (which mark the missing data) with NA, for each of the three measured variables in dat, as follows:

> dat$tpcp[dat$tpcp == -9999] = NA
> dat$mmxt[dat$mmxt == -9999] = NA
> dat$mmnt[dat$mmnt == -9999] = NA

The only difference from how we did this operation in the previous chapter is the dat$ part. This means that we refer to columns of the data.frame object (dat), rather than independent vectors. Now, let's convert the tpcp values to mm units and mmxt and mmnt values to degree Celsius units by dividing each value in the respective columns by 10, as follows:

> dat$tpcp = dat$tpcp / 10
> dat$mmxt = dat$mmxt/ 10
> dat$mmnt = dat$mmnt / 10

Note that if we would have made the division by 10 before encoding the -9999 values as NA, we would have got the -999.9 values, while now that we have NA values, they will remain NA since NA/10 gives NA. This highlights the importance of representing missing data with NA to reduce the chance of mistakes.

Using the [ operator, we can obtain a subset of a data.frame object, which will include the intersection of any number of rows and columns. This works the same way as vectors subsetting with one difference, that is, a data.frame is a two-dimensional object while a vector is a one-dimensional object. Therefore, we need to provide two indices rather than just one.

Remember that when subsetting with the [ operator in data.frame, the first index refers to rows and the second index refers to columns. This arrangement is going to appear in other contexts as well (for example, in matrices and rasters).

The two vectors of indices, used in order to create a subset of a data.frame object, can include any combination of the following:

  • A numeric vector, in which case the numeric vector refers to the indices of rows/columns to retain in the subset
  • A character vector, in which case the character vector indicates the names of rows/columns to retain
  • A logical vector, in which case the logical vector indicates whether to retain each row/column of data.frame

We have been extensively using methods 1 and 3 in vectors (see the previous chapter), so extending the methods to the two-dimensional case should be intuitive. In fact, method 2 can also be used with vectors since vectors can have element names the same way that a data.frame object has row names and column names (but we are not going to use that here).

For example, the following expression gives us the element populating the second row and the third column of df:

> df[2, 3]
[1] "B"

Leaving an empty space instead of the row's or column's index indicates we are interested in all the elements of the respective dimension (all rows or all columns). For example, the following expressions return all elements of the second row and the third column of df:

> df[2, ]
  num lower upper
2   2     b     B
> df[ ,3]
[1] "A" "B" "C" "D"

By default, a subset of a data.frame object is converted into a simpler class if values from a single column are involved. For example, the df[2, 3] and df[ ,3] expressions returned (character) vectors. The df[2, ] expression returned a data.frame object since three columns are involved. In fact, the second row of df contains both numeric and character values, while we already know that a vector can only contain values of the same type. If we wish, we can suppress the data.frame simplification by using the drop parameter, indicating FALSE (instead of the default value, TRUE), and then the subset will remain a data.frame object no matter what:

> df[ ,3, drop = FALSE]
  upper
1     A
2     B
3     C
4     D

Compare the output to the one from the previous example. Using drop=FALSE in the previous expression, we got a data.frame object (with four rows and one column) instead of a vector.

The other two methods of subsetting a data.frame object are using logical and character vectors as indices. Let's take a look at the following example:

> df[df$lower %in% c("a","d"), c("lower","upper")]
  lower upper
1     a     A
4     d     D

In this expression, if we put it in plain language, we are requesting to get the subset of df with the rows being where the values of the lower column are either "a" or "d", and the columns are both lower and upper.

One very helpful function to use with data.frame objects is complete.cases. This function returns a logical vector, the same length as the number of rows in the data.frame object, indicating whether each row (case) is complete (has no NA values in it). When a given row is complete, the respective element in the logical vector will be TRUE; when a row is incomplete, the value will be FALSE. Then, the resulting logical vector can be used to remove the incomplete rows from a table as follows (note that the output is not printed here to save space):

> dat[complete.cases(dat), ]

Note that in this expression, we use the vector returned by complete.cases as a logical vector indicating the selection of rows in dat to be retained.

Calculating new data fields

As previously shown, we can assign new values to a column of a table (or to a subset of a column) using the $ operator. If the column name we assign does not exist in the table, a new column will be created to accommodate the data. Let's take a look at the following examples:

> df
  num lower upper
1   1     a     A
2   2     b     B
3   3     c     C
4   4     d     D
> df$word[df$num == 2] = "Two"
> df
  num lower upper word
1   1     a     A <NA>
2   2     b     B  Two
3   3     c     C <NA>
4   4     d     D <NA>

Here, we made an assignment of the character value "Two" to a subset of the word column (which did not previously exist in df) corresponding to the rows where the value of column num is equal to 2. As a result, a new column has been created, which contains the assigned value (and NA for all the other elements). Note that <NA> is simply the character representation of NA (the column word is a character vector).

As another example with our climatic data, we are going to create two new columns, holding the year and month of each measurement. For this purpose, we will first convert the date column to a Date object. Then, we will extract the years and months from the data in this column as follows (see the previous chapter for details):

> dat$date = as.Date(as.character(dat$date), format = "%Y%m%d")
> dat$month = as.numeric(format(dat$date, "%m"))
> dat$year = as.numeric(format(dat$date, "%Y"))
> head(dat)
            station station_name elevation latitude longitude
1 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
2 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
3 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
4 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
5 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
6 GHCND:SP000060010     IZANA SP      2371  28.3089  -16.4992
        date tpcp mmxt mmnt month year
1 1984-01-01 51.4  5.6 -0.5     1 1984
2 1984-02-01  0.0  9.8  2.8     2 1984
3 1984-03-01 68.7  7.2  0.2     3 1984
4 1984-04-01 13.6 12.2  3.5     4 1984
5 1984-05-01 21.4 11.8  3.0     5 1984
6 1984-06-01  6.5 19.6 10.5     6 1984

Using the first expression, we converted the dat$date vector to a Date object (and assigned it back to dat$date). In the second and third expressions, we extracted the month and year components, as numeric vectors, out of dat$date and assigned them to the new columns, dat$month and dat$year, respectively.

Writing a data.frame object to a CSV file

A data.frame object can be written to a CSV file with the write.csv function. The two first (and most important) parameters for this function indicate the name of the data.frame object, which we would like to save, and the path to the new file (including the new filename). These parameters have no defaults, so we need to specify them. For example, the following expression writes the data.frame object df to the df.csv file in the C:Data directory:

> write.csv(df, "C:\Data\df.csv")

The newly created file when opened in Excel looks like the following screenshot:

Writing a data.frame object to a CSV file

Note that row names (the numbers 1 to 4 in column A) and column names have been added; this behavior can be disabled when required.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.145.191.134