Working with tables is central to programming in R, both with regards to spatial analysis (for example, working with attribute tables of geometries) and more generally. In this chapter, we will learn how to work with tables on their own, while in the subsequent chapters, we will see the ways that spatial data analysis involves dealing with tables. At the same time, two central subjects, which we will have to be familiar with for the subsequent chapters, will be introduced. These are working with contributed packages in R and controlling code execution.
As a central example, we will work with real-world data (monthly climatic records for Spain, which were downloaded from the NOAA archive) so that we can witness several very common cleaning and reshaping procedures of tables.
In this chapter, we'll cover the following topics:
data.frame
objects to represent tables in Rapply
functionsIn this section, you will learn how tables are represented in R and how you can work with tabular objects. In particular, you will learn two common ways to create table objects (from vectors or by reading a file from the disk). Afterwards, you will learn how to examine, subset, and make calculations with tables.
The data.frame
class is the basic class to represent tabular data in R. A data.frame
object is essentially a collection of vectors, all with the same length. However, the vectors do not have to be of the same type. They may also include one-dimensional objects that are not strictly vectors, such as Date
or factor
objects (see the previous chapter). Therefore, data.frame
objects are particularly suitable to represent data with different variables in columns and different cases in rows. Thus, variables may be of different types; for example, a table storing climatic data may have one character variable to store meteorological station names, another Date
variable to represent measurement dates, and a third numeric variable to represent the measured values such as rainfall amounts or temperatures.
One way to create a data.frame
object is to combine several vectors that are already present in the R environment. This can be achieved with the data.frame
function with the arguments being the names of the vector objects we would like to combine. Let's take a look at the following examples:
> num = 1:4 > lower = c("a","b","c","d") > upper = c("A","B","C","D") > df = data.frame(num, lower, upper) > df num lower upper 1 1 a A 2 2 b B 3 3 c C 4 4 d D
Here, we created a data.frame
object named df
by combining the vectors num
, lower
, and upper
. The previously independent vectors now comprise columns in df
. As we can see, the names of the columns appear on the first line of the printed output of a data.frame
object. These are the names of the original vectors, num
, lower
, and upper
. Rows have names as well; these are automatically assigned with the characters 1
, 2
, 3
, and 4
(as it appears to the left of the first column in the printed output).
Note that in this example, vector types were different (num
is numeric, whereas lower
and upper
are characters). However, the vectors had the same length; otherwise, an error would have occurred since all columns of data.frame
must have the same length.
An important parameter of the data.frame
function (and several other functions such as read.csv
, which will be introduced in the next section) is stringsAsFactors
. The stringsAsFactors
parameter controls whether character columns are automatically converted to factors (the default value is TRUE
). Within the context of this book, we would usually like to keep the character vectors as characters for greater flexibility (we can always make the conversion to factors ourselves when necessary with the factor
function; see the previous chapter for more information). Therefore, a function call preserving character columns will be as follows:
> df = data.frame(num, lower, upper, + stringsAsFactors = FALSE) > df num lower upper 1 1 a A 2 2 b B 3 3 c C 4 4 d D
The way our table is printed on screen is identical when compared to the previous example. However, using methods, which will be introduced later, we will be able to see that columns 2 and 3 now consist of character vectors rather than factors.
We can add rows or columns to an existing data.frame
object using the rbind
(row bind) and cbind
(column bind) functions, respectively. For example, we can add a fifth row to our df
table using rbind
as follows:
> row5 = c(5,"e","E") > rbind(df, row5) num lower upper 1 1 a A 2 2 b B 3 3 c C 4 4 d D 5 5 e E
Alternatively, we could add a fourth column using cbind
as follows:
> word = c("One","Two","Three","Four") > cbind(df, word, stringsAsFactors = FALSE) num lower upper word 1 1 a A One 2 2 b B Two 3 3 c C Three 4 4 d D Four
In the previous example, we had to specify, once again, that we do not want the character vector, word
, to be converted into a factor
vector.
Another common method to create a data.frame
object is to read tabular data from the disk. For example, we can read a CSV file using the read.csv
function (which was briefly mentioned earlier). The first parameter of this function, and the one with no defaults, is a file indicating the path to the CSV file. For example, the following expression reads the contents of the 343452.csv
file and assigns it to a data.frame
object called dat
(remember that directories should be separated with \
or /
):
> dat = read.csv("C:\Data\343452.csv")
The 343452.csv
file contains monthly records of precipitation, minimum temperature, and maximum temperature from Spain for a period of 30 years. It was downloaded from the NOAA climatic archive and provided as is. Since we will use data from this file in several of our examples, in this and the upcoming chapters, let's examine its contents. Because the table is very large, to see what it looks like, we can print only the first several rows with the head
function, as follows (similarly, with the tail
function, we can print the several last rows):
> head(dat) STATION STATION_NAME ELEVATION LATITUDE LONGITUDE 1 GHCND:SP000060010 IZANA SP 2371 28.3089 -16.4992 2 GHCND:SP000060010 IZANA SP 2371 28.3089 -16.4992 3 GHCND:SP000060010 IZANA SP 2371 28.3089 -16.4992 4 GHCND:SP000060010 IZANA SP 2371 28.3089 -16.4992 5 GHCND:SP000060010 IZANA SP 2371 28.3089 -16.4992 6 GHCND:SP000060010 IZANA SP 2371 28.3089 -16.4992 DATE TPCP MMXT MMNT 1 19840101 514 56 -5 2 19840201 0 98 28 3 19840301 687 72 2 4 19840401 136 122 35 5 19840501 214 118 30 6 19840601 65 196 105
The column's contents are as follows:
STATION
: This is the meteorological station identification codeSTATION_NAME
: This is the meteorological station nameELEVATION
: This is the elevation of the station above sea level (meters)LATITUDE
: This is the latitude of the station (decimal degrees)LONGITUDE
: This is the longitude of the station (decimal degrees)DATE
: This is the date of measurementTPCP
: This is the total monthly precipitation (0.1 mm units)MMXT
: This is the mean monthly maximum temperature (0.1 degree Celsius units)MMNT
: This is the mean monthly minimum temperature (0.1 degree Celsius units)The complete documentation for the CSV file is provided along with the data from NOAA (and can also be downloaded from the book's website).
According to what you learned in the previous chapter, the dates are recorded in the %Y%m%d
format (in Date
objects terminology). However, since the data is monthly, the day component is not informative (we can see that all days are coded as 01
). The missing values of measurements are marked as -9999
, a commonly encountered convention.
We can get the number of rows and columns in our data.frame
object using the nrow
and ncol
functions, respectively. For example, our small table df
has four rows and three columns, while dat
(containing the monthly climatic data) has 28,536 rows and nine columns:
> nrow(df) [1] 4 > ncol(df) [1] 3 > nrow(dat) [1] 28536 > ncol(dat) [1] 9
We can, if the table is not too long, print the table's contents and see how many columns (or rows) are there, according to the row names. However, it is generally advisable to get the properties of an object using functions (such as ncol
), rather than typing a specific number manually (such as 9
). This way, our code is going to be transferable to an analysis of any object and not just the specific object we are currently working on.
We can get the lengths of both row and column dimensions using the dim
function. If our argument is a data.frame
object (we will see later that the dim
function works with other classes as well; such a function is called a generic function in R terminology), a vector of length 2 is returned with the first element being the number of rows and the second being the number of columns, as follows:
> dim(dat) [1] 28536 9
We can also get the names of the rows and columns (getting column names is often more useful) as a character vector using the functions rownames
and colnames
, as shown in the following example:
> colnames(dat) [1] "STATION" "STATION_NAME" "ELEVATION" "LATITUDE" [5] "LONGITUDE" "DATE" "TPCP" "MMXT" [9] "MMNT"
Assignment into column names can be made to replace the existing names with new ones. For example, to change the name of the third column from ELEVATION
to Elev
, we can use the colnames(dat)[3]="Elev"
expression. Similarly, we can convert all column names of the data.frame
object from uppercase to lowercase using the tolower
function so that it will be easier to type:
> colnames(dat) = tolower(colnames(dat)) > colnames(dat) [1] "station" "station_name" "elevation" "latitude" [5] "longitude" "date" "tpcp" "mmxt" [9] "mmnt"
It is frequently useful to examine the structure of a given object using the str
function. This function (which is also generic) prints the structure of its argument showing the data types of its components and the relations between them. In the case of a data.frame
object, a list of the column names and types is printed, along with the table dimensions, and the first several values (or all values, if the table is very short). For example, the output for the small table df
shows that we have a table with three columns (the variables) and four rows (the observations). It also shows that the first column is numeric and the last two are characters. Here is how the output looks like:
> str(df) 'data.frame': 4 obs. of 3 variables: $ num : int 1 2 3 4 $ lower: chr "a" "b" "c" "d" $ upper: chr "A" "B" "C" "D"
There are two principal ways to create a subset of a data.frame
object. The first involves accessing separate columns, using the column names, with the $
operator. The second involves providing the two vectors of indices, names or logical values, with the [
operator.
Using the $
operator, we can gain access to separate columns in a data.frame
object. To do this, we simply insert the name of the data.frame
to the left of the $
operator and the name of the required column to the right, as follows:
> df$num [1] 1 2 3 4 > df$lower [1] "a" "b" "c" "d" > df$upper [1] "A" "B" "C" "D"
Since the columns of a data.frame
object are basically vectors, we can employ all the previously presented vector methods in columns of a data.frame
object the same way we would in independent vectors. For example, we can replace the -9999
values (which mark the missing data) with NA
, for each of the three measured variables in dat
, as follows:
> dat$tpcp[dat$tpcp == -9999] = NA > dat$mmxt[dat$mmxt == -9999] = NA > dat$mmnt[dat$mmnt == -9999] = NA
The only difference from how we did this operation in the previous chapter is the dat$
part. This means that we refer to columns of the data.frame
object (dat
), rather than independent vectors. Now, let's convert the tpcp
values to mm units and mmxt
and mmnt
values to degree Celsius units by dividing each value in the respective columns by 10, as follows:
> dat$tpcp = dat$tpcp / 10 > dat$mmxt = dat$mmxt/ 10 > dat$mmnt = dat$mmnt / 10
Note that if we would have made the division by 10 before encoding the -9999
values as NA
, we would have got the -999.9
values, while now that we have NA
values, they will remain NA
since NA/10
gives NA
. This highlights the importance of representing missing data with NA
to reduce the chance of mistakes.
Using the [
operator, we can obtain a subset of a data.frame
object, which will include the intersection of any number of rows and columns. This works the same way as vectors subsetting with one difference, that is, a data.frame
is a two-dimensional object while a vector is a one-dimensional object. Therefore, we need to provide two indices rather than just one.
Remember that when subsetting with the [
operator in data.frame
, the first index refers to rows and the second index refers to columns. This arrangement is going to appear in other contexts as well (for example, in matrices and rasters).
The two vectors of indices, used in order to create a subset of a data.frame
object, can include any combination of the following:
data.frame
We have been extensively using methods 1 and 3 in vectors (see the previous chapter), so extending the methods to the two-dimensional case should be intuitive. In fact, method 2 can also be used with vectors since vectors can have element names the same way that a data.frame
object has row names and column names (but we are not going to use that here).
For example, the following expression gives us the element populating the second row and the third column of df
:
> df[2, 3] [1] "B"
Leaving an empty space instead of the row's or column's index indicates we are interested in all the elements of the respective dimension (all rows or all columns). For example, the following expressions return all elements of the second row and the third column of df
:
> df[2, ] num lower upper 2 2 b B > df[ ,3] [1] "A" "B" "C" "D"
By default, a subset of a data.frame
object is converted into a simpler class if values from a single column are involved. For example, the df[2, 3]
and df[ ,3]
expressions returned (character) vectors. The df[2, ]
expression returned a data.frame
object since three columns are involved. In fact, the second row of df
contains both numeric and character values, while we already know that a vector can only contain values of the same type. If we wish, we can suppress the data.frame
simplification by using the drop
parameter, indicating FALSE
(instead of the default value, TRUE
), and then the subset will remain a data.frame
object no matter what:
> df[ ,3, drop = FALSE] upper 1 A 2 B 3 C 4 D
Compare the output to the one from the previous example. Using drop=FALSE
in the previous expression, we got a data.frame
object (with four rows and one column) instead of a vector.
The other two methods of subsetting a data.frame
object are using logical and character vectors as indices. Let's take a look at the following example:
> df[df$lower %in% c("a","d"), c("lower","upper")] lower upper 1 a A 4 d D
In this expression, if we put it in plain language, we are requesting to get the subset of df
with the rows being where the values of the lower
column are either "a"
or "d"
, and the columns are both lower
and upper
.
One very helpful function to use with data.frame
objects is complete.cases
. This function returns a logical vector, the same length as the number of rows in the data.frame
object, indicating whether each row (case) is complete (has no NA
values in it). When a given row is complete, the respective element in the logical vector will be TRUE
; when a row is incomplete, the value will be FALSE
. Then, the resulting logical vector can be used to remove the incomplete rows from a table as follows (note that the output is not printed here to save space):
> dat[complete.cases(dat), ]
Note that in this expression, we use the vector returned by complete.cases
as a logical vector indicating the selection of rows in dat
to be retained.
As previously shown, we can assign new values to a column of a table (or to a subset of a column) using the $
operator. If the column name we assign does not exist in the table, a new column will be created to accommodate the data. Let's take a look at the following examples:
> df num lower upper 1 1 a A 2 2 b B 3 3 c C 4 4 d D > df$word[df$num == 2] = "Two" > df num lower upper word 1 1 a A <NA> 2 2 b B Two 3 3 c C <NA> 4 4 d D <NA>
Here, we made an assignment of the character value "Two"
to a subset of the word
column (which did not previously exist in df
) corresponding to the rows where the value of column num
is equal to 2
. As a result, a new column has been created, which contains the assigned value (and NA
for all the other elements). Note that <NA>
is simply the character representation of NA
(the column word
is a character vector).
As another example with our climatic data, we are going to create two new columns, holding the year and month of each measurement. For this purpose, we will first convert the date
column to a Date
object. Then, we will extract the years and months from the data in this column as follows (see the previous chapter for details):
> dat$date = as.Date(as.character(dat$date), format = "%Y%m%d") > dat$month = as.numeric(format(dat$date, "%m")) > dat$year = as.numeric(format(dat$date, "%Y")) > head(dat) station station_name elevation latitude longitude 1 GHCND:SP000060010 IZANA SP 2371 28.3089 -16.4992 2 GHCND:SP000060010 IZANA SP 2371 28.3089 -16.4992 3 GHCND:SP000060010 IZANA SP 2371 28.3089 -16.4992 4 GHCND:SP000060010 IZANA SP 2371 28.3089 -16.4992 5 GHCND:SP000060010 IZANA SP 2371 28.3089 -16.4992 6 GHCND:SP000060010 IZANA SP 2371 28.3089 -16.4992 date tpcp mmxt mmnt month year 1 1984-01-01 51.4 5.6 -0.5 1 1984 2 1984-02-01 0.0 9.8 2.8 2 1984 3 1984-03-01 68.7 7.2 0.2 3 1984 4 1984-04-01 13.6 12.2 3.5 4 1984 5 1984-05-01 21.4 11.8 3.0 5 1984 6 1984-06-01 6.5 19.6 10.5 6 1984
Using the first expression, we converted the dat$date
vector to a Date
object (and assigned it back to dat$date
). In the second and third expressions, we extracted the month and year components, as numeric vectors, out of dat$date
and assigned them to the new columns, dat$month
and dat$year
, respectively.
A data.frame
object can be written to a CSV file with the write.csv
function. The two first (and most important) parameters for this function indicate the name of the data.frame
object, which we would like to save, and the path to the new file (including the new filename). These parameters have no defaults, so we need to specify them. For example, the following expression writes the data.frame
object df
to the df.csv
file in the C:Data
directory:
> write.csv(df, "C:\Data\df.csv")
The newly created file when opened in Excel looks like the following screenshot:
Note that row names (the numbers 1 to 4 in column A) and column names have been added; this behavior can be disabled when required.
3.145.191.134