Mapping the annual temperature in Spain

In the previous sections, we saw how to perform spatial interpolation of point data in R using several methods, including IDW, OK, and UK. We also learned that, in the case of minimum temperature in 2002, UK outperformed the other two methods in terms of accuracy, with a LOOCV RMSE value of 1.46, compared to 1.88 and 1.68 for IDW and OK, respectively. In this section, we are going to see how we can automate spatial interpolation in order to produce a set of temperature maps for different years with a single execution of the code. For this purpose, we are going to construct two loops:

  • The first loop will go through 20 point layers (10 years * 2 variables), each time calculating the LOOCV RMSE of prediction with each of the three methods (IDW, OK, and UK). Based on the results, we will select the most accurate method (on average) out of these three.
  • We will then construct the second loop to produce predicted temperature maps of Spain (for the 5 years 2006-2010) using the selected method.

We will begin with the first task of systematically evaluating RMSE among the different years, variables, and methods. A very useful function for such tasks, named expand.grid, will help us keep track of the examination. Given a set of vectors, expand.grid returns a data.frame object with all possible combinations of the elements from each vector. The following expression, for example, creates a data.frame object, named cv_results, with four columns (variable, year, method, and rmse):

> cv_results = expand.grid(
+   variable = c("mmnt", "mmxt"),
+   year = 2001:2010,
+   method = c("IDW", "OK", "UK"),
+   rmse = NA)

Since only three of the columns have more than one possible element, and there are 2 variables * 10 years * 3 methods = 60 unique ways to combine them, cv_results has 60 rows. The first few rows of this data.frame object look as follows:

> head(cv_results)
  variable year method rmse
1     mmnt 2001    IDW   NA
2     mmxt 2001    IDW   NA
3     mmnt 2002    IDW   NA
4     mmxt 2002    IDW   NA
5     mmnt 2003    IDW   NA
6     mmxt 2003    IDW   NA

This table is going to help us in two ways:

  • First, we are going to go through its rows, each time performing LOOCV for yet another prediction model defined with the respective values in the variable, year, and method columns. This way, we can be sure that all the possible combinations are covered.
  • Second, the resulting RMSE values will be assigned in the respective position in the rmse column so that, by the time the loop execution is complete, we will conveniently have a table with all of the results.

The loop will be defined to go through all rows of cv_results, with i being assigned the current row index each time. The code within the loop will then perform the following operations:

  1. Create a point layer named dat (using the create_pnt function we previously defined) with the variable and year set according to the current row of cv_results (rather than at fixed values such as 2002 and "mmnt").
  2. Create an object defining the prediction formula (form) and variogram model (v_mod).
  3. Create a gstat object based on the appropriate point layer (dat), formula (form), and variogram model (v_mod).
  4. Perform LOOCV.
  5. Calculate RMSE and assign the result in the appropriate position in the rmse column of cv_results.

The code snippet, with the code sections for each of the preceding steps marked by a comment, is as follows:

> for(i in 1:nrow(cv_results)) {
+   # (1) Create point layer as required
+   dat = create_pnt(stations = spain_stations,
+     annual = spain_annual,
+     year = cv_results$year[i],
+     variable = cv_results$variable[i],
+     new_proj = proj4string(dem_spain))
+   # (2) Create *form* and *v_mod* objects
+   if(cv_results$method[i] == "IDW") {
+     form = value ~ 1
+     v_mod = NULL
+   } else {
+     if(cv_results$method[i] == "OK") {
+       form = value ~ 1}
+     if(cv_results$method[i] == "UK") {
+       form = value ~ elevation}
+     v_mod = autofitVariogram(
+       formula = form,
+       input_data = dat)$var_model}
+   # (3) Create gstat object
+   g = gstat(formula = form, data = dat, model = v_mod)
+   # (4) Perform cross-validation
+   cv = gstat.cv(g)
+   # (5) Calculate RMSE and assign to cv_results
+   cv_results$rmse[i] = rmse(cv)
+ }

Note that, in step 2, nested conditional statements are used in order to determine form and v_mod according to the interpolation method that is being used. We can write down the procedure this particular code section performs in the form of a decision tree:

  • If the method is "IDW", then form is value~1 and v_mod is NULL
  • Otherwise:
    • If the method is "OK", then form is value~1
    • If the method is "UK", then form is value~elevation
    • The v_mod is calculated according to form
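The same decision tree can also be factored out into a small helper function, which some readers may find easier to follow. The following is only a sketch: the get_model_spec name and its list-based return value are our own, not part of the chapter's code.

```r
# A sketch of the decision tree as a helper function.
# get_model_spec and its return format are our own invention.
get_model_spec = function(method) {
  if(method == "IDW") {
    list(form = value ~ 1, fit_variogram = FALSE)
  } else if(method == "OK") {
    list(form = value ~ 1, fit_variogram = TRUE)
  } else { # "UK"
    list(form = value ~ elevation, fit_variogram = TRUE)
  }
}
```

Inside the loop, form would then be spec$form, and autofitVariogram would be called only when spec$fit_variogram is TRUE.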

Once the loop execution is completed (this may take a few moments), the rmse column of cv_results will be filled with the resulting RMSE values instead of NA values:

> head(cv_results)
  variable year method     rmse
1     mmnt 2001    IDW 1.942872
2     mmxt 2001    IDW 1.594996
3     mmnt 2002    IDW 1.880490
4     mmxt 2002    IDW 1.570574
5     mmnt 2003    IDW 1.912887
6     mmxt 2003    IDW 1.605938

Note that, for instance, the RMSE value in the third row of the table (1.880490), which corresponds, as expected, to the IDW interpolation of "mmnt" for 2002, is identical to the result we previously obtained manually. Several interesting comparisons can be made using this RMSE table. As previously stated, we are mainly interested in seeing which interpolation method is the most accurate, on average, across the 10 years and the two variables. To find out, we can use tapply:

> tapply(cv_results$rmse, cv_results$method, mean)
     IDW       OK       UK
1.773243 1.908957 1.458997

The result shows that, on average, UK yields the lowest prediction error among the three methods. Interestingly, OK is less accurate, on average, than IDW although in the particular example we previously saw it was the other way around. Based on this result, UK is going to be our method of choice to produce annual temperature maps of Spain.
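Incidentally, tapply also accepts a list of grouping factors, so the comparison can be broken down by method and variable at once (the actual averages will, of course, depend on the data):

```r
# Mean LOOCV RMSE for each method/variable combination;
# rows are methods, columns are variables
tapply(cv_results$rmse,
       list(cv_results$method, cv_results$variable),
       mean)
```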

Note

It should be noted that comparing cross-validation RMSE is only meaningful among different spatial prediction models of the same data. Otherwise, RMSE may be higher or lower not only because the prediction error differs but also because the predicted values themselves are on different scales. For example, the RMSE of a precipitation amount ("tpcp") prediction would have been much higher than the RMSE of a temperature prediction in our case (you can try this and see), simply because the annual precipitation amounts generally observed in Spain (in mm) are higher by an order of magnitude than the temperatures (in degrees Celsius). However, this does not mean that the precipitation prediction models are less accurate with respect to the scale of the measured variable itself.
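If one did want to compare accuracy across variables measured on different scales, a common workaround is to normalize the RMSE, for example by the mean of the observed values. The following rmse_norm function is a sketch of this idea and is not part of the chapter's code:

```r
# Normalized RMSE: RMSE divided by the mean observed value, making
# errors comparable across variables measured on different scales
rmse_norm = function(observed, predicted) {
  sqrt(mean((observed - predicted)^2)) / mean(observed)
}
```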

We are ready to move on and produce annual temperature maps using another loop that calculates the predicted values (rather than performing cross-validation, which we just did). Once again, a well-organized way of doing this is to construct a table describing the set of parameters we need for each spatial interpolation, then go through that table and perform the required operations in turn. We will once again use the expand.grid function, producing all combinations of just two variables this time: year (the 5 years 2006-2010) and variable ("mmnt" and "mmxt"). The method will be left unspecified since we have already decided it will invariably be UK.

> spainT_tab = expand.grid(
+   year = 2006:2010,
+   variable = c("mmnt", "mmxt"))

The resulting table, named spainT_tab, looks as follows:

> spainT_tab
   year variable
1  2006     mmnt
2  2007     mmnt
3  2008     mmnt
4  2009     mmnt
5  2010     mmnt
6  2006     mmxt
7  2007     mmxt
8  2008     mmxt
9  2009     mmxt
10 2010     mmxt

We are ready to create the second loop that will go through the rows of spainT_tab and spatially interpolate the respective temperature data. Since our results are going to be rasters (rather than numeric values, as was the case with the cross-validation loop) in this case, they cannot be appended to the table itself. Instead, we are going to stack the results in a multiband raster where each layer is going to correspond to a row in spainT_tab. In other words, spainT_tab is going to function as a supplementary table for the resulting multiband raster (much as dates served the same purpose for the raster r; see Chapter 4, Working with Rasters). To construct the raster, we will first create an empty RasterStack to which the interpolation results will be appended during loop execution:

> spainT = stack()

The loop will go through the rows of spainT_tab, each time performing the following set of operations:

  1. Creating a point layer named dat with the variable and year set according to the current row of spainT_tab.
  2. Fitting a variogram model (v).
  3. Creating a gstat object (g) according to dat and v.
  4. Calculating the raster of predicted values (z) with interpolate, based on g and dem_spain.
  5. Masking z according to dem_spain and attaching the result to spainT.

The loop code appears as follows:

> for(i in 1:nrow(spainT_tab)) {
+   # (1) Create point layer as required
+   dat = create_pnt(stations = spain_stations,
+     annual = spain_annual,
+     year = spainT_tab$year[i],
+     variable = spainT_tab$variable[i],
+     new_proj = proj4string(dem_spain))
+   # (2) Automatically fit variogram model
+   v = autofitVariogram(formula = value ~ elevation,
+     input_data = dat)
+   # (3) Create gstat object
+   g = gstat(formula = value ~ elevation,
+     model = v$var_model,
+     data = dat)
+   # (4) Interpolate!
+   z = interpolate(dem_spain, g, xyOnly = FALSE)
+   # (5) Mask and add predicted surface to results stack
+   spainT = stack(spainT, mask(z, dem_spain))
+ }

Once loop execution is complete, interpolation results are stored as individual layers of the spainT raster. The raster should consist of 10 layers, corresponding to the 10 rows in spainT_tab. Right now, the layer names are not very informative:

> names(spainT)
 [1] "var1.pred.1.1.1" "var1.pred.2.1.1" "var1.pred.1.2.1"
 [4] "var1.pred.2.2.1" "var1.pred.1.1.2" "var1.pred.2.1.2"
 [7] "var1.pred.1.2.2" "var1.pred.2.2.2" "var1.pred.1"    
[10] "var1.pred.2"

However, with spainT_tab at our disposal, we can easily change this as follows:

> names(spainT) = paste(spainT_tab$variable,
+ spainT_tab$year,
+ sep = "_")
> names(spainT)
 [1] "mmnt_2006" "mmnt_2007" "mmnt_2008" "mmnt_2009" "mmnt_2010"
 [6] "mmxt_2006" "mmxt_2007" "mmxt_2008" "mmxt_2009" "mmxt_2010"

Let's now plot spainT using the levelplot function. With layout=c(5,2), we make sure that the minimum and maximum temperature maps are plotted in distinct rows:

> library(rasterVis)
> levelplot(spainT, par.settings = "BuRdTheme", layout = c(5,2))

The graphical output is shown in the following screenshot:


The preceding screenshot shows predicted minimum (top row) and maximum (bottom row) temperatures in Spain for each of the years from 2006 to 2010. As can be expected, maximum temperatures are always higher than minimum temperatures. The predicted spatial pattern of temperature is fairly similar over the years, which also makes sense: colder areas are generally colder in all years, as are warmer areas. However, this poses a problem when interpreting this type of result, as the characteristic temperature pattern within each year overwhelms the differences among years, making it difficult to notice the year-to-year variation. This can be partially addressed by plotting the minimum and maximum temperature rasters separately (in order to use a wider color scale each time), though the interested reader can verify that this does little to solve the problem.

When we are interested in interannual differences only, the most reasonable course of action is to determine the characteristic annual temperature state, calculate the deviation of each individual year's state from it, and then compare the deviation images rather than the original ones. Calculating deviations from a general trend is a very common practice for highlighting patterns of interest in data, and there are many ways to determine both the general trend and each observation's deviation from it. Here, we will employ the simplest possible approach: the characteristic temperature in each pixel will be defined as the 5-year average observed in that pixel, and the deviations will be the arithmetic differences of each observation from that average.

Since we need to treat the minimum and maximum temperature data separately, we will first create two vectors identifying the relevant spainT layers for each case:

> mmnt_layers = which(spainT_tab$variable == "mmnt")
> mmnt_layers
[1] 1 2 3 4 5
> mmxt_layers = which(spainT_tab$variable == "mmxt")
> mmxt_layers
[1]  6  7  8  9 10

We will also create a new two-band RasterStack, named means, to hold the minimum and maximum mean temperature pattern in its first and second layers, respectively. The two layers will also be named accordingly ("mmnt" and "mmxt"):

> means = stack(mean(spainT[[mmnt_layers]]),
+ mean(spainT[[mmxt_layers]]))
> names(means) = c("mmnt", "mmxt")

Now that we have the means, all we need to do is subtract the appropriate mean from each set of layers in spainT (the mean of the minimum temperatures from layers 1-5 and the mean of the maximum temperatures from layers 6-10). There are numerous ways to do this. For example, we can divide the spainT raster into two, subtract the respective mean from each substack, and then combine them once again. A more elegant way, extendable to any number of categories, however, is to create a new means raster with the same number of layers as spainT, by means of duplication, with the relevant means occupying its layers. In other words, we need a raster with means[[1]] duplicated five times (occupying layers 1-5) and means[[2]] duplicated five times (occupying layers 6-10). To do this, we can utilize the fact that raster layers can also be selected by their names. The following expression therefore takes the "mmnt" and "mmxt" layers and duplicates them in agreement with the spainT_tab$variable column:
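The behavior relied upon here, selection by name with repetition, can be demonstrated with an ordinary named vector (a toy example, unrelated to the rasters themselves):

```r
# Selecting by name repeats an element as many times as its name appears
x = c(mmnt = 1, mmxt = 2)
x[c("mmnt", "mmnt", "mmxt", "mmxt")]
# mmnt mmnt mmxt mmxt
#    1    1    2    2
```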

> means = means[[spainT_tab$variable]]

This gives us the desired result, a 10-band raster with temperature means for either the minimum or maximum temperature, according to spainT_tab (and thus matching spainT). The layer names reveal that indeed the first five layers correspond to the minimum temperature and the last five layers correspond to the maximum temperature:

> names(means)
 [1] "mmnt.1" "mmnt.2" "mmnt.3" "mmnt.4" "mmnt.5" "mmxt.1"
 [7] "mmxt.2" "mmxt.3" "mmxt.4" "mmxt.5"

What is left to be done is just to subtract one 10-band raster from the other:

> spainT = spainT - means

The data we now have in the new spainT raster is the temperature deviation from the 5-year (2006-2010) average pattern of either the minimum or maximum temperature. We can check and see that the deviations fall roughly within ±4 degrees Celsius of the respective 5-year mean:

> range(spainT[], na.rm = TRUE)
[1] -4.020094  3.200148

Plotting these deviations will help us see more clearly where exactly the temperature was high or low, with respect to the average pattern, and what the magnitude of the departure was. This time, we will also indicate that contours should be drawn at 1°C intervals:

> levelplot(spainT,
+ par.settings = "BuRdTheme",
+ layout = c(5,2),
+ contour = TRUE,
+ at = c(-4:-1,1:4))

The graphical output is shown in the following screenshot:


This time, we can easily notice, for example, that the minimum temperatures in 2010 were higher than average in southern Spain while the maximum temperatures were lower than average in the north of the country.

This is the final product we were looking for, according to the goals specified at the beginning of this section. However, we are not done yet. In the next chapter, we will use spainT as one of the sample datasets while discussing how to produce plots with the ggplot2 package. Since ggplot2 is a general-purpose graphical package, the input data must come as a data.frame object. Moreover, the data table should be tidy (see the previous chapter) in order to conveniently exploit all of the possibilities this package offers. What does that mean in the present case? It means that we need to have a data.frame object with all the data currently held in spainT, with each column corresponding to a variable and each row corresponding to an observation. In this case, the variables are as follows:

  • x-coordinate
  • y-coordinate
  • Variable
  • Year
  • Predicted temperature

Each row in such a data.frame object will correspond to one pixel of an individual layer in spainT.

Luckily, the as.data.frame function (which we encountered in the previous chapter, when converting a matrix to a data.frame) also has a method to convert rasters to data.frame objects. By setting xy=TRUE, we specify that we would like a table containing not only the raster values (from all of its layers) but also the respective spatial coordinates:

> spainT = as.data.frame(spainT, xy = TRUE)
> head(spainT)
           x       y mmnt_2006 mmnt_2007 mmnt_2008 mmnt_2009
1 -11957.925 4857128        NA        NA        NA        NA
2  -8357.925 4857128        NA        NA        NA        NA
3  -4757.925 4857128        NA        NA        NA        NA
4  -1157.925 4857128        NA        NA        NA        NA
5   2442.075 4857128        NA        NA        NA        NA
6   6042.075 4857128        NA        NA        NA        NA
  mmnt_2010 mmxt_2006 mmxt_2007 mmxt_2008 mmxt_2009 mmxt_2010
1        NA        NA        NA        NA        NA        NA
2        NA        NA        NA        NA        NA        NA
3        NA        NA        NA        NA        NA        NA
4        NA        NA        NA        NA        NA        NA
5        NA        NA        NA        NA        NA        NA
6        NA        NA        NA        NA        NA        NA

Each row in this table represents a single pixel, while its spatial coordinates and values in each layer are specified in the respective columns. However, from the first few rows, we can see that NA cells were also included. Since they will not be plotted anyway, we can readily exclude them:

> spainT = spainT[complete.cases(spainT), ]
> head(spainT)
            x       y mmnt_2006    mmnt_2007   mmnt_2008 mmnt_2009
33  103242.08 4857128 0.7383187 -0.070727914 0.008270229 0.5218489
34  106842.08 4857128 0.7392765 -0.055515954 0.024604441 0.5227165
38  121242.08 4857128 0.7458803 -0.018719502 0.063198494 0.5252469
39  124842.08 4857128 0.7456079  0.002858330 0.084895939 0.5240072
40  128442.08 4857128 0.7454912  0.023069161 0.105275777 0.5226526
347  92442.08 4853528 0.7061661 -0.006305609 0.051584103 0.5112048
    mmnt_2010 mmxt_2006 mmxt_2007 mmxt_2008 mmxt_2009 mmxt_2010
33  -1.197710  2.012048 -2.361182  1.122896  1.941084 -2.714846
34  -1.231081  1.988164 -2.341944  1.118404  1.918249 -2.682873
38  -1.315606  1.932082 -2.291416  1.103507  1.857861 -2.602034
39  -1.357369  1.894492 -2.252492  1.094646  1.809158 -2.545804
40  -1.396489  1.859306 -2.215895  1.086683  1.763085 -2.493179
347 -1.262649  1.870778 -2.157029  1.059672  1.696265 -2.469685

Two things are left to be done to bring this data.frame object to the desired form:

  • First, we need to transform the table into a long form, creating a value column and keeping all other variables in separate individual columns. This can be achieved with the melt function (see Chapter 3, Working with Tables, for more details), specifying "x" and "y" columns as the ID variables so that the other columns are transferred into a single variable column:
    > library(reshape2)
    > spainT = melt(spainT, id.vars = c("x", "y"))
    > head(spainT)
              x       y  variable     value
    1 103242.08 4857128 mmnt_2006 0.7383187
    2 106842.08 4857128 mmnt_2006 0.7392765
    3 121242.08 4857128 mmnt_2006 0.7458803
    4 124842.08 4857128 mmnt_2006 0.7456079
    5 128442.08 4857128 mmnt_2006 0.7454912
    6  92442.08 4853528 mmnt_2006 0.7061661
  • Second, we need to split the variable column into two: the variable itself ("mmnt" or "mmxt") and the year (2006, 2007, 2008, 2009, or 2010). Substring extraction with substr (see Chapter 2, Working with Vectors and Time Series, for more information) comes in handy for this purpose since we can see that characters 1-4 in each element of spainT$variable consistently correspond to the variable, and characters 6-9 correspond to the year (to be sure, there are other ways to extract substrings in less convenient scenarios; for instance, the strsplit function can be used in the present context). Using the following two expressions, we first create a new (numeric) column holding the year values, and then modify the variable column to retain just the variable names:
    > spainT$year = as.numeric(substr(spainT$variable, 6, 9))
    > spainT$variable = substr(spainT$variable, 1, 4)
    > head(spainT)
              x       y variable     value year
    1 103242.08 4857128     mmnt 0.7383187 2006
    2 106842.08 4857128     mmnt 0.7392765 2006
    3 121242.08 4857128     mmnt 0.7458803 2006
    4 124842.08 4857128     mmnt 0.7456079 2006
    5 128442.08 4857128     mmnt 0.7454912 2006
    6  92442.08 4853528     mmnt 0.7061661 2006
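For completeness, the strsplit alternative mentioned above would have looked something like the following, applied to the original combined column (that is, before the substr expressions). This is only a sketch:

```r
# Splitting "mmnt_2006"-style values at the underscore; strsplit
# returns a list with one character vector per element, hence sapply
parts = strsplit(as.character(spainT$variable), "_")
spainT$variable = sapply(parts, "[", 1)         # "mmnt" / "mmxt"
spainT$year = as.numeric(sapply(parts, "[", 2)) # 2006, ..., 2010
```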

The final data.frame object is now complete.
