Extracting raster values based on vector layers

So far, we have covered operations to transform a vector layer to a raster and vice versa. The third operation involving vector layers and rasters, and the focus of this final section, is the extraction of raster values according to vector layers. We are often interested in reducing or summarizing raster data using point, line, or polygon features, which is when this operation comes in handy. For example, we may wish to calculate the elevation profile covered by a GPS track (raster-to-line extraction) or the average NDVI of a given forest (raster-to-polygon extraction). In this section, we will see two examples involving extraction by points and extraction by polygons.

Extracting by points

Extraction of raster values, according to a vector layer of any kind, can be done with the extract function. The first two parameters of this function are as follows:

  • The raster whose values are to be extracted (x)
  • The object (usually a vector layer) defining the locations to extract values (y) from

When extracting values according to points, which is the simplest extract scenario, the returned object may be either of the following:

  • Vector—when extracting values from a single band raster
  • Matrix—when extracting values from a multiband raster
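These two cases can be illustrated with a minimal sketch, assuming the raster package is loaded; the tiny raster and the point coordinates below are made up purely for demonstration:

```r
library(raster)

# A small single-band raster (default extent 0-1 on both axes)
# and a two-band stack derived from it
r1 <- raster(matrix(1:9, nrow = 3, ncol = 3))
s  <- stack(r1, r1 * 10)

# Two points given as a two-column matrix of (x, y) coordinates
pnts <- cbind(x = c(0.2, 0.8), y = c(0.8, 0.2))

extract(r1, pnts)  # numeric vector: one value per point
extract(s, pnts)   # matrix: one row per point, one column per band
```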

As an example of extracting single band raster values to points, we will use the spain_stations.csv file we previously created (see Chapter 3, Working with Tables), which contains a table with the spatial location records of 96 meteorological stations in Spain. We will create a SpatialPointsDataFrame object based on this table, and then use it to extract elevation values from a DEM of Spain. The DEM will be obtained from a file (available on the book's website) named spain_elev.tif.

First, we will read the DEM file into R. This file, by the way, was created by merging SRTM tiles downloaded with getData—exactly as we did to create the Haifa DEM in the previous chapter—followed by reprojection, masking, and aggregation to a 900 meter resolution.

> dem_spain = raster("C:\\Data\\spain_elev.tif")

Now, we will read the station records from spain_stations.csv:

> stations = read.csv("C:\\Data\\spain_stations.csv",
+ stringsAsFactors = FALSE)

As you remember, the station locations are stored in the longitude and latitude columns; therefore, the data.frame object can be promoted to a SpatialPointsDataFrame object as follows:

> coordinates(stations) = ~ longitude + latitude

The station coordinates are given in degrees according to a geographic CRS. Therefore, we will define this CRS, and then reproject stations to match the CRS of dem_spain (UTM zone 30N) as follows:

> proj4string(stations) = CRS("+proj=longlat +datum=WGS84")
> stations = spTransform(stations, CRS(proj4string(dem_spain)))

Let's plot stations on top of dem_spain to see that they indeed match in location:

> plot(dem_spain)
> plot(stations, add = TRUE)

The graphical output is shown in the following screenshot:

[Figure: elevation map of Spain (dem_spain) with the stations point layer plotted on top]

What we see here is the elevation map of Spain (the dem_spain raster), with station locations on top (stations point layer). The stations seem to cover the whole area of the country, more or less evenly. Note that some of the stations are not shown since they are located on the Canary Islands and thus they are beyond the scope of the dem_spain raster.

To extract the elevation values underlying each station, we employ the extract function with the raster and the points as the first two arguments, respectively. Since the returned object is going to be a numeric vector with the respective elevation values for each station, we can directly assign it to a new column in stations named elev_dem:

> stations$elev_dem = extract(dem_spain, stations)

Examining the attribute table of stations shows that now we have two elevation entries per station. These are the elevation values originally provided along with the meteorological records (the elevation column) and the elevation values we just obtained from the DEM (the elev_dem column):

> head(stations@data)
            station elevation elev_dem
1 GHCND:SP000003195       667   651.96
2 GHCND:SP000004452       185   184.24
3 GHCND:SP000006155         7     5.80
4 GHCND:SP000008027       251   212.44
5 GHCND:SP000008181         4     3.55
6 GHCND:SP000008202       790   787.28

Examining this table more closely will show that some of the stations were assigned NA in the elev_dem column, either because they are located near Spain's border (and incidentally outside the DEM extent) or because they are located in the Canary Islands (which the dem_spain raster does not cover at all).
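A quick way to check how many stations are affected is to count the NA values. The following is a base R sketch using a made-up vector standing in for the real stations$elev_dem column; in practice, the same expressions would be applied to the actual column:

```r
# Hypothetical values standing in for stations$elev_dem
elev_dem <- c(651.96, NA, 5.80, NA, 3.55, 787.28)

sum(is.na(elev_dem))    # number of stations with no DEM value
which(is.na(elev_dem))  # their positions in the attribute table
```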

It would be interesting to see the degree of agreement between the two sources of information by plotting one vector as a function of the other:

> plot(elev_dem ~ elevation, stations,
+ xlim = c(0, 2000),
+ ylim = c(0, 2000),
+ xlab = "Elevation from station record (m)",
+ ylab = "Elevation from DEM (m)")

Note

Note that the plot(elev_dem~elevation,stations) expression is analogous to plot(stations$elev_dem~stations$elevation). In the latter syntax, the vector names are provided explicitly, while in the former (often more convenient) syntax, the formula refers to columns of a data.frame object provided as the second argument.

To make the assessment more convenient, we will also use the abline function that can add a straight line to an existing plot. One way of specifying the line's location is providing its intercept (a) and slope (b). We are going to add a 1:1 line (that is, a line with an intercept of 0 and a slope of 1), in order to see how well the elevation and elev_dem records match:

> abline(a = 0, b = 1, lty = "dotted")

The graphical output is shown in the following screenshot:

[Figure: scatterplot of elev_dem against elevation, with a dotted 1:1 line]

Each point corresponds to a single station, and we can see the respective elevation and elev_dem values on the x and y axes, respectively. The agreement between the two sources of information is, unsurprisingly, very good, except for a few stations that lie slightly farther from the 1:1 line.
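The visual agreement can also be quantified, for instance with the root mean squared difference and the correlation between the two columns. The following sketch uses only the six station values printed earlier; with the real stations layer, the full columns (with NA values removed) would be used instead:

```r
# Values copied from the head(stations@data) printout
elevation <- c(667, 185, 7, 251, 4, 790)
elev_dem  <- c(651.96, 184.24, 5.80, 212.44, 3.55, 787.28)

sqrt(mean((elev_dem - elevation)^2))  # root mean squared difference (m)
cor(elevation, elev_dem)              # Pearson correlation
```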

Extracting by polygons

When extracting raster values using line or polygon layers, each feature may correspond to more than one value in each raster layer (unlike with points, where each feature always corresponds to a single value). In fact, a variable number of values may be extracted for each feature, depending upon its shape. For example, the polygon delineating the Lahav forest, which we created earlier, covers 10,586 Landsat pixels and 38 MODIS pixels, while the Kramim polygon covers 7,538 Landsat pixels and 28 MODIS pixels. We have two ways to deal with this variation:

  • Reducing the vector of values from each feature into a single value (specifying a function to do that, with the fun parameter), in which case the returned object may be, just as with point layers:
    • A vector—when extracting from a single-band raster
    • A matrix—when extracting from a multiband raster
  • Keeping all values (specifying no function with fun=NULL, the default value), in which case the returned object will be a list (with df=FALSE, the default value) or a data.frame object (with df=TRUE)

When a function is provided with the fun parameter, the additional parameter na.rm determines whether NA cells are included in the calculation.
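The difference between the two alternatives can be sketched with a toy example, again assuming the raster package is loaded; the small raster and the polygon below are made up for demonstration:

```r
library(raster)

# A 3x3 raster (extent 0-1) and a polygon covering its left part
r1  <- raster(matrix(1:9, nrow = 3, ncol = 3))
pol <- as(extent(0, 0.6, 0, 1), "SpatialPolygons")

extract(r1, pol)                   # fun = NULL: list of all covered values
extract(r1, pol, df = TRUE)        # fun = NULL, df = TRUE: a data.frame
extract(r1, pol, fun = mean, na.rm = TRUE)  # one summarized value per feature
```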

Proceeding with the forest example, we will now complete the two remaining steps: extracting the NDVI data and arranging it in a table. We are going to extract NDVI values according to the forests layer with the first alternative—summarizing the values (specifically with fun=mean, giving a mean NDVI per forest). Our ultimate goal is to have, by the end of this chapter, a table with three ID columns (date, sat – satellite, and forest) and a fourth column (ndvi) holding the measured average NDVI values.

Starting with the assembly of NDVI images (ndvi_98, ndvi_00, ndvi_03, and r), we will read the Landsat images from 1998 and 2003 and calculate NDVI (assuming that r and ndvi_00 are already in memory, as is our custom-made ndvi function; see Chapter 4, Working with Rasters).

> l_98 = brick("C:\\Data\\landsat_15_10_1998.tif")
> l_03 = brick("C:\\Data\\landsat_11_09_2003.tif")
> ndvi_98 = calc(l_98, fun = ndvi)
> ndvi_03 = calc(l_03, fun = ndvi)

Next, we will create a Date object, named l_dates, holding the Landsat image dates (see the filenames). We will use this object later when creating a table of results.

> l_dates = as.Date(c("1998-10-15", "2000-10-04", "2003-09-11"))

We are ready to proceed with the extraction—employing the extract function on the three Landsat NDVI images to obtain the mean NDVI values per forest, per date:

> l_forests = extract(stack(ndvi_98, ndvi_00, ndvi_03),
+ forests,
+ fun = mean,
+ na.rm = TRUE)

Note

Note that to make things clearer, we first apply extract on a RasterStack object of the three Landsat images. The MODIS data will be extracted from r in a separate step.

Since we are extracting values from a multiband raster, yet employing a function (mean) to summarize those values, the returned object, assigned to l_forests, is a matrix. Its two rows correspond to the forests features, while its three columns correspond to the layers of stack(ndvi_98,ndvi_00,ndvi_03). For example, we can see that the average NDVI observed by Landsat on October 15, 1998 in the Lahav forest was 0.3053538:

> l_forests
       layer.1   layer.2  layer.3
[1,] 0.3053538 0.2487563 0.284487
[2,] 0.2840073 0.2190098 0.243326

Right now we can already tell that, in both forests, NDVI decreased between 1998 and 2000 and then (incompletely) recovered between 2000 and 2003.
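This can be verified directly from the matrix. The following sketch recreates l_forests from the values printed above and computes the per-forest differences between consecutive dates:

```r
# Values copied from the l_forests printout (rows: Lahav, Kramim;
# columns: the 1998, 2000, and 2003 images)
l_forests <- rbind(c(0.3053538, 0.2487563, 0.2844870),
                   c(0.2840073, 0.2190098, 0.2433260))

l_forests[, 2] - l_forests[, 1]  # 1998 to 2000: negative (NDVI decreased)
l_forests[, 3] - l_forests[, 2]  # 2000 to 2003: positive (partial recovery)
```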

By repeating the same procedure with r, we will create the analogous r_forests matrix:

> r_forests = extract(r,
+ forests,
+ fun = mean,
+ na.rm = TRUE)

This time the matrix has 280 columns since r has 280 layers:

> dim(r_forests)
[1]   2 280

Proceeding with the third step, we would like to have the information from l_forests and r_forests in a single data.frame object with all NDVI values in a single column, and additional columns characterizing the measurements (date, sat, and forest). Starting with the l_forests matrix, we will first transpose it (using the t function) and convert it to a data.frame object (using the as.data.frame function):

> l_forests = as.data.frame(t(l_forests))
> l_forests
               V1        V2
layer.1 0.3053538 0.2840073
layer.2 0.2487563 0.2190098
layer.3 0.2844870 0.2433260

Now, we can set the appropriate column names (the forest names) and create new columns for dates (obtained from l_dates) and satellite ("Landsat") as follows:

> colnames(l_forests) = forests$name
> l_forests$date = l_dates
> l_forests$sat = "Landsat"

The new l_forests data.frame object looks as follows:

> l_forests
            Lahav    Kramim       date     sat
layer.1 0.3053538 0.2840073 1998-10-15 Landsat
layer.2 0.2487563 0.2190098 2000-10-04 Landsat
layer.3 0.2844870 0.2433260 2003-09-11 Landsat

Exactly the same procedure is repeated for r_forests (with acquisition dates taken from dates$date and the satellite name set to "MODIS"):

> r_forests = as.data.frame(t(r_forests))
> colnames(r_forests) = forests$name
> r_forests$date = dates$date
> r_forests$sat = "MODIS"

Now, we can combine the two data.frame objects using rbind:

> forests_ndvi = rbind(l_forests, r_forests)

The combined data.frame object, which we named forests_ndvi, contains all average NDVI records for the two forests from the two satellites, collectively for 283 dates (three dates from Landsat and 280 dates from MODIS). Its first few rows are printed as follows:

> head(forests_ndvi)
            Lahav    Kramim       date     sat
layer.1 0.3053538 0.2840073 1998-10-15 Landsat
layer.2 0.2487563 0.2190098 2000-10-04 Landsat
layer.3 0.2844870 0.2433260 2003-09-11 Landsat
modis.1 0.3725111 0.3416607 2000-02-18   MODIS
modis.2 0.3959158 0.3850857 2000-03-05   MODIS
modis.3 0.4102210 0.3956179 2000-03-21   MODIS

What is left to be done is to transform the data.frame object to a longer form (see Chapter 3, Working with Tables), creating another column for forest identity and transferring the NDVI values to a designated values column. This can be performed with the melt function from the reshape2 package:

> library(reshape2)
> forests_ndvi = melt(forests_ndvi,
+ measure.vars = forests$name,
+ variable.name = "forest",
+ value.name = "ndvi")

Note that the measured variables here are "Lahav" and "Kramim", while the rest are treated as ID variables. Instead of typing the measure variable column names, we passed the forests$name vector, which already contains the necessary names:

> forests$name
[1] "Lahav"  "Kramim"

The additional parameters variable.name and value.name in the melt function call are used to specify the names of the newly created variable and value columns (to replace the default names "variable" and "value", respectively). The final table looks as follows:

> head(forests_ndvi)
        date     sat forest      ndvi
1 1998-10-15 Landsat  Lahav 0.3053538
2 2000-10-04 Landsat  Lahav 0.2487563
3 2003-09-11 Landsat  Lahav 0.2844870
4 2000-02-18   MODIS  Lahav 0.3725111
5 2000-03-05   MODIS  Lahav 0.3959158
6 2000-03-21   MODIS  Lahav 0.4102210

In Chapter 9, Advanced Visualization of Spatial Data, we are going to use this table to create a plot of the NDVI evolution over time in these two forests. A table of this form, where:

  • Each variable forms a column
  • Each observation forms a row
  • There is only one type of observational unit

constitutes a so-called tidy data table (see the paper by Hadley Wickham on this subject, 2014). As we shall see, bringing our data to such a form is often required to use more sophisticated graphical functions such as those in the ggplot2 package.
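As a small illustration of why the tidy form is convenient, grouped summaries become one-liners. This sketch rebuilds only the six rows printed above (all from the Lahav forest) and computes the mean NDVI per forest and satellite with base R's aggregate; the real forests_ndvi table would be summarized the same way:

```r
# The six rows printed above, rebuilt as a data.frame
forests_ndvi <- data.frame(
  date   = as.Date(c("1998-10-15", "2000-10-04", "2003-09-11",
                     "2000-02-18", "2000-03-05", "2000-03-21")),
  sat    = rep(c("Landsat", "MODIS"), each = 3),
  forest = "Lahav",
  ndvi   = c(0.3053538, 0.2487563, 0.2844870,
             0.3725111, 0.3959158, 0.4102210))

# Mean NDVI per forest and satellite
aggregate(ndvi ~ forest + sat, forests_ndvi, mean)
```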
