Spatial statistics

Most exploratory data analysis projects dealing with spatial data start by looking for, and potentially filtering, spatial autocorrelation. In simple terms, this means that we are looking for spatial effects in the data. For instance, the similarity of some data points can be (partly) explained by the short distance between them, while points farther apart tend to differ a lot more. There is nothing surprising in this statement; probably all of you agree with it. But how can we test this on real data with analytical tools?

Moran's I index is a well-known and widely used measure to test whether spatial autocorrelation is present in the variable of interest. It is a fairly simple statistical test with the null hypothesis that there is no spatial autocorrelation in the dataset.
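For reference, the usual textbook definition of the index can be written as follows. The notation is an assumption made here for illustration: n is the number of observations, x_i the value at location i, and w_ij the spatial weight between locations i and j.

```latex
I = \frac{n}{S_0}\,
    \frac{\sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij}\,(x_i - \bar{x})(x_j - \bar{x})}
         {\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
S_0 = \sum_{i=1}^{n}\sum_{j=1}^{n} w_{ij},
\qquad
\mathrm{E}[I] = -\frac{1}{n-1}
```

Under the null hypothesis the expected value depends only on n. With the 88 observations used in this section, it is -1/87, or roughly -0.0115, which is the expectation reported by the tests in this section.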

With the data structure we currently have, probably the easiest way to compute Moran's I is to load the ape package and pass the similarity matrix along with the variable of interest to the Moran.I function. First, let's compute this similarity matrix as the inverse of the Euclidean distance matrix:

> dm <- dist(dt[, c('lon', 'lat')])
> dm <- as.matrix(dm)
> idm <- 1 / dm
> diag(idm) <- 0
> str(idm)
 num [1:88, 1:88] 0 0.0343 0.1355 0.2733 0.0467 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:88] "1" "3" "6" "7" ...
  ..$ : chr [1:88] "1" "3" "6" "7" ...

Then let's replace all missing values in the TimeVar column with zero (the variance is undefined, and thus NA, for destinations with a single flight), and let's see if there is any spatial autocorrelation in the variance of the actual elapsed time of the flights:

> dt$TimeVar[is.na(dt$TimeVar)] <- 0
> library(ape)
> Moran.I(dt$TimeVar, idm)
$observed
[1] 0.1895178

$expected
[1] -0.01149425

$sd
[1] 0.02689139

$p.value
[1] 7.727152e-14

This was pretty easy, wasn't it? Based on the returned p-value, we can reject the null hypothesis, and the Moran's I of 0.19 suggests that the variation in the elapsed flight time is affected by the location of the destination airports, probably due to the very different distances.
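To make the statistic less of a black box, here is a minimal base R sketch that computes Moran's I by hand. It runs on made-up toy data, not on the flights dataset, and the names pts, x, w, w_std, and moran_i are hypothetical helpers introduced only for this illustration. As far as I know, Moran.I rescales each row of the weight matrix to sum to one before computing the index, so the sketch evaluates both the raw and the row-standardized weights:

```r
# Toy data: hypothetical coordinates and values, NOT the flights dataset
set.seed(42)
n   <- 20
pts <- data.frame(lon = runif(n, -120, -70), lat = runif(n, 25, 50))
x   <- pts$lat + rnorm(n)   # a value that changes smoothly with latitude

# Inverse Euclidean distance weights with a zero diagonal
w <- 1 / as.matrix(dist(pts[, c("lon", "lat")]))
diag(w) <- 0

# Row-standardized weights: every row sums to 1
w_std <- w / rowSums(w)

# Moran's I: (n / S0) * sum_ij(w_ij * z_i * z_j) / sum_i(z_i^2)
moran_i <- function(x, w) {
  z  <- x - mean(x)                 # centered values
  s0 <- sum(w)                      # sum of all weights
  (length(x) / s0) * sum(w * outer(z, z)) / sum(z^2)
}

i_raw <- moran_i(x, w)       # weights used as they are
i_std <- moran_i(x, w_std)   # row-standardized weights
e_i   <- -1 / (n - 1)        # expected value under the null hypothesis
```

Multiplying the whole weight matrix by a constant leaves the index unchanged, because the constant cancels against S0. Rescaling row by row is a different operation, so i_raw and i_std generally do not agree.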

The spdep package, a reverse dependency of the previously mentioned sp package, can also compute this index, although we first have to transform the similarity matrix into a listw object:

> library(spdep)
> idml <- mat2listw(idm)
> moran.test(dt$TimeVar, idml)

  Moran's I test under randomisation

data:  dt$TimeVar  
weights: idml  

Moran I statistic standard deviate = 1.7157, p-value = 0.04311
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
      0.108750656      -0.011494253       0.004911818

Although the test results are similar to the previous run, and we can still reject the null hypothesis of zero spatial autocorrelation at the 5 percent level, the Moran's I index and the p-values are not identical. This is mainly due to the fact that the ape package used a weight matrix for the computation, while the moran.test function was intended to be used with polygon data, as it requires the neighbor lists of the data. As our example included point data, this is not a clear-cut solution. Another main difference between the approaches is that the ape package uses normal approximation, while spdep implements randomization. But this difference is still way too high, isn't it?

Reading the function documentation reveals that we can improve the spdep approach: when converting the matrix into a listw object, we can specify the actual type of the originating matrix. In our case, as we are using the inverse distance matrix, a row-standardized style seems more appropriate:

> idml <- mat2listw(idm, style = "W")
> moran.test(dt$TimeVar, idml)

  Moran's I test under randomisation

data:  dt$TimeVar  
weights: idml  

Moran I statistic standard deviate = 7.475, p-value = 3.861e-14
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     0.1895177587     -0.0114942529      0.0007231471

Now the differences between this and the ape results are in an acceptable range, right?
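The remaining gap between the two p-values can be checked with a little arithmetic. The following sketch only reuses numbers printed in the outputs above and assumes that Moran.I was run with its default two-sided alternative, while moran.test reports the one-sided "greater" alternative, as its output states:

```r
# Values copied from the Moran.I output above
obs    <- 0.1895178
expd   <- -0.01149425
sd_ape <- 0.02689139

# Standard deviate: how many standard deviations the observed
# index lies above its expectation
z <- (obs - expd) / sd_ape

# Two-sided versus one-sided ("greater") p-value for the same deviate
p_two <- 2 * pnorm(-abs(z))
p_one <- pnorm(z, lower.tail = FALSE)
```

The deviate lands very close to the 7.475 reported by moran.test, and a two-sided p-value is simply twice the one-sided one, which accounts for most of the difference between the two reported p-values.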

Unfortunately, this section cannot cover related questions or other statistical methods dealing with spatial data, but there are many really useful books out there dedicated to the topic. Please be sure to check the Appendix at the end of the book for some suggested titles.
