Dealing with missing values

In this section, we introduce how missing values are represented in R and how to deal with them. Missing values can arise in many situations during data collection and analysis: either the required information could not be acquired for some reason, or we would like to exclude some data from an analysis by marking them as missing. In spatial data analysis, for example, some districts in a surveyed area may have been inaccessible to the researcher, or parts of an aerial image may have been covered by clouds so that the features of interest there could not be digitized.

Missing values and their effect on data

The special value that marks missing values in R is NA. As briefly mentioned in the previous chapter, NaN values represent cases where the result cannot be represented within the real number system. NaN values behave in the same way as NA in all respects that are relevant here.
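As a minimal sketch of how the two values relate (the variable name nan_value is chosen here only for illustration), is.na treats both NA and NaN as missing, while is.nan can be used to tell them apart:

```r
# 0 divided by 0 has no defined result, so R returns NaN
nan_value = 0 / 0

# is.na() reports both NA and NaN as missing
na_check = is.na(c(NA, nan_value))
na_check

# is.nan() is TRUE only for NaN, not for NA
nan_check = is.nan(c(NA, nan_value))
nan_check
```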

Just as NaN values can result from undefined calculations (such as 0 divided by 0), NA values are created when there is not enough information to provide a result. For example, the 100th element of a vector that has only 10 elements is not available:

> x = 1:10
> x[100]
[1] NA

The average of a set of numbers including at least one NA is NA since the average can be ascertained only when all of the values it is based upon are known:

> x = c(2,5,1,0)
> mean(x)
[1] 2
> x[2] = NA
> x
[1]  2 NA  1  0
> mean(x)
[1] NA
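The same reasoning applies to other operations. The following sketch, which assumes the same four-element vector, illustrates that element-wise arithmetic and comparisons return NA only at the missing position, whereas a summary such as sum depends on every element and therefore returns NA as a whole:

```r
x = c(2, NA, 1, 0)

# Element-wise operations: only the missing position yields NA
scaled = x * 10
scaled                 # 20 NA 10 0

above_one = x > 1
above_one              # TRUE NA FALSE FALSE

# A summary over all elements cannot be determined
total = sum(x)
total                  # NA
```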

At times, we will be interested in marking certain values as NA ourselves. For example, if we have a dataset of a car's driving speeds and one of the values is 900 km/h, we will likely consider it a typing error and mark it as missing. Other times, the data we get to analyze will use a specific code that its creators chose to mark missing values (for example, -9999), and we would like to convert those values to NA in R. We will see an example of this later.
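As a rough illustration of both cases, the following sketch uses a hypothetical vector of speed readings. The values, the -9999 code, and the 300 km/h plausibility threshold are all assumptions made up for this example:

```r
# Hypothetical speed readings in km/h; -9999 is the assumed missing-data code
speed = c(62, 85, -9999, 900, 47)

# Mark both the missing-data code and implausibly high values as NA
# (the 300 km/h cutoff is an arbitrary assumption)
speed[speed == -9999 | speed > 300] = NA
speed                  # 62 85 NA NA 47
```

Assigning NA through a logical condition leaves the length of the vector unchanged, so the positions of the remaining readings are preserved.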

Detecting missing values in vectors

The is.na function indicates whether a given element of a vector is NA (in which case TRUE is returned) or not (in which case FALSE is returned). Let's take a look at the following examples:

> x = c(2,5,1,0)
> x[2] = NA
> x
[1]  2 NA  1  0
> is.na(x)
[1] FALSE  TRUE FALSE FALSE

At times, it is more convenient to check which values in a vector are not NA, rather than which are. To do this, we can use the ! operator, which we encountered in the previous chapter, to negate the resulting logical vector:

> !is.na(x)
[1]  TRUE FALSE  TRUE  TRUE

For example, if we would like to have a subset of only the non-missing elements in x, we can type the following code:

> x[!is.na(x)]
[1] 2 1 0
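Because is.na returns a logical vector, it can also be combined with other base R functions to summarize the missing values. The following is a small sketch, assuming the same vector x as above:

```r
x = c(2, NA, 1, 0)

# TRUE counts as 1, so the sum gives the number of missing values
n_missing = sum(is.na(x))
n_missing              # 1

# which() returns the positions of the missing values
missing_at = which(is.na(x))
missing_at             # 2

# anyNA() checks whether at least one value is missing
has_missing = anyNA(x)
has_missing            # TRUE
```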

Performing calculations on vectors with missing values

Continuing the previous example, the mean of the non-missing elements in x can be computed if we subset only the non-missing values:

> mean(x[!is.na(x)])
[1] 1

To save us from manually removing missing values from a vector prior to such calculations, many functions that require all values to be non-missing (such as mean, min, and max) have a parameter called na.rm, which indicates whether the missing values should be removed before the calculation is executed. The default for this parameter is FALSE (which means that the NA values are not removed); if we would like the opposite, we need to specify na.rm=TRUE:

> x = c(3,8,2,NA,1,7,5,NA,9)
> mean(x)
[1] NA
> mean(x, na.rm = TRUE)
[1] 5
> max(x)
[1] NA
> max(x, na.rm = TRUE)
[1] 9
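The na.rm parameter is available in other summary functions as well, such as sum and range. The sketch below reuses the same vector and also shows an edge case worth keeping in mind: when every value is missing, nothing is left after removal, so mean returns NaN. The vector all_missing is a made-up example:

```r
x = c(3, 8, 2, NA, 1, 7, 5, NA, 9)

# Sum and range of the seven non-missing values
total = sum(x, na.rm = TRUE)
total                  # 35

value_range = range(x, na.rm = TRUE)
value_range            # 1 9

# If all values are missing, no values remain to average
all_missing = c(NA_real_, NA_real_)
empty_mean = mean(all_missing, na.rm = TRUE)
empty_mean             # NaN
```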