Application – Outlier detection

You might remember that at the beginning of the chapter, we noticed in the stacked bar plot that in our sample of 1,000 roulette spins, the zero was drawn about twice as often as we would expect. We only mentioned it then, as we had no point of comparison. We now have proportions from 100 samples and can examine this a little further. The proportion of zeros can be obtained from the data we already have: for each sample, we simply subtract the sum of the proportions of red and black numbers from 1. So let's do this, add the attribute to the data frame, and get the mean value of this proportion:

samples$isZero = 1-(samples$isRed+samples$isBlack)
Mean = mean(samples$isZero)
Mean

The mean value is 0.0277. The value we would expect is 1/37, which is approximately 0.0270. The average proportion of zeros across our 100 samples is therefore almost identical to the expected value. This in no way means that there are no outliers: a mean can sit close to the expected value even if some individual samples deviate strongly from it.
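As a point of comparison, here is a minimal sketch of the same computation on simulated data. It assumes a European wheel (18 red, 18 black, and 1 zero pocket) and rebuilds a hypothetical samples data frame with rmultinom(); the seed and the simulated values are illustrative only and will not reproduce the 0.0277 reported above.

```r
# Hypothetical stand-in for the chapter's data: 100 samples of 1,000 spins
set.seed(42)  # arbitrary seed, for reproducibility only

# Each column holds the counts of red, black, and zero for one sample
counts <- rmultinom(100, size = 1000, prob = c(18, 18, 1) / 37)

samples <- data.frame(isRed   = counts[1, ] / 1000,
                      isBlack = counts[2, ] / 1000)

# Proportion of zeros: whatever is neither red nor black
samples$isZero <- 1 - (samples$isRed + samples$isBlack)

expected <- 1 / 37              # theoretical proportion of zeros
Mean <- mean(samples$isZero)    # observed mean over the 100 samples

# Absolute gap between the observed mean and the expected value
abs(Mean - expected)
```

With 100 samples of this size, the gap should be small, which is consistent with what we observed in our own data.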

There are several ways to detect outliers. For multivariate data, the Mahalanobis distance or leverage points can be used. As these do not rely on visualization, we will not discuss them here. The interested reader can refer to the paper Unmasking Multivariate Outliers and Leverage Points by Rousseeuw and van Zomeren.

Our problem here is univariate, and a simple visualization technique is enough for the current purpose. A simple and classic approach is to see how many of the values (here, proportions of zeros) fall outside the mean plus or minus 3 standard deviations. So let's start by computing the thresholds (lines 1 and 2). After resetting the plotting layout to a single panel (line 3), we plot the proportions of zeros from all our samples (lines 4 and 5). Notice that we use the ylim attribute (line 5) to specify that we want the vertical boundaries of our graph to include all possible values (a proportion can range from 0 to 1). We then add two horizontal lines marking 3 standard deviations above and below the mean (lines 6 and 7):

1    upper = Mean+(3*sd(samples$isZero))
2    lower = Mean-(3*sd(samples$isZero))
3    par(mfrow=c(1,1))
4    plot(samples$isZero, main = "Proportion of zeros", 
5       xlab = "sample", ylab= "", ylim = c(0,1))
6    abline(h=upper)
7    abline(h=lower)

Finding extreme proportions of zeros visually

We can see that only one value lies above the upper threshold. All other values are therefore not considered outliers, as they fit in the range of the mean plus or minus 3 standard deviations. We can also notice that the lower threshold is below 0, which is not a possible value for a proportion.
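The same check can also be done without a plot. The following is a minimal sketch on simulated data: the isZero vector is a hypothetical stand-in for samples$isZero, and clamping the lower threshold at 0 with max() is our own suggestion for handling a negative threshold, not something done in the listing above.

```r
# Hypothetical stand-in for samples$isZero: 100 samples of 1,000 spins
set.seed(42)  # arbitrary seed, for reproducibility only
isZero <- rbinom(100, size = 1000, prob = 1 / 37) / 1000

Mean <- mean(isZero)
upper <- Mean + 3 * sd(isZero)

# A proportion cannot be negative, so we clamp the lower threshold at 0
lower <- max(0, Mean - 3 * sd(isZero))

# Indices of the samples falling outside the thresholds
outliers <- which(isZero > upper | isZero < lower)

length(outliers)    # how many samples are flagged
isZero[outliers]    # their proportions of zeros
```

The clamp only changes the result when the computed lower threshold is negative, as it is in our data; otherwise lower is left as computed.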
