Identifying outlier fares with anomaly detection techniques

There are various rigorous definitions of outliers, but for our purposes, an outlier is any extreme value that is far from the other observations in the dataset. There are numerous techniques, both parametric and non-parametric, that are used to identify outliers; example algorithms include density-based spatial clustering of applications with noise (DBSCAN), isolation forests, and Grubbs' Test. Typically, the type of data determines the type of algorithm that is used. For example, some algorithms do better on multivariate data than univariate data. Here, we are dealing with univariate time-series data, so we'll want to choose an algorithm that handles that well.

If you aren't familiar with the term time series, it simply means data that is recorded at regular intervals, such as the daily closing price of a stock.

The algorithm that we are going to use for our data is called the Generalized Extreme Studentized Deviate (Generalized ESD) test for outliers. This algorithm is well suited to our data, which is univariate and approximately normal.

There are several tests we can use to ensure that our data is approximately normally distributed, but we can also visually inspect our data for normality using a normal probability plot. We'll do that now for Moscow city data using some functionality from the SciPy library:

from scipy import stats 
import matplotlib.pyplot as plt 
 
fig, ax = plt.subplots(figsize=(10,6)) 
stats.probplot(list(city_dict.values()), plot=plt) 
plt.show() 

The preceding code generates the following output:

When assessing a normal probability or quantile-quantile (Q-Q) plot, we are looking for the data to lie as close to the straight line as possible; that supports normality. Data that veers off in one direction or another, or that traces a strong S shape, argues against normality. Here, we have a fairly low number of data points, and those that we do have are fairly balanced around the diagonal. If we had more data, it is likely that we would more closely approximate the diagonal. This should work well enough for our purposes.
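Beyond the visual check, one of the formal normality tests mentioned earlier can be run in a single line. The following is a minimal sketch using SciPy's Shapiro-Wilk test; the fare values here are made up purely for illustration, not taken from our scraped data:

```python
from scipy import stats

# hypothetical fares: one list that looks roughly normal,
# and the same list with a gross outlier appended
plausible_fares = [880, 905, 910, 895, 920, 890, 915, 900, 885, 925]
fares_with_outlier = plausible_fares + [250]

# Shapiro-Wilk: the null hypothesis is that the data are normally distributed
stat_ok, p_ok = stats.shapiro(plausible_fares)
stat_bad, p_bad = stats.shapiro(fares_with_outlier)

# a small p-value (e.g. < 0.05) suggests rejecting normality
print('clean fares:  p = %.4f' % p_ok)
print('with outlier: p = %.4f' % p_bad)
```

A single extreme value is usually enough to drive the p-value well below 0.05, which is exactly the kind of departure from normality the Q-Q plot would also reveal.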

We'll now move on to our outlier detection code. We are going to be utilizing another library for this called PyAstronomy. If you don't have it, it can easily be pip installed.
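If needed, it can be installed from PyPI:

```shell
pip install PyAstronomy
```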

Let's look at the code:

from PyAstronomy import pyasl 
 
# run the generalized ESD test: at most 3 outliers, alpha = 0.025
r = pyasl.generalizedESD(prices, 3, 0.025, fullOutput=True) 
 
print('Total Outliers:', r[0]) 
 
# map the outlier indices back to their dates and fares
out_dates = {} 
for i in sorted(r[1]): 
    out_dates.update({list(dates)[i]: list(prices)[i]}) 
 
print('Outlier Dates', out_dates.keys(), '\n') 
print('     R         Lambda') 
 
# R is the test statistic at each step; Lambda is its critical value
for i in range(len(r[2])): 
    print('%2d  %8.5f  %8.5f' % ((i+1), r[2][i], r[3][i])) 
 
fig, ax = plt.subplots(figsize=(10,6)) 
plt.scatter(dates, prices, color='black', s=50) 
ax.set_xticklabels(dates, rotation=-70); 
 
# mark the outlier fares in red (index into the dates, not the raw index)
for i in range(r[0]): 
    plt.plot(list(dates)[r[1][i]], list(prices)[r[1][i]], 'rp') 

Let's discuss what the preceding code does. The first line is simply our import. Following that, we run the generalized ESD algorithm. The parameters are our fare prices; the maximum number of outliers to test for (here, we chose 3); the significance level (alpha, at 0.025); and, finally, a Boolean specifying that we want the full output. With respect to the significance level, the lower the value, the less sensitive the algorithm is, and the fewer false positives it will generate.

The next few lines print out the R and Lambda values. These are used to determine whether a data point is an outlier.
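To make those R and Lambda columns concrete, here is a minimal, self-contained sketch of the generalized ESD procedure following the standard Rosner formulation; this is an illustration of the technique, not PyAstronomy's actual implementation. At each step i, the largest studentized deviation R_i is compared against a critical value λ_i derived from the t distribution, and the number of outliers is the largest i for which R_i exceeds λ_i:

```python
import numpy as np
from scipy import stats

def generalized_esd(x, max_outliers, alpha=0.05):
    """Generalized ESD test (Rosner) for up to max_outliers outliers.

    Returns the indices of the detected outliers, in removal order.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    remaining = list(range(n))
    candidates = []       # indices removed, most extreme first
    num_outliers = 0
    for i in range(1, max_outliers + 1):
        vals = x[remaining]
        mean, sd = vals.mean(), vals.std(ddof=1)
        # R_i: largest studentized deviation from the mean
        dev = np.abs(vals - mean) / sd
        j = int(np.argmax(dev))
        R = dev[j]
        candidates.append(remaining.pop(j))
        # lambda_i: critical value from the t distribution
        p = 1 - alpha / (2 * (n - i + 1))
        t = stats.t.ppf(p, n - i - 1)
        lam = (n - i) * t / np.sqrt((n - i - 1 + t**2) * (n - i + 1))
        if R > lam:
            num_outliers = i
    return candidates[:num_outliers]

# a single extreme fare among otherwise modest values is flagged
print(generalized_esd([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], 3, 0.05))
```

Note that the test always computes all max_outliers R/λ pairs and then takes the largest i where R_i > λ_i, which is why the printed table above shows a row for each candidate even when fewer (or no) outliers are declared.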

Finally, the remainder of the code is simply for generating the scatter plot and coloring those fares that are outliers red.

The preceding code generates the following output:

Again, this data is for Moscow. Make sure your city_key variable is set accordingly so that you are working with the same data. Notice that despite all the variation, there are no outliers in the data.

Now, let's run it for Milan as well. We'll go back up and change our city_key variable and run the cells below that to update everything, as demonstrated in the following diagram:

Notice that this time, we have three outliers, and these are fares that are under $600 when the mean fare looks to be over $900, so this looks like a win for us.

Let's try another city. This time, we'll look at Athens by updating the city_key variable and running the subsequent cells:

Notice that again, we have three outliers, but this time, they are extreme fares to the upside. Since we are only interested in getting alerts for cheap fares, we can build in a mechanism that only alerts us when the outlier fare is below the mean fare.

We'll now create some code to handle this element:

# mean fare for the currently selected city
city_mean = np.mean(list(city_dict.values())) 
 
# alert only on outliers that are cheaper than the mean fare
for k, v in out_dates.items(): 
    if v < city_mean: 
        print('Alert for', city_key + '!') 
        print('Fare: $' + str(v), 'on', k) 
        print('\n') 

When we run the code for Athens, it will generate no output. When run for Milan, it generates the following output:

So, now, we have created a system to scrape the data, parse it, and identify the outliers. Let's move on and create a fully-fledged application that can alert us in real time.

Keep in mind that we just did a very preliminary analysis on our outlier detection model. In the real world, it would likely take a much more thorough series of tests to identify whether we had selected workable parameters for our model.