© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
K. Feasel, Finding Ghosts in Your Data, https://doi.org/10.1007/978-1-4842-8870-2_4

4. Laying Out the Framework

Kevin Feasel, Durham, NC, USA

To this point, we have focused entirely on the theoretical aspects of outlier and anomaly detection. We will still need to delve into theory on several other occasions in later chapters, but we have enough to get started on developing a proper anomaly detection service.

In this chapter, we will build the scaffolding for a real-time outlier detection service. Then, in the next chapter, we will integrate a testing library to reduce the risk of breaking our code as we iterate on techniques.

Tools of the Trade

Before we write a line of code, we will need to make several important decisions around programming languages, which packages we want to use to make the process easier, and even how we want to interact with end users and other programs. This last choice includes decisions around protocols and how we wish to serialize and deserialize our data to ensure that calling our service is as easy as possible while still allowing us to deliver on our promises.

Choosing a Programming Language

The first major choice involves picking a programming language. For this book, we will use the Python programming language to work with our service. Python is an extremely popular programming language, both as a general-purpose language and especially in the fields of data science and machine learning. In particular, we will use the Anaconda distribution of Python, as it comes with a variety of useful libraries for our anomaly detection project. This includes pandas, which offers table-like data frames; numpy, a library that provides a variety of mathematical and statistical functions; and scikit-learn, a machine learning library for Python. In addition to these libraries, we will introduce and use several more throughout the course of this book.

Python is, of course, not the only programming language that you can use to build an anomaly detection service. Several other languages have built-in functionality or easy-to-install libraries to make anomaly detection a fairly straightforward process. For example, the R programming language includes a variety of anomaly detection packages, including the excellent anomalydetector library. R is an outstanding domain-specific language for data science, data analysis, and statistics. Although it is not necessarily a great general-purpose programming language, what we want is well within its capabilities, and the author and technical editor have previously combined to build a fully featured anomaly detection service in R as well.

If you wish to stick to general-purpose programming languages, the .NET Framework has two such languages that would also be quite suitable for an anomaly detection engine: the object-oriented C# and functional F# languages both can work with a variety of packages to make statistical analysis easier. The Math.NET project, for example, includes several packages intended for solving problems around combinatorics, numerical analysis, and statistical analysis.

In short, we will use Python because it is both a good choice and a popular choice for this sort of development, but do not feel obligated to use only Python for this task.

Making Plumbing Choices

Now that we have our choice of language covered, we should make a few decisions about how our end users will interact with the service. The first question we should ask is, why do our customers need this service? The more we understand what kinds of problems they want to solve, the better we can tailor our service to fit their needs. For our scenario, let’s suppose that our end users wish to call our service and get a response back in near real time, rather than sending in large datasets and requesting batch operations over the entire dataset. The techniques we will use throughout the book will work just as well for both situations, but how we expose our service to the outside world will differ dramatically as a result of this decision.

Given that we wish to allow end users to interact in a real-time scenario, we will likely wish to use HyperText Transfer Protocol (HTTP) as our transfer mechanism of choice. We could alternatively design our own solution using Transmission Control Protocol (TCP) sockets, but this would add a significant amount of development overhead to creating a solution and would require that our users develop custom clients to interact with our service. Developing your own custom TCP solution could result in better performance but is well outside the scope of this book.

Sticking with HTTP as our protocol of choice, we now have two primary options for how we create services, both of which have a significant amount of support in Python: the Remote Procedure Call framework gRPC or building a Representational State Transfer (REST) API using JavaScript Object Notation (JSON) to pass data back and forth between our service and the client. The gRPC-based solution has some significant advantages, starting with payload size. Payloads in gRPC are in the Protocol Buffer (Protobuf) format, a binary format that compacts down request sizes pretty well. By contrast, JSON is a relatively large, uncompressed text format. JSON is more compact than other formats like Extensible Markup Language (XML), but there can be a significant difference in payload size between Protobuf and JSON, especially in batch processing scenarios. We also have strict contracts when working with gRPC, meaning that clients know exactly what the server will provide, what parameters and flags exist, and how to form requests. This can be a significant advantage over earlier implementations of REST APIs, though more recent REST APIs implement the OpenAPI specification (formerly known as Swagger), which describes REST APIs. The OpenAPI specification provides similar information to what gRPC describes in its contracts, making cross-service development significantly easier.

For the purposes of this book, we will choose to work with a REST API using JSON. There are three reasons for this choice. First, working with REST allows us to use freely available tools to interact with our service, including web browsers. The gRPC framework has very limited support for browser-based interaction, whereas REST works natively with browsers. The second reason for choosing REST over gRPC is that a REST API with JSON support provides us with human-readable requests, making it easier for humans to parse and interpret requests and correct potential bugs in the client or the service. There are techniques to translate Protobuf requests to JSON, but for educational purposes, we will stick with the slower and easier-to-understand method. The final reason we will use REST APIs over gRPC is that more developers are familiar with the former than the latter. The move toward microservice-based development has made gRPC a popular framework, but we in the industry have several decades' worth of experience with REST and have sorted out most of its problems. If you are developing an anomaly detection suite for production release, however, gRPC would be a great choice, especially if you develop the client or your callers are familiar with the framework.

Reducing Architectural Variables

Now that we have landed on some of the “plumbing” decisions, we can look for other ways to simplify the development process. For example, considering that we intend to implement a REST API for callers, there are several Python-based frameworks to make API development easy, including Flask and FastAPI. Flask is a venerable package for core API development and integrates well with proxy solutions like Gunicorn. FastAPI, meanwhile, is a newer solution that has a lot going for it, including great performance and automatic implementation of the OpenAPI specification using a tool called ReDoc. Figure 4-1 shows an example of documentation built from the API we will develop in this chapter.

A screenshot of the FastAPI 0.1.0 documentation page, showing the POST univariate endpoint and several options.

Figure 4-1

Automatic API documentation for our API

Beyond this, we can also use Docker containers to deploy the application. That way, no matter which operating system you are running or what libraries you have installed, you can easily deploy the end solution. Using the container version of the solution we build is optional, but if you do have Docker installed and want to simplify the dependency management process, there are notes for Docker-based deployment at the end of the chapter.

Developing an Initial Framework

After deciding on languages and technologies, it’s time to begin development. In this section, we will install prerequisites for hosting the API. Then, we will create a stub API. In the next section, we will begin filling in the details.

Battlespace Preparation

You will need to have Python 3.5 or later installed on your computer if you wish to follow along and create your own outlier detection API service. The easiest way to get started is to install the Anaconda distribution of Python at https://anaconda.com. From there, install the Anaconda Individual Edition, which is available for Windows, macOS, and Linux. Anaconda also comes with a variety of Python libraries preinstalled, making it a great option for data science and machine learning operations.

Next, if you have not already, grab the code repository for this book at https://github.com/Apress/finding-ghosts-in-your-data and follow the instructions in src/README.md, depending on whether you wish to follow along with the code in this book and create your own outlier detector or if you wish to use the completed version of the code base. Note that if you wish to follow along with the code in this book, you should still reference the completed version of the code, as certain segments of code will be elided for the purpose of saving space.

The repository includes several folders. The doc folder includes basic documentation about the project, including how to run different parts of the project. The src/app folder includes the completed version of the code base we will work throughout the book to create. The src/comp folder includes comparison work we will cover in the final chapter of this book. The src/web folder includes a companion website we will create and update throughout the book. Leaving the src directory altogether, the test folder features two separate sets of tests: one set of unit tests and one set of integration tests. Finally, we have a Dockerfile for people who wish to run the solution as a Docker container, as well as a requirements.txt file.

After installing Anaconda (or some similar variant of Python), open the requirements.txt file in the code repository. This file contains a set of packages, one per line, which we will use throughout the course of this book. You can install these packages manually using the pip packaging system, or you can open a shell in the base of the repository and run pip install -r requirements.txt.

Tip

If you receive an error message indicating that pip cannot be found, make sure that your Path environment variable includes Anaconda and its directories. For example, if you installed Anaconda at E:\Anaconda3, add that directory as well as E:\Anaconda3\Library\bin, E:\Anaconda3\Library\usr\bin, and E:\Anaconda3\Scripts to your path. Then, restart your shell and try again.

If you are using Anaconda, you will want to ensure that you are working under a new conda environment. You can create an environment in conda with the command conda create --name finding_ghosts inside the main directory of the code repository. Then, activate the environment with conda activate finding_ghosts. From there, you can run pip commands without affecting other installations of packages.

Framing the API

Now that we have everything installed, let’s build an API. If you have retrieved the source code from the accompanying repository, be sure to rename the existing app folder to something like app_complete. Then, create a new app folder. Inside this folder, create an empty Python file named __init__.py, ensuring that you have two underscores before and after the word “init.” This file helps by letting the Python interpreter know that our app folder contains code for a Python module. After creating this file, create another file called main.py. This file will be the entry point for our API. Open this new file with the editor of your choice. If you do not have an editor of choice, three good options are Visual Studio Code, Wing IDE, and PyCharm. Anaconda also comes with Spyder, another integrated development environment (IDE) for Python.

At the top of main.py, enter the code from Listing 4-1. This will ensure that we can reference the projects and libraries we will need for our API.
from fastapi import FastAPI
from pydantic import BaseModel
from typing import Optional, List
import pandas as pd
import json
import datetime
Listing 4-1

Import statements in the main.py file

The first three entries relate to the FastAPI library, allowing us to set up an API server, create models to shape our input and output signatures, and tag specific model elements as optional or as lists of items coming in. The next three entries relate to additional Python libraries that will be helpful. Pandas is a library for data analysis, and one of its most useful features is the ability to create Pandas DataFrames. DataFrames behave similarly to their counterparts in other languages and to tables in SQL, in that they are two-dimensional data structures with named columns, where each column may have a different type. In addition to Pandas, we import the JSON and datetime libraries to make their respective functionality available.

Now that we have the import statements in place, the next step is to create a new FastAPI application and try it out. Listing 4-2 provides the minimal amount of code needed to build a FastAPI service.
app = FastAPI()
@app.get("/")
def doc():
  return {
    "message": "Welcome to the anomaly detector service, based on the book Finding Ghosts in Your Data!",
    "documentation": "If you want to see the OpenAPI specification, navigate to the /redoc/ path on this server."
  }
Listing 4-2

Create a new API with one endpoint

In this block of code, we first create a new API service called app. We then create a new endpoint for our application at the root directory. This endpoint accepts the GET method in REST. The GET method does not take a request body, and it is expected to return something. When navigating to web pages or clicking links using a browser, the browser translates these statements to GET operations.

After defining the endpoint, we need to write out what our code should do if someone calls it. Here, we return a message, which must be valid JSON. We create a JSON object with two attributes: message and documentation. The documentation attribute points us toward a path /redoc/, meaning that if you are running this on your local machine, you would navigate to http://127.0.0.1/redoc to see an OpenAPI specification similar to that in Figure 4-1.

In order to see this listing, navigate to the /src/ folder in a shell and enter the following command:
uvicorn app.main:app --host 0.0.0.0 --port 80
This command executes the uvicorn web server. It will then look for a folder called app and a file called main.py. Inside that file, uvicorn will find our reference to app and make the API available. We will run this on the local host on port 80, although you can change the host and port as necessary. Navigating to http://localhost will return a JSON snippet similar to that in Figure 4-2.

A screenshot of the JSON snippet returned by the service, showing the message and documentation attributes.

Figure 4-2

The JSON message we get upon performing a GET operation on the / endpoint of the anomaly detector service. Some browsers, like Firefox, will generate an aesthetically pleasing JSON result like you see here; others will simply lay out the JSON as text on the screen

Now that we have a functional API service, the next step is to lay out the input and output signatures we will need throughout the book.

Input and Output Signatures

Each API endpoint will need its own specific input and output structure. In this section, we will lay out the primary endpoints, define their signatures, and create stub methods for each. Stub methods allow us to define the structure of the API without committing to all of the implementation details. They will also allow us to tackle one problem at a time during development while still making clear progress.

In all, we will have four methods, each of which will correspond to several chapters in the book. The first method will detect outliers in univariate data using statistical techniques, the project of Part II. The second method will cover multivariate anomaly detection using both clustering and nonclustering techniques, which is Part III. The third method will allow us to perform time series analysis on a single stream of data, the first goal of Part IV. The final method will let us detect anomalies in multiple time series datasets, which we will cover in the second half of Part IV.

Defining a Common Signature

At this point, it makes sense to frame out one of our stub methods and see if we can design something that works reasonably well across all of our inputs. We will start with the simplest case: univariate, unordered, numeric, non-time series data. Each change from there will introduce its own complexities, but there should be a fairly limited modification at the API level, leaving most of that complexity to deeper parts of the solution.

All of the API calls we will need to make will follow a common signature, as defined in Listing 4-3.
@app.post("/detect/univariate")
def post_univariate(
  input_data: List[<input class>],
  debug: bool = False,
  <other inputs>
):
  df = pd.DataFrame(i.__dict__ for i in input_data)
  (df, ...) = univariate.detect_univariate_statistical(df, ...)
  results = { "anomalies": json.loads(df.to_json(orient='records')) }
  if (debug):
    # TODO: add debug data
    results.update({ "debug_msg": "This is a logging message." })
  return results
Listing 4-3

The shell of a common solution

We first need to define that we will make a POST call to the service. This method, unlike GET, accepts a request body, which allows us to pass in a dataset in JSON format. The decorator also defines the endpoint we need to call: in this case, the /detect/univariate endpoint, which will trigger a call to the post_univariate() method. This method will look very similar for each of the four cases.

The first thing we will do is create a Pandas DataFrame from our JSON input. Putting the data into a DataFrame will make it much easier for us to operate on our input data, and many of the analysis techniques we will use throughout the book rely on data being in DataFrames.

After creating a DataFrame, we will call a detection method. In this case, the function is called detect_univariate_statistical() and will return a new DataFrame that includes the original data, along with an additional attribute describing whether that result is anomalous. We can then reshape this data as JSON, which is important because the outside world runs on JSON, not Pandas DataFrames.
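To make the reshaping step concrete, here is a small illustrative sketch (not part of the book's code base) showing how a DataFrame carrying the output columns from our stub serializes back into plain, JSON-friendly Python structures:

import pandas as pd
import json

# A toy result set with the columns our detection functions will return.
df = pd.DataFrame([
  { "key": "a", "value": 10.5 },
  { "key": "b", "value": 98.6 }
])
df_out = df.assign(is_anomaly=False, anomaly_score=0.0)
# orient='records' turns each row into its own JSON object.
results = { "anomalies": json.loads(df_out.to_json(orient='records')) }
# results is now an ordinary Python dict, for example:
# {'anomalies': [{'key': 'a', 'value': 10.5, 'is_anomaly': False, 'anomaly_score': 0.0}, ...]}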

We’ll also include a debug section, which will help during development and include troubleshooting information. For now, we will leave this as a stub with some basic details, but as we flesh out exactly how the processes work, we will have a better understanding of what we can and should include here. These debug results will be in the same output JSON as our set of anomalies, meaning that our caller will not need any additional software or capabilities to read debug information.

Finally, we return the results JSON object. This same basic layout will serve us well for the other three methods.

Defining an Outlier

We now have defined most of the base methods for detecting outliers, but there are a couple of blocks we still need to fill out, particularly around inputs. As we develop methods, we will want to offer users some level of control concerning what constitutes an outlier. It is important to note, however, that we do not expect users to know the exact criteria that would define an outlier—if they did, they wouldn’t need an outlier detector in the first place! There are, however, a couple of measures we can expect a user to control.

Sensitivity and Fraction of Anomalies

The first key measure is sensitivity. In this case, we do not mean sensitivity as a technical definition but rather a sliding scale. Ideally, callers will be able to control—without knowing exact details about thresholds—the likelihood that our outlier detector will flag something as an outlier. We will have the sensitivity score range from 1 to 100 inclusive, where 1 is least sensitive and 100 is most sensitive. How, exactly, we implement this will depend on the technique, but this gives callers a simple mechanism for the purpose.

Additionally, we will need to control the maximum fraction of anomalies we would expect to see in the dataset. Some outlier detection techniques expect an input that includes the maximum number of items that could be anomalous. Rather than trying to estimate this for our users, we can ask directly for this answer. This will range from 0.0 to 1.0 inclusive, where 1.0 means that every data point could potentially be an outlier. In practice, we would expect values ranging between 0.05 and 0.10, meaning no more than 5-10% of records are outliers. This fraction can also give us a hard limit on how many items we mark as outliers, meaning that if a caller sends in 100 data points and a max fraction of anomalies of 0.1, we guarantee we will return no more than ten outliers. This will work in conjunction with the sensitivity score: think of sensitivity score as a sliding scale and max fraction of anomalies as a hard cap.

Single Solution

If, for a given class of problem, there is a single best technique, we can use that technique to the exclusion of any other. In that case, calculations are fairly straightforward: we apply the input data to the given algorithm, sending in the sensitivity score and max fraction of anomalies inputs if the technique calls for either (or both). We get back the set of outlier items and then need to determine whether we can simply send back this result set or if we need to perform additional work. For cases in which the technique accepts a sensitivity score or max fraction of anomalies, our work is probably complete at that point. Otherwise, we will need to develop a way to apply the sensitivity score and then cut off any items beyond the max fraction of anomalies. This works because each technique assigns a score to each data point, and we can order these scores such that higher scores are more likely to be outliers. With these scores, even if the technique we use has no concept of sensitivity or a cutoff point, we can implement that ourselves.
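As a rough sketch of that idea (purely illustrative, not the exact approach any later chapter will use), applying the sensitivity score and the maximum fraction of anomalies to a set of scored results might look something like the following, assuming each row already carries an anomaly_score column where higher values are more likely to be outliers:

def apply_cutoffs(df, sensitivity_score, max_fraction_anomalies):
  # Translate the 1-100 sensitivity score into a 0-1 score threshold:
  # higher sensitivity means a lower threshold and therefore more outliers.
  score_threshold = 1.0 - (sensitivity_score / 100.0)
  # Hard cap: never mark more than this many records as outliers.
  max_outliers = int(len(df) * max_fraction_anomalies)
  # Keep the highest-scoring records above the threshold, up to the cap.
  candidates = df[df['anomaly_score'] > score_threshold]
  flagged = candidates.nlargest(max_outliers, 'anomaly_score')
  return df.assign(is_anomaly=df.index.isin(flagged.index))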

Combined Arms

The more difficult scenario is when we need to create ensemble models, that is, models with several input algorithms. Here, we will need a bit more control over which items get marked as outliers, as each technique in the ensemble will have its own opinion of the outlier-worthiness of a given data point. We will need to agglomerate the results and perform the final scoring and cutoffs ourselves.

We will also want to provide weights for each input algorithm. Some techniques are more adept at finding outliers than others, and so we will want to weigh them more heavily. But there is a lot of value in incorporating a variety of input strategies, as no single technique is going to be perfect at finding all anomalies and ignoring all noise. For this reason, even relatively noisier techniques can still provide value, especially when there are several of them. Our hope is that the noise “cancels out” between the techniques, leaving us with more anomalies and (ideally!) less noise.

In the simplest case, we can weight each of the techniques equally, which is the same as saying that we let each technique vote once and count the number of votes to determine whether a particular data point is an outlier or not. With larger numbers of techniques, this may itself be a valid criterion for sensitivity score—suppose we have 20 separate tests. We could give each test a weight score of 5, giving us 5 * 20 = 100 as our highest score. This aligns quite nicely with sensitivity score, so if the caller sends in a sensitivity score of 75, that means a particular data point must be an outlier for at least 15 of the 20 tests in order to appear on our final list.

With differential weighting, the math is fundamentally the same, but it does get a little more complicated. Instead of assigning 5 points per triggered test, we might assign some tests at a value of 10 and others 2 based on perceived accuracy. The end result is that we still expect the total score to add up to 100, as that lets us keep the score in alignment with our sensitivity score.
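As a minimal sketch of this weighted voting idea (the technique names and weights here are hypothetical, chosen only so that they total 100), the final decision for a single data point might look like:

# Hypothetical per-technique weights that sum to 100, keeping the combined
# score in alignment with the 1-100 sensitivity score.
weights = { "test_a": 45, "test_b": 35, "test_c": 20 }

def is_outlier(votes, sensitivity_score):
  # votes maps each technique name to True if that technique flagged the point.
  total = sum(weights[t] for t, flagged in votes.items() if flagged)
  # The point makes the final list only if its weighted total meets the sensitivity score.
  return total >= sensitivity_score

# Two of the three techniques flag the point, for a score of 80, which
# counts as an outlier at a sensitivity score of 75.
print(is_outlier({ "test_a": True, "test_b": True, "test_c": False }, 75))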

Regardless of how we weight the techniques, we will want to send back information on how we perform this weighting. That way, we will be able to debug techniques and get a better feeling for whether our sensitivity score is working as expected.

Framing the Solution

With the context of the prior section in mind, let's revisit the code in Listing 4-3 and expand the solution to include everything we need. As far as inputs go, we will need to define input_data as the correct kind of list. FastAPI uses Pydantic's BaseModel concept, which allows us to easily interpret the JSON our caller passes in and convert it into an object that Python understands. For univariate statistical input, we will need a single value column. It would also be proper to add a key column. That way, if the caller wishes to assign specific meaning to a particular value, they can use the key column to do so. We will not use the key as part of data analysis but will retain and return it to the user.

The detect_univariate_statistical() function should take in the sensitivity score and max fraction of anomalies and output the weights assigned for particular models, as well as any other details that might make sense to include for debugging.

Listing 4-4 shows an updated version of the univariate detection function with the addition of sensitivity score and the maximum fraction of anomalies as inputs, as well as weights and model details as outputs.
class Univariate_Statistical_Input(BaseModel):
  key: str
  value: float
@app.post("/detect/univariate")
def post_univariate(
  input_data: List[Univariate_Statistical_Input],
  sensitivity_score: float = 50,
  max_fraction_anomalies: float = 1.0,
  debug: bool = False
):
  df = pd.DataFrame(i.__dict__ for i in input_data)
  (df, weights, details) = univariate.detect_univariate_statistical(df, sensitivity_score, max_fraction_anomalies)
  results = { "anomalies": json.loads(df.to_json(orient='records')) }
  if (debug):
    # TODO: add debug data
    # Weights, ensemble details, etc.
    results.update({ "debug_msg": "This is a logging message." })
    results.update({ "debug_weights": weights })
    results.update({ "debug_details": details })
  return results
Listing 4-4

The newly updated API call for univariate outlier detection
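As a quick illustration of how a caller might exercise this endpoint (a hedged sketch rather than the book's official test harness): with FastAPI's default behavior, the scalar parameters arrive as query string values while the list of Univariate_Statistical_Input items arrives as the JSON request body, so a Python client using the requests library could look like this:

import requests

payload = [
  { "key": "1", "value": 10.1 },
  { "key": "2", "value": 10.4 },
  { "key": "3", "value": 98.6 }
]
resp = requests.post(
  "http://localhost/detect/univariate",
  params={ "sensitivity_score": 75, "max_fraction_anomalies": 0.1, "debug": True },
  json=payload
)
print(resp.json())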

The other methods will have very similar method signatures, although each of the calls will require its own model. For multivariate outlier detection, we will replace the single float called value with a list of values. Single-input time series anomaly detection will bring back the singular value but will include a dt parameter for the date and time. Finally, multi-series time series anomaly detection will add the date and also a series_key value, which represents the specific time series to which a given data point belongs.
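To make that concrete, here are hedged sketches of what those input models might look like; aside from the dt and series_key attributes described above, the class and attribute names are illustrative guesses rather than the exact definitions in the completed code base:

class Multivariate_Input(BaseModel):
  key: str
  vals: List[float]

class Single_Time_Series_Input(BaseModel):
  key: str
  dt: datetime.datetime
  value: float

class Multi_Time_Series_Input(BaseModel):
  key: str
  series_key: str
  dt: datetime.datetime
  value: float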

After sketching out what the API should look like—and understanding that we are not yet complete, as we will need to incorporate debugging information as we design the process—we can then stub out each of the four detection methods. A stub or stub method is a technique for rapid development in which we hard-code the output signature of a function so that we can work on the operations that call this function before filling in all of the details. Creating stub methods allows us to solve the higher-level problem of creating our API and ensuring the API code itself works before trying to tackle the much tougher job of implementing outlier detection. Then, when we are ready to dive into each technique, we already have a shell of the code ready for us to use.

To implement our stub methods, we will create a models folder and then one file per detection method. Listing 4-5 shows what the stub method looks like for detect_univariate_statistical.
import pandas as pd
def detect_univariate_statistical(
  df,
  sensitivity_score,
  max_fraction_anomalies
):
  df_out = df.assign(is_anomaly=False, anomaly_score=0.0)
  return (df_out, [0,0,0], "No ensemble chosen.")
Listing 4-5

The stub method for detecting univariate outliers via the use of statistical methods. The output is hard-coded to return something that looks like a proper response but does not require us to implement the underlying code to generate correct results.

The other functions will look similar to this, except that there may be additional parameters based on the specific nature of the call—for example, multivariate outlier detection requires a parameter for the number of neighboring points we use for analysis. Once those are in place, we’ve done it: we have a functional API server and the contract we will provide to end users; if they pass in data in the right format and call the right API endpoint, we will detect outliers in that given dataset and return them to the end user. The last thing we should do in this chapter is to package this up to make it easier to deploy.
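Note that main.py will also need to import these detection modules before it can call them, with something along the lines of from app.models import univariate; the exact import path depends on how you lay out the models folder. As a hedged example of what one of the other stubs might look like, a multivariate version with an extra neighbor-count parameter (the parameter name here is illustrative) could be as simple as:

import pandas as pd
def detect_multivariate_statistical(
  df,
  sensitivity_score,
  max_fraction_anomalies,
  n_neighbors
):
  # Hard-coded output, mirroring the univariate stub above.
  df_out = df.assign(is_anomaly=False, anomaly_score=0.0)
  return (df_out, [0,0,0], "No ensemble chosen.")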

Containerizing the Solution

We have a solution and we can run it as is on our local machines, but suppose we wish to deploy this solution out to a service like Azure Application Services. How would we package up this code and make it available to run? One of the easiest ways of packaging up a solution is to containerize it. The big idea behind containers is that we wish to pull together code and any project dependencies into one single image. This image can then be deployed as an independent container anywhere that runs the appropriate software. I can package up the software from this API and make the image available to you. Then, you can take that image and deploy it in your own environment without needing to install anything else.

Installing and configuring containerization solutions like Docker is well outside the scope of this book. If you have or can install the appropriate software, read on; if not, you can run everything in this book outside of containers with no loss in understanding.

For Windows users, there is, at the time of this writing, one major product available for working with containers: Docker Desktop. Docker Desktop is free for personal, noncommercial use. For Mac users, Docker Desktop works, but there is also a free alternative called Lima. Linux users have a wide variety of containerd-based options, although I am partial to Moby. Regardless of the product you use, the Dockerfile will be the same. Listing 4-6 shows the Dockerfile we will use to start up a containerized version of our anomaly detector.
FROM python:3.9
WORKDIR /code
COPY ./requirements.txt /code/requirements.txt
RUN pip install --no-cache-dir --upgrade -r /code/requirements.txt
COPY ./src/app /code/app
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "80"]
Listing 4-6

The contents of the Dockerfile we will use in this book

Python version 3.9 is the latest version as of the time of writing, so we will use that. The rest of this is fairly straightforward: install Python packages from the requirements.txt file, move our application to its resting place, and run uvicorn.
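If you want to build and run the image yourself, the standard Docker commands apply; for example, from the root of the repository (the image name finding-ghosts is just illustrative):

docker build -t finding-ghosts .
docker run -p 80:80 finding-ghosts

The -p 80:80 flag maps port 80 inside the container, where uvicorn listens, to port 80 on the host.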

If you wish to use the containerized solution, you can find additional instructions in the readme for the GitHub repo associated with the book at https://github.com/Apress/finding-ghosts-in-your-data.

Conclusion

In this chapter, we started breaking ground on our anomaly detection solution. That includes deciding on the “plumbing” in terms of protocols and base applications, as well as creating the interface that our callers will use as they input data and expect a list of outliers back. In the next chapter, we will create an accompanying test project to ensure that as we change code throughout the book, everything continues to work as expected.
