Kevin Feasel

Finding Ghosts in Your Data

Anomaly Detection Techniques with Examples in Python

Kevin Feasel

DURHAM, NC, USA

ISBN 978-1-4842-8869-6

e-ISBN 978-1-4842-8870-2

https://doi.org/10.1007/978-1-4842-8870-2

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

The publisher, the authors, and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

This Apress imprint is published by the registered company APress Media, LLC, part of Springer Nature.

The registered company address is: 1 New York Plaza, New York, NY 10004, U.S.A.

To Mom and Dad, who know a thing or four about anomalies.

Introduction

Welcome to this book on anomaly detection! Over the course of this book, we are going to build an anomaly detection engine in Python. In order to do that, we must first answer the question, “What is an anomaly?” Such a question has a simple answer, but in providing the simple answer, we open the door to more questions, whose answers open yet more doors. This is the joy and curse of the academic world: we can always go a little bit further down the rabbit hole.

Before we start diving into rabbit holes, however, let’s level-set expectations. All of the code in this book will be in Python. This is certainly not the only language you can use for the purpose—my esteemed technical reviewer, another colleague, and I wrote an anomaly detection engine using a combination of C# and R, so nothing requires that we use Python. We do cover language and other design choices in the book, so I’ll spare you the rest here. As far as your comfort level with Python goes, the purpose of this book is not to teach you the language, so I will assume some familiarity with the language. I do, of course, provide context to the code we will write and will spend extra time on concepts that are less intuitive. Furthermore, all of the code we will use in the book is available in an accompanying GitHub repository at https://github.com/Apress/finding-ghosts-in-your-data.

My goal in this book is not just to write an anomaly detection engine; it is to straddle the line between the academic and development worlds. There is a rich literature on anomaly detection, but much of it is dense and steeped in formal logic. I want to bring you some of the best insights from that academic literature but present them in a way that makes sense for the large majority of developers. For this reason, each part in the book will have at least one chapter dedicated to theory. In addition, most of the code-writing chapters also start with the theory because it isn’t enough simply to type out a few commands or check a project’s readme for a sample method call; I want to help you understand why something is important, when an approach can work, and when the approach may fail. Furthermore, should you wish to take your own dive into the literature, the bibliography at the end of the book includes a variety of academic resources.

Before I sign off and we jump into the book, I want to give a special thank you to my colleague and technical editor, Ting Chou. I have the utmost respect for Ting’s skills, so much so that I tried to get her to coauthor the book with me! She did a lot to keep me on the right path and heavily influenced the final shape of this book, including certain choices of algorithms and parts of the tech stack that we will use. That said, any errors are, of course, mine and mine alone. Unfortunately.

If you have thoughts on the book or on anomaly detection, I’d love to hear from you. The easiest way to reach out is via email: [email protected]. In the meantime, I hope you enjoy the book.

Table of Contents

Part I: What Is an Anomaly?

Chapter 3: Formalizing Anomaly Detection

The Importance of Formalization

“I’ll Know It When I See It” Isn’t Enough

Human Fallibility

Marginal Outliers

The Limits of Visualization

The First Formal Tool: Univariate Analysis

Distributions and Histograms

The Normal Distribution

Mean, Variance, and Standard Deviation

Additional Distributions

Robustness and the Mean

The Susceptibility of Outliers

The Median and “Robust” Statistics

Beyond the Median: Calculating Percentiles

Control Charts

Conclusion

Part II: Building an Anomaly Detector

Chapter 4: Laying Out the Framework

Tools of the Trade

Choosing a Programming Language

Making Plumbing Choices

Reducing Architectural Variables

Developing an Initial Framework

Battlespace Preparation

Framing the API

Input and Output Signatures

Defining a Common Signature

Defining an Outlier

Sensitivity and Fraction of Anomalies

Single Solution

Combined Arms

Framing the Solution

Containerizing the Solution

Conclusion

Chapter 5: Building a Test Suite

Tools of the Trade

Unit Test Library

Integration Testing

Writing Testable Code

Keep Methods Separated

Emphasize Use Cases

Functional or Clean: Your Choice

Creating the Initial Tests

Unit Tests

Integration Tests

Conclusion

Chapter 6: Implementing the First Methods

A Motivating Example

Ensembling As a Technique

Sequential Ensembling

Independent Ensembling

Choosing Between Sequential and Independent Ensembling

Implementing the First Checks

Standard Deviations from the Mean

Median Absolute Deviations from the Median

Distance from the Interquartile Range

Completing the run_tests() Function

Building a Scoreboard

Weighting Results

Determining Outliers

Updating Tests

Updating Unit Tests

Updating Integration Tests

Conclusion

Chapter 7: Extending the Ensemble

Adding New Tests

Checking for Normality

Approaching Normality

A Framework for New Tests

Grubbs’ Test for Outliers

Generalized ESD Test for Outliers

Dixon’s Q Test

Calling the Tests

Updating Tests

Updating Unit Tests

Updating Integration Tests

Multi-peaked Data

A Hidden Assumption

The Solution: A Sneak Peek

Conclusion

Chapter 8: Visualize the Results

Building a Plan

What Do We Want to Show?

How Do We Want to Show It?

Developing a Visualization App

Getting Started with Streamlit

Building the Initial Screen

Displaying Results and Details

Conclusion

Part III: Multivariate Anomaly Detection

Chapter 9: Clustering and Anomalies

What Is Clustering?

Common Cluster Terminology

K-Means Clustering

K-Nearest Neighbors

When Clustering Makes Sense

Gaussian Mixture Modeling

Implementing a Univariate Version

Updating Tests

Common Problems with Clusters

Choosing the Correct Number of Clusters

Clustering Is Nondeterministic

Alternative Approaches

Tree-Based Approaches

The Problem with Trees

Conclusion

Chapter 10: Connectivity-Based Outlier Factor (COF)

Distance or Density?

Local Outlier Factor

Connectivity-Based Outlier Factor

Introducing Multivariate Support

Laying the Groundwork

Implementing COF

Test and Website Updates

Unit Test Updates

Integration Test Updates

Website Updates

Conclusion

Chapter 11: Local Correlation Integral (LOCI)

Local Correlation Integral

Discovering the Neighborhood

Multi-granularity Deviation Factor (MDEF)

Multivariate Algorithm Ensembles

Ensemble Types

COF Combinations

Incorporating LOCI

Test and Website Updates

Unit Test Updates

Website Updates

Conclusion

Chapter 12: Copula-Based Outlier Detection (COPOD)

Copula-Based Outlier Detection

What’s a Copula?

Intuition Behind COPOD

Implementing COPOD

Test and Website Updates

Unit Test Updates

Integration Test Updates

Website Updates

Conclusion

Part IV: Time Series Anomaly Detection

Chapter 13: Time and Anomalies

What Is Time Series?

Time Series Changes Our Thinking

Autocorrelation

Smooth Movement

The Nature of Change

Data Requirements

Time Series Modeling

(Weighted) Moving Average

Exponential Smoothing

Autoregressive Models

What Constitutes an Outlier?

Local Outlier

Behavioral Changes over Time

Local Non-outlier in a Global Change

Differences from Peer Groups

Common Classes of Technique

Conclusion

Chapter 14: Change Point Detection

What Is Change Point Detection?

Benefits of Change Point Detection

Change Point Detection with ruptures

Dynamic Programming

PELT

Implementing Change Point Detection

Test and Website Updates

Unit Tests

Integration Tests

Website Updates

Avenues of Further Improvement

Conclusion

Chapter 15: An Introduction to Multi-series Anomaly Detection

What Is Multi-series Time Series?

Key Aspects of Multi-series Time Series

What Needs to Change?

What’s the Difference?

Leading and Lagging Factors

Available Processes

Cross-Euclidean Distance

Cross-Correlation Coefficient

SameTrend (STREND)

Common Problems

Conclusion

Chapter 16: Standard Deviation of Differences (DIFFSTD)

What Is DIFFSTD?

Calculating DIFFSTD

Comparing the Norm

Determining Outliers

Test and Website Updates

Unit Tests

Integration Tests

Website Updates

Conclusion

Chapter 17: Symbolic Aggregate Approximation (SAX)

What Is SAX?

Motifs and Discords

Subsequences and Matches

Discretizing the Data

Implementing SAX

Segmentation and Blocking

Making SAX Multi-series

Scoring Outliers

Test and Website Updates

Unit and Integration Tests

Website Updates

Conclusion

Appendix: Bibliography

Index

About the Author

Kevin Feasel is a Microsoft Data Platform MVP and CTO at Faregame Inc., where he specializes in data analytics with T-SQL and R, forcing Spark clusters to do his bidding, fighting with Kafka, and pulling rabbits out of hats on demand. He is the lead contributor to Curated SQL, president of the Triangle Area SQL Server Users Group, and author of PolyBase Revealed. A resident of Durham, North Carolina, he can be found cycling the trails along the triangle whenever the weather's nice enough.

About the Technical Reviewer

Yin-Ting (Ting) Chou is currently a Data Engineer/Full-Stack Data Scientist at ChannelAdvisor. She has been a key member on several large-scale data science projects, including demand forecasting, anomaly detection, and social network analysis. Even though she is keen on data analysis, which drove her to obtain her master's degree in statistics from the University of Minnesota, Twin Cities, she also believes that the other key to success in a machine learning project is to have an efficient and effective system to support the whole model productizing process. To create the system, she is currently diving into the fields of MLOps and containers. For more information about her, visit www.yintingchou.com.
