Chapter 6. Real-Time Machine Learning Applications

Combining terms like “real time” and “machine learning” runs the risk of drifting into the realm of buzz and away from business problems. However, improvements in real-time data processing systems and the ready availability of machine learning libraries make it so that applying machine learning to real-time problems is not only possible, but in many cases simply requires connecting the dots on a few crucial components.

The stereotype about data science is that its practitioners operate in silos, issuing declarations about data from on high, removed from the operational aspects of a business. This mindset reflects the latency and difficulty associated with legacy data analysis and processing toolchains. In contrast, modern database and data processing system design must embrace modularity and accessibility of data.

Real-Time Applications of Supervised Learning

The power of supervised learning applies to numerous business problems. Regression is familiar to any data scientist, finance or risk analyst, or anyone who took a statistics class in college. What has changed recently is the availability of powerful data processing software that enables businesses to apply these tools to extremely low-latency problems.

Real-Time Scoring

Any system that automates data-driven decision making, for instance, will need to not only build and train a model, but use the model to score or make predictions. When developing a model, data scientists generally work in some development environment tailored for statistics and machine learning, such as Python, R, and Apache Spark. Using these tools, data scientists can train and test models all in one place and offer powerful abstractions for building models and making predictions while writing relatively little code.

Many data science tools are designed for interactive use by data scientists, rather than to power production systems.1 See Figure 6-1 for an example of such interactive data analysis. Even though interactive tools are great for development, they are not designed for extremely low-latency production applications. For instance, a digital advertising network has on the order of milliseconds to choose and display an advertisement before a web page loads. In many cases, primarily interactive tools do not offer low enough latency for production use cases.

Interactive data analysis in Python
Figure 6-1. Interactive data analysis in Python

This is why when we talk about supervised learning we need to distinguish the way we think about scoring in development versus in production. Scoring latency in a development environment is generally just an annoyance. In production, the reduction or elimination of latency is a source of competitive advantage. This means that, when considering techniques and algorithms for predictive analytics, you need to evaluate not only the bias and variance of the model, but how the model will actually be used for scoring in production.

Fast Training and Retraining

Sometimes, real-world factors change (in real time) the phenomenon you are trying to model. When this happens, a production system can actually be bound by training latency rather than scoring latency. For instance, the relative weight, or ability to affect the overall outcome, of certain features might change. In this case, you cannot simply train a model once and expect it to give accurate answers forever. This scenario is especially common in “of the moment” trend spotting like in social media and digital advertising.

Deciding on the most efficient algorithm for training a given type of model exceeds the scope of this book. For many machine learning techniques, this remains an open question. However, there are straightforward system design considerations to enable fast training.

In-memory storage
In-memory storage is a prerequisite for performing analytics on changing datasets. Locking and hardware contention make it difficult or impossible to manage fast-changing datasets on disk, let alone to make the data simultaneously available for machine learning.
Access to real-time and historical data
A major component in modeling changing systems is understanding when your model needs to change. Making this decision requires simultaneous access to both the most recent data, and historical data on which the current and/or previous models are based. Whereas an offline data lake might satisfy the needs of early-stage model development, modeling a dynamic system requires faster access to data.
Convergence of systems
Any large-scale data processing pipeline will include multiple systems, but limiting the number reduces latency associated with data transfer and other intersystem communication. For instance, suppose that you want to write a program that builds a model using data from the last fraction of a second. By, for instance, converging scoring and transaction processing in a single database, you save valuable time that would have been lost to data transfer. For real-time applications like digital advertising, a fraction of a second is the entire window for the transaction.

Unsupervised Learning

Given the nature of unsupervised learning, its real-time applications can be less intuitive than the applications for supervised learning. Recall that unsupervised learning occurs on unlabeled data, which it is commonly associated with more offline tasks like data mining. Many unsupervised learning problems do not have an analogue to real-time scoring in a regression or classification model. However, advances in data processing technology have opened opportunities to use unsupervised learning techniques like clustering and anomaly detection in real-time capacities.

Real-Time Anomaly Detection

One of the most promising applications of real-time unsupervised learning is real-time anomaly detection, which you can use to strengthen monitoring applications; for example, Internet security and industrial machine data.

The nature of unsupervised learning in general, and anomaly detection in particular, is that you do not know exactly what you are looking for. Depending on the nature of the dataset and whether and how it changes over time, clusters and what constitutes an outlier can also change.

Suppose that you are monitoring network traffic. You might track information like the IP addresses, average response times, and amount of data sent over the network. The values of some of these features can change dramatically based on factors like amount of network traffic, and even factors completely beyond the scope of your system, such as an Internet Service Provider outage. Data that indicates a “normal” network user might be very different under unusual circumstances.

Real-Time Clustering

There are many scenarios for which solving a clustering problem will require frequent retraining. First, though, here’s a corollary to the previous section about anomaly detection. As discussed in Chapter 5, clustering techniques are often used to detect anomalies because, in finding clusters of similar data, everything left out of a cluster is by definition an anomaly. There are also straightforward “clustering as clustering” real-time applications for which the groupings of data are changing. A high-traffic digital media company, for instance, might want to build a clustering model to determine which videos and articles are consumed together. Given modern news cycles and the dissemination of viral content, it is easy to envision the clustering of “related” videos and articles changing rapidly.

Even when the underlying phenomenon you are modeling is relatively static, meaning clusters are not shifting in real time, there can still be immense value to frequent retraining. As more training data becomes available, retraining with more information can yield different clustering results.

1 Apache Spark has a wide range of both production and development uses. The claim here is merely that starting and stopping Spark jobs is time and resource-intensive, which prevents it from being used as a real-time scoring engine.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.116.49.243