Most of this book is about managing machine learning systems and production level ML pipelines. This involves work that is quite different from the work often performed by many data scientists and machine learning researchers, who ideally spend their days trying to develop new predictive models and methods that can squeeze out another percentage point of accuracy. Instead, in this book, we focus on ensuring that a system that includes an ML model exhibits consistent, robust, and reliable system level behavior. In some ways, this system level behavior is independent of the actual model type, how good the model is, or other model-related considerations. Still, in certain key situations, it is not independent of these considerations. Our goal in this chapter is to give you enough background to understand which situation you are in when the alarms start to go off or the pagers start to fire for your production system.
We will say at the outset that our goal here is not to teach you everything about how to build machine learning models, which models might be good for what problems, or how to become a data scientist. This would be a book (or more) all to itself, and there are many excellent texts and online courses that cover these aspects.
Instead of going too deep into the minutiae, in this chapter, we’ll give a very quick reminder about what ML models are and how they work. We’ll also provide some key questions that ML Ops folks should ask about the models in their system so they can understand what kinds of problems to plan for.
In mathematics or science, the word model refers to a rule or guideline, often expressible in math or code, that takes inputs and gives predictions about the way that the world might work in the future. For example, here is a famous model you might recognize:
E = mc²
This is a lovely little model that tells you how much energy (E) you are likely to get if you convert a given input amount of mass (m) into something super hot and explosive, and the constant c² tells you that you get really quite a lot of E out for even a little bit of m. This model was created by a smart person who thought carefully for a long time, and it holds up well in various settings. It doesn’t require a lot of maintenance, and generalizes beautifully to a huge range of settings, even those that were never imagined when it was first created.
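Because a model is a rule that can be expressed in code, this one fits in a single function. The sketch below is only an illustration: the function name and the one-gram example are ours.

```python
# Speed of light in a vacuum, in meters per second.
SPEED_OF_LIGHT_M_PER_S = 299_792_458


def energy_joules(mass_kg: float) -> float:
    """Predict the energy (in joules) equivalent to a given mass (in kilograms)."""
    return mass_kg * SPEED_OF_LIGHT_M_PER_S ** 2


# Even one gram of mass corresponds to an enormous amount of energy.
print(f"{energy_joules(0.001):.3e} J")  # 8.988e+13 J
```

There is one input, one constant, and nothing to retrain, which is exactly what the models in the rest of this chapter lack.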
The models we typically deal with in machine learning are similar in some ways. They take inputs and give outputs that are often thought of as predictions, using rules that are expressible in mathematical notation or code. These predictions could represent the physical world, like what’s the probability that it will rain tomorrow in Seattle? Or they could represent quantities, like how many units of yarn will sell next month from our online store, yarnit.ai? Or they could even represent abstract human concepts, like is this picture aesthetically pleasing to users on average?
One key difference is that the models we typically use ML for are ones for which we cannot write down a neat little rule like E = mc², no matter how smart we are. We turn to ML in settings where many pieces of information -- often called features -- need to be taken into account in ways that are hard for humans to specify in advance. Some examples of data that can be processed as features are atmospheric readings from thousands of locations, the color values of thousands of pixels in an image, or the purchase history of all users who have recently visited an online store. When dealing with sources of information that are this complex -- much more than one value for mass and one scaling constant -- it is typically difficult or impossible for human experts to create and verify reliable models that take advantage of the full range of available information. In these cases, we turn to using vast amounts of previously observed data. In using data to train a model, we hope the resulting model will both fit our past data well and also generalize to predict well on new, previously unseen data in the future.
The basic process that is most widely used for creating ML models right now — formally called supervised machine learning — looks like this.
To start, we gather a whole bunch of historical data about our problem area. This might be all of the atmospheric sensor readings from the Pacific Northwest for the last ten years, or a collection of half a million images, or logs of user browsing history from the online yarnit.ai store. We extract from that data a set of features, or specific, measurable properties of the data. Features represent key qualities of the data in a way that ML models can easily digest. In the case of numerical data, this might mean scaling values down to fit nicely within certain ranges. For less structured data, we might engineer some specific quantities to identify and pull out of the raw data. Examples of input features include the atmospheric pressure for each of 1000 sensor locations, or the specific red, green, and blue color values for each pixel location, or a set of features corresponding to each possible product, with a value of 1 if a given user has viewed that product and 0 if they have not.
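Two of the transformations just described can be sketched in a few lines. This is a minimal illustration rather than a recommended feature pipeline; the function names, pressure readings, and product catalog are all made up.

```python
from typing import List, Set


def min_max_scale(values: List[float]) -> List[float]:
    """Rescale numeric values so that they fall within the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to scale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def viewed_product_features(viewed: Set[str], catalog: List[str]) -> List[int]:
    """One feature per catalog product: 1 if the user viewed it, 0 if not."""
    return [1 if product in viewed else 0 for product in catalog]


# Hypothetical raw data, for illustration only.
pressure_hpa = [990.0, 1005.0, 1020.0]
catalog = ["merino", "alpaca", "cotton", "mohair"]

print(min_max_scale(pressure_hpa))                             # [0.0, 0.5, 1.0]
print(viewed_product_features({"alpaca", "mohair"}, catalog))  # [0, 1, 0, 1]
```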
For supervised ML, we also require a label of some kind, showing the historical outcome that we would like our model to predict, if it were to see a similar situation in the future. This could be the weather result for the given day, like 1 if it rained and 0 if it did not. Or it could be a score attempting to capture whether a given user found an image to be aesthetically pleasing, such as 1 if they gave it a “thumbs up” and 0 if they did not. Or it could be a value showing the number of units of that given yarn product that happened to be sold in the given month. We’ll record the given label for each entry -- and call each one a labeled example.
We’ll then train a model on this historical data, using some chosen model type and some chosen ML platform or service. Currently, many folks choose model types based on deep learning, also known as neural networks, which are especially effective when given very large amounts of data (think millions or billions of labeled examples). Neural networks build connections among layers of nodes based on the examples that are used in training.
In other settings, methods like Random Forests or Gradient Boosted Decision Trees¹ work well when presented with fewer examples. And more complex, larger models are not always preferred, either for predictive power or for maintainability and understandability. Those who are not sure which model type might work best often use methods like AutoML, which train many different model versions and try to automatically pick the best one. Even without AutoML, all models have some adjustments and settings -- called hyperparameters -- that must be set specifically for each task through a process called tuning, which is really just a fancy name for trial and error. Regardless, at the end, our training process will produce a model, which will take the feature inputs for new examples and produce a prediction as an output. The internals of the model will most likely be treated as a black box that defies direct inspection, often consisting of thousands, millions, or even billions of learned parameters that show how to combine the input features (and potentially re-combine them in many further intermediate stages) to create the final output prediction.
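To make "tuning is trial and error" concrete, here is a small sketch that tries several values of one hyperparameter and keeps whichever scores best on validation data. The model (a tiny k-nearest-neighbors classifier), the data, and the candidate values are all hypothetical stand-ins; real tuning applies the same loop to far larger models and search spaces.

```python
from collections import Counter
from typing import Dict, List, Tuple

Example = Tuple[float, int]  # (single feature value, label)


def knn_predict(train: List[Example], x: float, k: int) -> int:
    """Predict the majority label among the k training points nearest to x."""
    nearest = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train: List[Example], held_out: List[Example], k: int) -> float:
    """Fraction of held-out examples that are predicted correctly."""
    correct = sum(knn_predict(train, x, k) == y for x, y in held_out)
    return correct / len(held_out)


def tune_k(train: List[Example], validation: List[Example],
           candidates: List[int]) -> Tuple[int, Dict[int, float]]:
    """Trial and error: score every candidate k and keep the best one."""
    scores = {k: accuracy(train, validation, k) for k in candidates}
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Toy data: the label is mostly 1 when x >= 5, with two mislabeled points.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 1),
         (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 0)]
validation = [(2.5, 0), (3.8, 0), (4.4, 0), (5.5, 1), (7.5, 1), (8.8, 1)]

best_k, scores = tune_k(train, validation, candidates=[1, 3, 5])
print(best_k, scores)
```

On this toy data, k = 1 is misled by the mislabeled points, and larger values of k do better. Nothing in the code knows that in advance; it simply tries each setting and measures.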
From the perspective of systems engineering, we might notice something disconcerting. Typically, no human knows what the “right” model is. This means that we can’t look at a model produced by training and know if it is good or bad just from our knowledge. The best we can do is run it through various stress tests and validations and see how it performs², and hope that the stress tests and validations that we come up with are sufficient to capture the range of situations that the model will be asked to handle.
The most basic form of validation is to take some of the training data that we prepared and randomly hold some out to the side, calling it a held out test set. Instead of using this for training, we wait and use it to stress test our model, reasoning that because the model has never seen this held out data in training, it can serve well as the kind of previously unseen data for which we hope it will make useful predictions. So we can feed each example in this test set to the model and compare its prediction to the true outcome label to see how well it measures up. It is critical that this validation data is indeed held out from training, because it is all too easy for a model to experience overfitting, which happens when the model memorizes its training data perfectly but cannot predict well on new unseen data.
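The sketch below shows both ideas at once: randomly holding out a test set, and a deliberately overfit "model" that memorizes its training data. The class, function names, and toy data are ours, chosen only to make the effect easy to see.

```python
import random
from typing import List, Tuple

Example = Tuple[int, int]  # (feature value, label)


def split_held_out(examples: List[Example], test_fraction: float = 0.2,
                   seed: int = 0) -> Tuple[List[Example], List[Example]]:
    """Randomly set aside a fraction of the examples as a held out test set."""
    shuffled = examples[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


class MemorizingModel:
    """A deliberately overfit model: it memorizes its training labels."""

    def fit(self, train: List[Example]) -> "MemorizingModel":
        self.table = {x: y for x, y in train}
        return self

    def predict(self, x: int) -> int:
        # Inputs never seen in training fall back to a constant guess of 0.
        return self.table.get(x, 0)


def accuracy(model: MemorizingModel, examples: List[Example]) -> float:
    return sum(model.predict(x) == y for x, y in examples) / len(examples)


# Toy data: the label is 1 when the feature value is even.
data = [(x, 1 if x % 2 == 0 else 0) for x in range(100)]
train, test = split_held_out(data)
model = MemorizingModel().fit(train)

print("train accuracy:", accuracy(model, train))  # perfect: it was memorized
print("test accuracy: ", accuracy(model, test))   # no better than guessing
```

Had we evaluated only on the training data, this model would have looked flawless.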
Once we are happy with the validation of our model, it is time to deploy it in the overall production system. This means finding a way to serve the model predictions as they are required. One way to do this is by precomputing all possible predictions for all possible inputs and then writing those to a cache that our system can query. Another common method is to create a flow of inputs from the production system and transform them into features so that they can be fed to a copy of our model on demand, and then feed the resulting predictions from the model back into the larger system for use. Both of these options can be understood in context in Chapter 7: Serving.
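The two serving options can be contrasted in a short sketch. Everything here is a hypothetical stand-in: the "model" is a fixed weighted sum, and the request fields (`price`, `in_stock`) are invented for illustration.

```python
from typing import Dict, List


def toy_model(features: List[float]) -> float:
    """Stand-in for a trained model: a fixed weighted sum of two features."""
    return 0.7 * features[0] + 0.3 * features[1]


def precompute_predictions(model, inputs: Dict[str, List[float]]) -> Dict[str, float]:
    """Option 1: score every known input ahead of time and cache the results."""
    return {key: model(features) for key, features in inputs.items()}


def to_features(raw_request: Dict[str, float]) -> List[float]:
    """Transform a raw production request into the features the model expects."""
    return [raw_request["price"] / 100.0, float(raw_request["in_stock"])]


def predict_on_demand(model, raw_request: Dict[str, float]) -> float:
    """Option 2: build features from a live request and call the model directly."""
    return model(to_features(raw_request))


# Option 1: a cache lookup at serving time.
cache = precompute_predictions(
    toy_model, {"merino": [0.25, 1.0], "alpaca": [0.40, 0.0]})
print(cache["merino"])

# Option 2: live scoring of an equivalent request.
print(predict_on_demand(toy_model, {"price": 25.0, "in_stock": 1}))
```

The cache is simple and fast but only works when the set of possible inputs is small and known in advance. On-demand scoring handles arbitrary inputs but puts the feature transformation code on the serving path.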
If everything works perfectly from here, we are done. However, the world is full of imperfections, and our models of the world even more so. In the ML Ops world even the smallest of them can potentially create serious issues, and we will need to be prepared. Let’s take a quick tour of some of the areas that can turn into problems.
The term “model” is often used imprecisely to refer to three separate, but related, concepts:
Model architecture: The general strategy that we are using to learn in this application. This can include both the choice of model family, such as Deep Neural Network or Random Forest, and also structural choices such as how many layers in the DNN or how many trees in the random forest.
Configured Model: The configuration of the model plus the training environment, and the type and definition of data we will train on. This includes the full set of features that are used, all hyperparameter settings, random seeds used to initialize a model, and any other aspect that would be important for defining or reproducing a model. It is reasonable to think of this as the closure of the entire training environment, although many people don’t think through the reliability or systems implications of this (in particular that a very large set of software and data must be carefully and mutually versioned in order to get reasonable amounts of reproducibility).
Trained Model: A specific snapshot or instantiated representation of the configured model trained on some specific data at a point in time. Note that some of the software that we use in machine learning, especially in distributed deployments, has a fair bit of nondeterminism. As a result, the exact same Configured Model trained twice on the exact same data may or may not produce a significantly different Trained Model.
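One way to keep these three meanings apart in code is to give each its own record. The sketch below is only an assumption about how one might do this; the class and field names (including `code_version` and `training_data_snapshot`) are hypothetical and not an industry standard.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ConfiguredModel:
    """Everything needed to define, and try to reproduce, a training run."""
    architecture: str                  # e.g. "random_forest" or "dnn"
    hyperparameters: Dict[str, float]
    feature_names: List[str]
    random_seed: int
    code_version: str                  # hypothetical: version of the training code

    def fingerprint(self) -> str:
        """Stable hash of the configuration, usable as a version identifier."""
        canonical = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class TrainedModel:
    """One snapshot produced by training a configuration on specific data."""
    config_fingerprint: str
    training_data_snapshot: str        # hypothetical identifier for the data used
    trained_at: str


config = ConfiguredModel(
    architecture="random_forest",
    hyperparameters={"n_trees": 200.0, "max_depth": 8.0},
    feature_names=["price", "views_last_7d"],
    random_seed=7,
    code_version="abc123",
)
snapshot = TrainedModel(config.fingerprint(), "purchases-2024-01", "2024-02-01T00:00:00Z")
print(snapshot)
```

Two Trained Models can share a fingerprint and still behave differently, because of the nondeterminism noted above. The fingerprint identifies the configuration, not the result of training it.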
Although in this book we will try to not confuse these concepts, readers should be aware that they are not carefully differentiated in the industry as a whole and we may be occasionally unclear as well. Most people call all three of these concepts “the model”. Even these terms are ours as there is no industry standard terminology to refer to the different uses of “model”.
From a systems reliability standpoint, a system that relies on ML has several areas of vulnerability where things can go wrong. We will touch on a range of them briefly here, with a much deeper dive coming in later chapters on many of these topics. Note that here we are not referring to security vulnerabilities as that term is sometimes used but rather to structural or systemic weaknesses that may be the source of failures or model quality problems. This section is an attempt to enumerate some of the most common ways that models fail.
Training data is a foundational element that defines much of the behavior of our systems. The qualities of the training data establish the qualities and behavior of our models and systems, and imperfections in training data can be amplified in surprising ways. From this lens, the role of data in ML systems can be seen as analogous to code in traditional systems. But unlike traditional code, whose behavior can be specified with preconditions, postconditions, and invariants, and can be rigorously tested, real world data typically has organic flaws and irregularities. These can lead to a variety of problem types.
The first problem to consider is that our training data may be incomplete in various ways. Imagine if we had atmospheric pressure sensors that stopped working at temperatures below freezing, so we had no data on cold days. This would create a blind spot for the model, especially when it was asked to make predictions for cold days. One of the difficulties here is that these kinds of problems cannot be detected using a held out test set, because by definition the held out set is a random sample of the data that we already have access to. Detecting these issues requires a combination of careful thought and hard work to gather or synthesize additional data that can help expose these flaws at validation time or correct them at training time.
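A held out test set cannot reveal this kind of blind spot, but comparing the inputs seen in production against the range covered by training data can. Here is a minimal sketch of such a check for a single numeric feature; the function name and the temperature values are hypothetical.

```python
from typing import List


def uncovered_values(training_values: List[float], new_values: List[float],
                     margin: float = 0.0) -> List[float]:
    """Return the new values that fall outside the range seen in training."""
    lo = min(training_values) - margin
    hi = max(training_values) + margin
    return [v for v in new_values if v < lo or v > hi]


# Training data from sensors that stopped reporting below freezing.
training_temps_c = [0.5, 3.0, 12.0, 25.0, 31.0]
# Temperatures at which the deployed model is now asked to predict.
serving_temps_c = [-8.0, 2.0, 18.0, -1.5]

print(uncovered_values(training_temps_c, serving_temps_c))  # [-8.0, -1.5]
```

A range check on one feature is a crude approximation of coverage, since gaps can also hide in combinations of features, but it is cheap and catches cases like the frozen sensors.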
A special form of incomplete coverage happens when there are correlations in the data that do not always hold in the real world. For example, consider the case where all of the images marked “aesthetically pleasing” happened to include white gallery walls in the background, while none of the images marked “not aesthetically pleasing” did. Training on this data would likely result in a model that showed very high accuracy on held out test data, but was essentially just a white wall detector. Deploying this model on real data could have terrible results, even while showing excellent performance on held out test data. Again, uncovering such issues requires careful consideration of the training data, and targeted probing of the model with well chosen examples to stress test its behavior in a variety of situations. An important class of such problems can occur when societal factors create a lack of inclusion or representation for certain groups of people; these issues were discussed in Chapter 3, Fairness and Privacy.
Many production modeling systems collect additional data over time in deployment and are retrained or updated to incorporate that data. Such systems can suffer from a cold start problem when there is little initial data. This can happen for systems that were not set up to collect data initially, like if a weather service wanted to use a newly created network of sensors that had never been previously deployed. It can also arise in recommender systems, such as those that recommend various yarn products to users and then observe user interaction data for training. These systems have little or no training data at the very start of their lifecycle, and can also encounter cold start problems for individual items when new products are released over time.
Many models are used within feedback loops, in which they filter data, recommend items, or select actions that then create or influence the model’s training data in the future. A model that helps to recommend yarn products to users will likely only get future feedback on the few items near the top of the rankings that are actually shown to users. A conversational agent likely only gets feedback on the sentences that it chooses to create. This can cause a situation in which data that is not selected never gets positive feedback, and thus never rises to prominence in the model’s estimation. Solving this often requires some amount of intentional exploration of lower ranked data, occasionally choosing to show or try data or actions that the model currently thinks are not as good, in order to ensure a reasonable flow of training data across the full spectrum of possibilities.
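One simple way to implement intentional exploration is an epsilon-greedy rule: most of the time show what the model ranks highest, and occasionally show something else. The text above does not prescribe a particular method, so this is just one common choice, and the function and parameter names are ours.

```python
import random
from typing import List, Sequence


def choose_items(ranked: Sequence[str], n_slots: int, epsilon: float,
                 rng: random.Random) -> List[str]:
    """Fill n_slots from a ranked list, occasionally exploring lower ranks.

    With probability epsilon, a slot is filled with a uniformly random
    remaining item instead of the highest-ranked remaining item.
    """
    remaining = list(ranked)
    shown: List[str] = []
    for _ in range(min(n_slots, len(remaining))):
        if rng.random() < epsilon:
            pick = rng.choice(remaining)   # explore
        else:
            pick = remaining[0]            # exploit the model's ranking
        remaining.remove(pick)
        shown.append(pick)
    return shown


ranked_yarns = ["merino", "alpaca", "cotton", "mohair", "linen"]
print(choose_items(ranked_yarns, 3, epsilon=0.0, rng=random.Random(0)))  # top three only
print(choose_items(ranked_yarns, 3, epsilon=0.2, rng=random.Random(0)))  # sometimes explores
```

The value of epsilon is a trade-off: every explored slot may show a user something worse today in exchange for better training data tomorrow.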
It is tempting to think of data as a reflection of reality, but unfortunately it is really just a historical snapshot of a part of the world at a particular time. This distinction can appear a little bit philosophical, but becomes mission critical at times when real world events cause changes. Indeed, there may be no better way to convince the executive leadership in an organization of the importance of ML Ops than to ask the question “What would happen to our models if tomorrow there was another COVID-style lockdown?”
For example, imagine if our model were in charge of helping to recommend hotels to users, and based its learning on feedback from previous user bookings. It is easy to imagine a scenario in which a COVID-style lockdown created a sudden sharp drop in hotel bookings, meaning that a model trained on pre-lockdown data was now extremely over-optimistic. As it learned on newer data in which many fewer bookings occurred, it is also easy to imagine that it may do very badly in a world in which lockdowns were then later eased and users wished to book many more hotel rooms again -- only to find that the system recommended none.
These forms of interactions and feedback loops based on real world events are not limited to major world disasters. Here are some others that might occur in various settings:
Election night in a given country causes suddenly different view behavior on videos.
The introduction of a new product quickly causes a spike in user interest on a certain kind of wool, but our model doesn’t have any previous information about it.
A model for predicting stock price makes an error, over-predicting for a certain stock. The automated hedge fund using this model then incorrectly buys that stock -- raising the price in the real market and causing other automated hedge fund models to follow suit.
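In cases like the hotel example, a basic monitor can compare what the model predicts with what actually happens and raise a flag when the two drift apart. The sketch below is one simple way to do that, assuming the model outputs a booking probability per session. The tolerance and all names are hypothetical.

```python
from typing import List


def calibration_ratio(predicted_probs: List[float], outcomes: List[int]) -> float:
    """Mean predicted rate divided by the observed rate (1.0 is ideal)."""
    predicted = sum(predicted_probs) / len(predicted_probs)
    observed = sum(outcomes) / len(outcomes)
    if observed == 0:
        return float("inf")  # the model predicted events, but none occurred
    return predicted / observed


def is_drifting(predicted_probs: List[float], outcomes: List[int],
                tolerance: float = 0.25) -> bool:
    """Flag when predictions are off from reality by more than the tolerance."""
    return abs(calibration_ratio(predicted_probs, outcomes) - 1.0) > tolerance


# Before the lockdown: predictions of 20% match 2 bookings in 10 sessions.
print(is_drifting([0.2] * 10, [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]))  # False

# After the lockdown: still predicting 20%, but only 1 booking in 20 sessions.
print(is_drifting([0.2] * 20, [1] + [0] * 19))                  # True
```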
In supervised machine learning, the training labels provide the “correct answer” showing the model what it should be trying to predict for a given example. The labels are critical guidance that show the model what its objective is, often defined as some numerical score. Some examples include:
A score of 1 for “spam” and 0 for “not spam” for an email spam filter model
An amount of daily rainfall, in millimeters, for a given day in Seattle
A set of labels for each possible word that might complete a given sentence, with 1 if that was the actual word that completed the given sentence and 0 if it was any other word
A set of labels for each category of object in a given image, with a 1 if that category of object appears prominently in the image and 0 if it does not
A numerical score showing how strongly a given antibody protein binds to a given virus in a wet lab experiment
Because labels are so important to model training, it is easy to see that problems with labels can be the root cause for many downstream model problems. Let’s look at a few.
In statistical language, the term “noise” is a synonym for errors. If our provided labels are incorrect for some reason, these errors can propagate into the model behavior. Random noise can actually be tolerable in some settings if the errors balance out over time, although it is still important to measure and assess. More damaging can be errors that occur in certain parts of the data, such as if a human labeler consistently mis-identifies frogs as toads for an aquatic image model, or a given set of users is consistently fooled by a certain kind of email spam message, or there is a contamination in a wet lab experiment that keeps a given set of antibodies from binding to a given class of viruses. It is therefore critical to regularly inspect and monitor the quality of the labels and address any issues. In systems in which human experts are used to provide training labels, this can often mean taking painstaking care over documentation of the task specifications and providing detailed training for the humans themselves.
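One way to separate random noise from systematic errors is to re-label a small audit sample with trusted labels and count which mistakes each labeler makes. The sketch below assumes such an audit sample exists; the row format and labeler names are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

AuditRow = Tuple[str, str, str]  # (labeler_id, trusted_label, given_label)


def labeler_confusions(audit_rows: List[AuditRow]) -> Dict[str, Counter]:
    """Count each labeler's mistakes as (trusted_label, given_label) pairs."""
    mistakes: Dict[str, Counter] = defaultdict(Counter)
    for labeler, trusted, given in audit_rows:
        if trusted != given:
            mistakes[labeler][(trusted, given)] += 1
    return mistakes


audit = [
    ("labeler_a", "frog", "frog"),
    ("labeler_a", "toad", "toad"),
    ("labeler_a", "newt", "frog"),   # a one-off slip
    ("labeler_b", "frog", "toad"),   # the same mistake, repeated
    ("labeler_b", "frog", "toad"),
    ("labeler_b", "frog", "toad"),
    ("labeler_b", "toad", "toad"),
]

for labeler, counts in labeler_confusions(audit).items():
    print(labeler, counts.most_common())
```

A single repeated confusion, like frogs labeled as toads, points to a gap in the task instructions rather than to random noise.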
Machine learning training methods tend to be extremely effective at learning to predict the labels we provide -- sometimes so good that they uncover differences between what we had hoped the labels mean and what they actually represent. For example, if our goal is to make customers in the yarnit.ai store happy over time, it can be easy to hope that a “purchase” label correlates with a satisfied user session. This might lead to a model that over-fixates on purchases, perhaps learning over time to promote products that appear to be good deals but in fact are of disappointing quality. As another example, consider the problem of using user clicks as a signal for user satisfaction with news articles -- this could lead to models that highlight salacious “click bait” headlines or even “filter bubble” effects in which users are not shown news articles that disagree with their preconceptions.
Many systems rely on signals from users or observations of human behavior to provide training labels. For example, some email spam systems allow users to label messages as “spam” or “not spam”. It is easy to imagine that a motivated spammer may try to fool such a system by sending many spam emails to accounts under their own control and labeling them as “not spam” in an attempt to poison the overall model. It is also easy to imagine that a model that attempts to predict how many “stars” a certain product will receive in user reviews may be vulnerable to bad actors who try to over-rate their own products -- or to under-rate those of their competitors. In such settings, careful security measures and monitoring for suspicious trends are critical parts of long term system health.
In addition to problems with developing a complete and representative data set, or labeling examples correctly, we can encounter threats to the model during the model training process.
Some models are trained once and then rarely or never updated. But most models will be updated at some point in their lifecycle. This might happen every few months as another batch of wet lab data comes in from antibody testing, or it might happen every week to incorporate a new set of image data and associated object labels, or it might happen every few minutes in a streaming setting to update with new data based on users browsing and purchasing various yarn products. On average, each new update is expected to improve the model overall -- but in specific cases the model might get worse, or even completely break. Here are some possible causes of such headaches.
As we discussed in the brief overview of a typical model lifecycle, a good model will generalize well to new, previously unseen data and not just narrowly memorize its training data. Each time a model is retrained or updated, we need to check for overfitting using held out validation data. But if we consistently re-use the same validation data, there is a risk that we may end up implicitly overfitting to this re-used validation data. For this reason, it is important to refresh validation data on a regular basis and ensure we are not fooling ourselves. For example, if our validation dataset about purchases on yarnit.ai is never updated, while our customers’ behavior changes over time to favor brighter wools, our models will fail to track this change in purchase preferences because we will score models that learn this behavior as being of “lower quality” than models that do not. It is important that model quality evaluation include real-world confirmation of a model’s performance.
Each time a model is retrained, there is no guarantee that its predictions are stable from one model version to another. That is, one version of a model might do a great job on recognizing cats and poorly at dogs, while another version might be quite a bit better on dogs and less good on cats -- even if both models have similar aggregate accuracy on validation data. In some settings this can become a significant problem. For example, imagine a model that is used to detect credit card fraud and shut down credit cards that may have been compromised. A model with 99% accuracy might be very good in general, but if the model is retrained each day and makes mistakes on a different 1% of users each day, then after three months (roughly 90 retrainings) as much as 90% of the user base may have been inconvenienced by faulty model predictions. It is important to evaluate model quality on relevant subsections of the predictions that we care about.
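Aggregate accuracy hides this kind of instability, but two extra measurements expose it: accuracy broken out by slice, and the fraction of examples on which two model versions disagree (sometimes called prediction churn). The sketch below uses a tiny, made-up cats-and-dogs validation set.

```python
from typing import Dict, List


def accuracy(preds: List[str], labels: List[str]) -> float:
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def accuracy_by_slice(preds: List[str], labels: List[str]) -> Dict[str, float]:
    """Accuracy computed separately for each true label."""
    result = {}
    for value in sorted(set(labels)):
        rows = [(p, y) for p, y in zip(preds, labels) if y == value]
        result[value] = sum(p == y for p, y in rows) / len(rows)
    return result


def churn(preds_a: List[str], preds_b: List[str]) -> float:
    """Fraction of examples on which two model versions disagree."""
    return sum(a != b for a, b in zip(preds_a, preds_b)) / len(preds_a)


labels = ["cat"] * 5 + ["dog"] * 5
# Version A is strong on cats and weak on dogs; version B is the reverse.
version_a = ["cat"] * 5 + ["cat", "cat", "cat", "dog", "dog"]
version_b = ["dog", "dog", "dog", "cat", "cat"] + ["dog"] * 5

print(accuracy(version_a, labels), accuracy(version_b, labels))  # 0.7 0.7
print(accuracy_by_slice(version_a, labels))  # {'cat': 1.0, 'dog': 0.4}
print(accuracy_by_slice(version_b, labels))  # {'cat': 0.4, 'dog': 1.0}
print(churn(version_a, version_b))           # 0.6
```

Both versions would pass a check on aggregate accuracy alone, yet they disagree on more than half of the examples.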
Deep learning methods have become so important in recent years due to their ability to achieve extremely strong predictive performance in many areas. However, deep learning methods also come with a specific set of fragilities and potential vulnerabilities. Because they are so widely used, and because they have specific peculiarities and concerns from an ML Ops perspective, we will go into some detail here on deep learning models in particular.
When deep learning models are trained from scratch -- with no prior information -- they begin from a randomized initial state and are fed a huge stream of training data in randomized order, most often in small batches. The model makes its best current prediction (which early on is totally terrible since the model has not learned very much yet) and then is shown the correct label. The difference between the prediction and the label is used to compute the loss gradient, and a small update is then applied to the model internals that tries to make the model slightly better. This process of small corrections on randomly ordered mini batches is called Stochastic Gradient Descent (SGD) and is repeated many millions or billions of times. We stop training once we decide that the model shows good performance on held out validation data.
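The loop just described can be written out for the simplest possible model, a straight line with two learned parameters. This is a teaching sketch under our own assumptions (squared-error loss, a fixed number of passes over the data, made-up settings); real deep learning frameworks compute the gradients automatically and handle vastly larger models.

```python
import random
from typing import List, Tuple


def sgd_linear_regression(examples: List[Tuple[float, float]],
                          learning_rate: float = 0.05, batch_size: int = 4,
                          epochs: int = 200, seed: int = 0) -> Tuple[float, float]:
    """Fit y ~ w * x + b with mini-batch stochastic gradient descent."""
    rng = random.Random(seed)
    # Begin from a randomized initial state.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    data = list(examples)
    for _ in range(epochs):
        rng.shuffle(data)  # present the data in randomized order
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Gradient of the mean squared error over this mini batch.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            # Apply a small update that makes the model slightly better.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
    return w, b


# Noise-free toy data generated from y = 2x + 1.
examples = [(x / 10, 2 * (x / 10) + 1) for x in range(-10, 11)]
w, b = sgd_linear_regression(examples)
print(round(w, 3), round(b, 3))  # close to 2 and 1
```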
The key insights about this process are:
Training involves randomness: the initial state is random, and data is shown in random order. In large scale parallelized settings there is even randomness inherent in the way that updates are processed due to network and parallel computation effects. This means that repeating the process of training the same model with the same settings on the same data can lead to substantially different final models.
The model performance on held out validation data generally improves with additional training steps, but sometimes bounces around early on, and later on can often get significantly worse if the model starts to overfit the training data by memorizing it too closely. We stop training and choose the best model that we can when performance converges to a good level, often choosing a checkpoint version that shows good behavior at some intermediate point. Unfortunately, there is no formal way to know if the performance we see now is the best we could get if we were to let training continue on further, and there is indeed a double descent phenomenon (sometimes described as a double dip) that was discovered relatively recently³ for a wide range of models that shows that our previous conceptions of when to stop may not have been optimal.
The reason that we take little steps instead of big ones is that it is easy to fall off a mathematical cliff and put the model internals into a state from which it is hard to recover. Indeed, the phrase exploding gradients is a technical term that does indeed connote the appropriate danger. Models that end up in this state often give NaN (not a number) values in predictions or intermediate computations. This behavior can also surface as the model’s validation performance suddenly worsening.
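Two small guards are often placed inside a training loop to deal with this failure mode. The first detects non-finite values so that training can stop before a broken model is saved. The second, gradient clipping, is a common mitigation that is not described above and is included here as an assumption about what one might add. Both function names are ours.

```python
import math
from typing import List


def all_finite(values: List[float]) -> bool:
    """True if no value is NaN or infinite."""
    return all(math.isfinite(v) for v in values)


def clip_by_norm(gradient: List[float], max_norm: float) -> List[float]:
    """Scale a gradient down so that its Euclidean norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in gradient))
    if norm <= max_norm or norm == 0.0:
        return list(gradient)
    scale = max_norm / norm
    return [g * scale for g in gradient]


print(all_finite([0.1, -2.5]))            # True
print(all_finite([0.1, float("nan")]))    # False
print(clip_by_norm([30.0, 40.0], 5.0))    # direction kept, length reduced to 5
```

Neither guard fixes the underlying cause, which is usually a learning rate that is too high or bad input data, but together they turn a silent failure into a visible one.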
As has been mentioned, hyperparameters are the various numeric settings that must be tuned for an ML model to achieve best performance on any given task or data set. The most obvious of these is the learning rate, which controls how small each update step should be. The smaller the steps, the less likely the model is to explode or result in strange predictions, but the longer training takes and the more computation is used. There are other settings that also have significant effects, like how large or complex the internal state is, how large the little batches are, and how strongly to apply various methods that try to combat overfitting. Deep learning models are notoriously sensitive to such settings, which means that significant experimentation and testing is required.
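The learning rate trade-off is easy to see on a one-parameter toy problem. The sketch below minimizes f(w) = w² with plain gradient descent at three hypothetical learning rates. It is far simpler than a real model, but the same pattern of too slow, about right, and divergent shows up at scale.

```python
def gradient_descent(learning_rate: float, steps: int = 50, start: float = 1.0) -> float:
    """Minimize f(w) = w**2 by gradient descent and return the final w."""
    w = start
    for _ in range(steps):
        w -= learning_rate * 2 * w  # the gradient of w**2 is 2w
    return w


for lr in (0.01, 0.1, 1.1):
    print(f"learning rate {lr}: final |w| = {abs(gradient_descent(lr)):.3g}")
```

With the smallest rate the parameter is still far from the minimum at 0 after 50 steps. With the middle rate it has essentially arrived. With the largest rate each step overshoots by more than it corrects, and the value grows without bound.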
The training methodology of SGD is effective at producing good models, but relies on a truly enormous number of tiny updates. Indeed, the amount of computation used to train some models can look like thousands of cores running continuously for several weeks.
Deep learning methods extrapolate from their training data, which means that the more unfamiliar a new previously unseen data point is, the more likely we are to have an extreme prediction that might be completely off base or out of the range of typical behavior. These sorts of highly confident errors can be a significant source of system level misbehavior.
Now that we understand the structure of what can go wrong with model creation, it might be useful to take a broader perspective on the infrastructure required to train models in the first place.
Models are just one component in larger ML systems, which typically rely on significant infrastructure to support model training, validation, serving, and monitoring in one or more pipelines. These systems thus inherit all of the complexities and vulnerabilities of traditional (non-ML) pipelines and systems in addition to those of ML models. We will not go into all of those traditional issues here, but will highlight a few areas in which traditional pipeline issues come to the foreground in ML-based systems.
Modern ML systems are often built on top of one (or more!) ML frameworks, such as TensorFlow, PyTorch, or scikit-learn or even an integrated platform such as AzureML, SageMaker, or VertexAI. From a modeling perspective, these platforms allow model developers to create models with a great degree of speed and flexibility, and in many cases to enable the use of hardware accelerators such as GPUs and TPUs or cloud-based computation without a lot of extra work.
From a systems perspective, the use of such platforms introduces a set of dependencies that is typically outside our control. We may be hit with package upgrades, fixes that may not be backwards compatible, or components that may not be forwards compatible. Bugs that are found may be difficult to fix, or we may need to wait for the platform owners to prioritize accepting the fixes we propose. On the whole, the benefits of using such frameworks or platforms nearly always outweigh these drawbacks, but there are still costs that must be considered and factored into any long term maintenance plan or ML Ops strategy.
An additional consideration is that because these platforms are typically created as general purpose tools, we will typically need to create a significant number of adapter components or glue code that can help us transform our raw data or features into the correct formats to be used by the platform, and to interface with the models at serving time. This glue code can quickly become significant in size, scope, and complexity, and is important to support with testing and monitoring at the same level of rigor as the other components of the system.
Extracting informative features from raw input data is a typical part of many ML systems, and may include such tasks as:
Tokenizing words in a textual description
Extracting price information from a product listing
Binning atmospheric pressure readings into one of five coarse buckets (often referred to as quantization)
Looking up time since last login for a user account
Converting a system timestamp into a localized time of day
Most such tasks are straightforward transformations of one data type to another. They may be as simple as dividing one number by another or involve some complex logic or requests to other subsystems. In any case, it is important to remember that bugs in feature generation are arguably the single most common source of errors in ML systems.
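A few of the transformations listed above are sketched below to show how small, and how easy to get subtly wrong, this kind of code tends to be. The regular expressions, bucket edges, and function names are illustrative assumptions rather than recommended definitions.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import List, Optional, Sequence


def tokenize(text: str) -> List[str]:
    """Lowercase a description and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def extract_price(listing: str) -> Optional[float]:
    """Pull the first dollar amount out of a product listing, if any."""
    match = re.search(r"\$(\d+(?:\.\d{1,2})?)", listing)
    return float(match.group(1)) if match else None


def bin_pressure(hpa: float,
                 edges: Sequence[float] = (980.0, 1000.0, 1013.0, 1030.0)) -> int:
    """Quantize a pressure reading into one of five coarse buckets (0 to 4)."""
    return sum(hpa >= edge for edge in edges)


def local_hour(utc_timestamp: float, utc_offset_hours: float) -> int:
    """Convert a UTC Unix timestamp into the local hour of day (0 to 23)."""
    tz = timezone(timedelta(hours=utc_offset_hours))
    return datetime.fromtimestamp(utc_timestamp, tz=tz).hour


print(tokenize("Soft Merino wool, 100g"))                # ['soft', 'merino', 'wool', '100g']
print(extract_price("Alpaca blend - $12.50 per skein"))  # 12.5
print(bin_pressure(1005.0))                              # 2
print(local_hour(0, -8))                                 # 16
```

Each function hides decisions that affect the model, such as what counts as a token, which currency formats are recognized, and whether a reading exactly on a bucket edge rounds up or down.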
There are several reasons why feature generation is such a hotspot for vulnerabilities. The first is that errors in feature generation are often not visible in aggregate model performance metrics, such as accuracy on held out test data. For example, if we have a bug in the way that our temperature sensor readings are binned, it may reduce accuracy by a small amount, but our system may also learn to compensate for this bug by relying more on other features. It is surprisingly common for bugs to be found in feature generation code that has been running in production, undetected, for months or years.
A second source of feature generation errors occurs when the same logical feature is computed in different ways at training time and serving time. For example, if our model relies on a localized time of day for an on-device model, that may be computed via a global batch processing job when training data is computed, but it may be queried directly from the device at serving time. If there is a bug in the serving path, this can cause errors in prediction that are difficult to detect due to the lack of ground truth validation labels. We cover a set of monitoring best practices for this important case in <CHAPTER X>.
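One way to look for this kind of mismatch, sketched here under assumptions, is to log the feature value used at serving time for a sample of requests and compare it with the value the batch job later computes for the same examples. The function name, the dictionary-based inputs keyed by example id, and the tolerance are hypothetical.

```python
def skew_rate(training_values, serving_values, tolerance=1e-6):
    """Fraction of shared example ids whose feature value differs by path.

    Both arguments map a (hypothetical) example id to a numeric feature
    value computed by the training path and the serving path respectively.
    """
    shared_ids = training_values.keys() & serving_values.keys()
    if not shared_ids:
        return 0.0
    mismatches = sum(
        1
        for example_id in shared_ids
        if abs(training_values[example_id] - serving_values[example_id]) > tolerance
    )
    return mismatches / len(shared_ids)
```

A rate that rises above a small baseline suggests the two code paths have diverged, even when no labels are available to show a drop in accuracy.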
The third major source of feature generation errors is when our feature generators rely on some upstream dependency that becomes buggy or is hit with an outage. For example, if our model for generating yarn purchase predictions depends on a lookup query to another service that reports user reviews and satisfaction ratings, our model would have serious problems if that service suddenly went offline or stopped returning sensible responses. In real world systems, our upstream dependencies often have upstream dependencies of their own, and we are indeed vulnerable to the full stack of them all.
One particularly subtle area where upstream dependencies can cause issues in our models is when the upstream system undergoes an upgrade or bug fix. It may seem strange to say that fixing bugs can cause problems. The principle to remember is better is not better, better is different — and different may be bad.
This is because any change to the distribution of feature values that our model expects to see associated with certain data may cause erroneous behavior. For example, imagine that the temperature sensors we use in a weather prediction model had a bug in the code that reported degrees in Fahrenheit when in fact it was supposed to be reporting degrees in Celsius. Our model would learn that 32 degrees is freezing, and 90 degrees is a hot summer day near Seattle. If some clever engineer notices this bug and fixes the temperature sensor code to send Celsius values instead, then the model would be seeing values of 32 degrees and assuming the world was icy cold when in fact it was hot and sunny.
There are two key defenses for this form of vulnerability. The first is to arrange for a strong level of agreement with the owners of upstream dependencies about being alerted to such changes before they happen. The second is to create monitoring of the feature distributions themselves, and alert on change. This is discussed in more depth in Chapter 9: Running and Monitoring.
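The second defense can be sketched very simply: compare the mean of a recent window of feature values against a baseline window and alert when it moves by more than a few baseline standard deviations. The function name and the threshold of three standard deviations are assumptions for illustration; production monitoring would likely track more than the mean.

```python
from statistics import mean, stdev


def distribution_shift_alert(baseline, current, max_sigma=3.0):
    """Return True if the current mean has moved too far from the baseline.

    The shift is measured in units of the baseline standard deviation.
    max_sigma is a hypothetical threshold that would need tuning.
    """
    baseline_mean = mean(baseline)
    baseline_std = stdev(baseline)
    if baseline_std == 0:
        return mean(current) != baseline_mean
    shift = abs(mean(current) - baseline_mean) / baseline_std
    return shift > max_sigma
```

In the temperature example, a week of summer readings around 84 (Fahrenheit) followed by a week around 29 (Celsius after the fix) would trip this check, even though each individual value looks plausible on its own.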
Researchers and academics tend to focus on the mathematical qualities of an ML model. In ML Ops, we can find value in a different set of questions that will help us understand where our models and systems can go wrong, how we can fix issues when they occur, and how we can preventatively engineer for long term system health.
This question is meant conceptually -- we need to have a full understanding of the source of the training data and what it is supposed to represent. If we are looking for email spam, do we get access to the routing information and can it be manipulated by bad actors? If we are modeling user interactions with yarn products, what order are they displayed in and how does the user move through the page? What important information do we not have access to, and what are the reasons? Are there any policy considerations around data access or storage that we need to take into account, especially around privacy, ethical considerations, or legal or regulatory constraints?
This is the more literal side of the previous question. Is the data stored in one large flat file, or sharded across a datacenter? What access patterns are most common, or most efficient? Are there any aggregations or sampling strategies applied that might reduce cost but lose information? How are privacy considerations enforced? How long is a given piece of data stored and what happens if a user wishes to remove their data from the system? And how do we know that the data that is stored has not been corrupted in some way, that the feeds have not been incomplete, and what sanity checks and verifications can we apply?
Features, the information we extract from raw data to enable easy digestion for ML models, are often added by model developers with a “more is always better” approach. From an ops perspective, we need to maintain a complete understanding of each individual feature, how it is computed, and how the results are verified. This is important because bugs at the feature computation level are arguably the most common source of problems at the system level. At the same time these are often the most difficult to detect by traditional ML verification strategies -- a held out validation set may be impacted by the same feature computation bug. As suggested above, most insidious are issues that occur when features are computed by one code path at training time -- for example, to optimize for memory efficiency -- and at serving time for real deployment are computed by another code path -- for example, to optimize for latency. In such cases, the model’s predictions can go awry, but we may have no ground truth validation data to use to detect this.
Some models are updated very rarely, such as models for automated translation that are trained on huge amounts of data in large batches, and pushed to on-device applications once every few months. Others are updated extremely frequently, such as an email spam filter model that must be kept constantly up to date as spammers evolve and develop new tricks to try and avoid detection. However, it is reasonable to assume that all models will eventually need to be updated, and we will need to have structures in place that ensure a full suite of validation checks before any new version of a model is allowed to go live. We will also need to have clear agreements with the model developers in our organization about who makes judgement calls about sufficient model performance, and how problems in predictive accuracy are to be handled.
Our ML systems are important, but as with many complex data processing systems, they are typically only one part of a larger overall system, service, or application. We must have a strong understanding of how our ML system fits into the larger picture in order to prevent issues and diagnose problems if they arise. We need to know the full set of upstream dependencies that provide data to our model, both at training time and serving time, and know how they might change or fail and how we might be alerted if this happens. Similarly, we need to know all of the downstream consumers of our model’s predictions, so that we can appropriately alert them if our model should experience issues. We also need to know how the model’s predictions impact the end use case, if the model is part of any feedback loops (either direct or indirect), and if there are any cyclic dependencies such as time of day, day of week, or time of year effects. Finally, we need to know how important model qualities like accuracy, freshness, and prediction latency are within the context of the larger system, so that we can ensure these system level requirements are well established and continue to be met by our ML system over time.
Perhaps most importantly, we need to know what happens to the larger system if the ML model fails in any way, or if it gives the worst possible prediction for a given input. This knowledge can help us to define guard rails, fallback strategies, or other safety mechanisms. For example, a stock price prediction model could conceivably cause a hedge fund to go bankrupt within a few milliseconds -- unless specific guard rails were put in place that limited certain kinds of buying actions or amounts.
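A guard rail of this kind can be as simple as a check that sits between the model and the action it drives. The sketch below is purely illustrative: the function name, the idea of an "order size" output, and the limit are assumptions, not a description of any real trading system.

```python
import math


def guarded_order_size(predicted_size, max_order):
    """Clamp a model-proposed order size to a safe range.

    Non-finite outputs (NaN or infinity) are treated as "take no action".
    """
    if not math.isfinite(predicted_size):
        return 0.0
    return max(0.0, min(predicted_size, max_order))
```

The key design choice is that the limit is enforced outside the model, so that it still holds when the model itself is wrong.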
To help ground our introduction to basic models, we will walk through some of the structure of an example production system. We will go through enough detail here so that we can start to see answers to some of the important questions listed above, but we will also dive deeply into specific areas for this example in later chapters.
In our imaginary yarnit.ai store, there are many areas where ML models are applied. One of these is in predicting the likelihood that a user will click on a given yarn product listing. In this setting, well calibrated probability estimates are useful for ranking and ordering different possible products, including skeins of yarn, various knitting needle types, patterns, and other yarn and knitting accessories.
The model used in this setting is a deep learning model that takes as input the following set of features:
These include tokenized words of text, but also specifically identified characteristics such as amount of yarn, size of needles, and product material. Because there are a wide variety of ways that these characteristics are expressed in product descriptions from different manufacturers, each of these characteristics is predicted by a separate component model specially trained to identify that characteristic from product description text. These models are owned by a separate team, and provided to our system via a networked service that has occasional outages.
The raw product image is supplied to the model, after being first normalized to a 32 x 32 pixel square format by squishing the image to fit a square and then averaging pixel values to create the low-resolution approximation. In previous years, most manufacturers provided images that were nearly square, and with the product well centered in the image. More recently some manufacturers have started providing images with much wider “landscape” format that must be squished significantly more to become square, and the product itself is often shown in additional settings rather than on a plain solid color background.
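The normalization described above can be sketched as block averaging over a grayscale image stored as a list of rows. This representation and the function name are assumptions made to keep the example dependency-free; a real pipeline would more likely use an image library and handle color channels.

```python
def squish_to_square(image, size=32):
    """Downsample a 2D grayscale image to size x size by block averaging.

    Rows and columns are divided independently, so a wide image is
    compressed more horizontally than vertically (the "squish").
    """
    height, width = len(image), len(image[0])
    result = []
    for i in range(size):
        row_start = i * height // size
        row_end = max(row_start + 1, (i + 1) * height // size)
        out_row = []
        for j in range(size):
            col_start = j * width // size
            col_end = max(col_start + 1, (j + 1) * width // size)
            block = [
                image[r][c]
                for r in range(row_start, row_end)
                for c in range(col_start, col_end)
            ]
            out_row.append(sum(block) / len(block))
        result.append(out_row)
    return result
```

Because each output pixel averages a block whose shape depends on the input aspect ratio, a shift from square to landscape source images changes the distribution of this feature without any change to the code.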
The user may enter search queries such as “thick yellow acrylic yarn” or “wooden size 8 needles”, or may arrive at a given page by having clicked on various topic headings, navigation bars, or suggestions listed on the previous page.
Because products that appear higher in the listed results are more likely to be viewed and clicked than products listed further down, it is important at training time to have features that show where the product was listed at the time that the data was collected. Note, however, that this introduces a tricky dependency -- we cannot know at serving time what the values of these features are, because the ranking and ordering of the results on the page depends on our model’s output.
The training labels for our model are a straightforward mapping of 1 if the user clicked on the product and 0 if the user did not click on the product. However, there is some subtlety to consider in terms of timing -- users might click on a product several minutes or even a few hours after a result was first returned to them if they happened to get distracted in the middle of a task and return to it later. For this system, we allow for a one hour window.
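The labeling rule above can be sketched as follows, assuming timestamps in seconds and a hypothetical function that is called for each impression along with any click times recorded for it. Returning None for impressions whose window is still open is one possible way to express "label not yet known".

```python
CLICK_WINDOW_SECONDS = 3600  # the one hour window described above


def label_impression(impression_time, click_times, now):
    """Return 1 for a click within the window, 0 for none, None if undecided."""
    deadline = impression_time + CLICK_WINDOW_SECONDS
    if any(impression_time <= t <= deadline for t in click_times):
        return 1
    if now < deadline:
        # The window is still open, so a negative label would be premature.
        return None
    return 0
```

Note that a click arriving after the window closes is treated as a non-click under this rule, which is a deliberate trade-off between label accuracy and freshness.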
Additionally, we have detected that some unscrupulous manufacturers attempt to boost their product listings by clicking on their products repeatedly. Other, more nuanced but equally ill-intentioned manufacturers attempt to lower their competitors’ listings by issuing many queries that place their competitors’ listings near the top without clicking on them. Both of these are attempts at fraud or spam and need to be filtered out before the model is trained on this data. This filtering is done by a large batch processing job that runs every few hours over recent data to look for trends or anomalies, and to avoid complications we simply wait until these jobs have completed before incorporating any new data into our training pipeline. Sometimes, however, these filtering jobs fail and this can introduce significant additional delays into our system.
Our model is often described to executives as “continually updating”, in order to adapt to new trends and new products. However, there are some delays inherent in the system. First, we need to wait an hour to see if a user has really not clicked on a given result. Then we need to wait while upstream processes filter out as much spam or fraud as possible. Finally, the feature extraction jobs that create the training data require batch processing resources, and introduce several more hours of delay. In practice then, our model is updated about 12 hours after a given user has viewed or clicked on a given product. This is done by incorporating batches of data that update the model from its most recent checkpoint.
Fully retraining the model from scratch requires going back to historical data and revisiting that data in order, with the goal of mimicking the same sequence of new data coming in over time, and is done from time to time by model developers when new model types or additional features are added to the system. For more details on this, see <CHAPTER X>.
Fortunately, our online store is quite popular -- we get a few hundred queries per second, which is not shabby for the worldwide knitting products market. For every query, our system needs to quickly score candidate products so that a set of product listings can be returned in the two to three tenths of a second before a user begins to perceive waiting time. Our ML system, then, needs to have low serving latency. To optimize for this, we create a large number of serving replicas in a cloud-based computation platform, so that requests are not queued up unnecessarily. Each of these replicas reloads to get the newest version of the model every few hours. We will talk more about the ins and outs of model serving in <CHAPTER X>.
Because the model is refreshed on the go and serves live, our system has several areas that need to be monitored to ensure that performance in real time continues to be good. Some of these include:
Is the model actually getting updated on a regular basis? Are the new checkpoints being successfully picked up by the serving replicas?
Over time, do the aggregate statistics look roughly in line with recent history? We can look at basic verifiables, such as the number of predictions made per minute and the average prediction values. We can also monitor whether the number of clicks that we predict over time matches the number of clicks we later actually see.
Our input features are the lifeblood of our system. We monitor basic aggregate statistics on each input feature to ensure that these remain stable over time. Sudden changes in the values for a given feature may indicate some upstream outage or some other issue that needs to be managed.
Of course, this is just a starter set -- we will go into significantly more details on options for monitoring in Chapter 9: Running and Monitoring.
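The check on predicted versus observed clicks mentioned above can be sketched as a simple ratio over a time window. The function name and the handling of windows with no observed clicks are assumptions.

```python
def calibration_ratio(predicted_probabilities, observed_clicks):
    """Ratio of expected clicks (sum of predictions) to observed clicks.

    A value near 1.0 suggests the model is well calibrated in aggregate.
    """
    expected = sum(predicted_probabilities)
    actual = sum(observed_clicks)
    if actual == 0:
        # With no observed clicks, any positive expectation is over-prediction.
        return float("inf") if expected > 0 else 1.0
    return expected / actual
```

For example, four impressions each scored at 0.5 with two observed clicks give a ratio of 1.0, while the same impressions scored at 0.9 give a ratio of about 1.8, a sign of systematic over-prediction.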
It is useful to think through worst case scenarios so that we can be prepared. Each of these tends to be related to the overall product needs and requirements, because in the end it is the impact to our overall product ecosystem that really matters.
One bad thing that could happen is that our model never returns values. Maybe the serving replicas are overloaded or are all down for some reason. If the system hangs indefinitely, users will never get any results. A basic timeout and a reasonable fallback strategy will help a lot here. One fallback strategy can be to use a much cheaper, simpler model in case of timeouts. Another might be to precompute some predictions for the most popular items and use these as fallback predictions.
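A timeout with a precomputed fallback might be sketched as follows. The function and parameter names, the thread pool, and the default timeout are assumptions for illustration; a real serving stack would usually enforce deadlines at the RPC layer instead.

```python
from concurrent.futures import ThreadPoolExecutor

# A shared pool, so a slow call does not block the caller that timed out.
_executor = ThreadPoolExecutor(max_workers=4)


def score_with_fallback(score_fn, product_id, fallback_scores,
                        timeout_seconds=0.2, default_score=0.0):
    """Score a product, falling back to a precomputed score on failure.

    score_fn is a hypothetical callable that queries the live model.
    fallback_scores maps popular product ids to precomputed predictions.
    """
    future = _executor.submit(score_fn, product_id)
    try:
        return future.result(timeout=timeout_seconds)
    except Exception:
        # Covers both timeouts and errors raised by the scoring call.
        future.cancel()
        return fallback_scores.get(product_id, default_score)
```

Treating every exception as a reason to fall back keeps the user-facing path alive, but it also hides failures, so a fallback like this should be paired with a counter or alert on how often it fires.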
If our model were to score zero for all products, none would be shown to users. This would of course indicate some issue with the model, so we will need to monitor average predictions and fall back to another system if things go awry.
Imagine that our model is corrupted, or an input feature signal becomes unstable. In this case, we might get arbitrary or random predictions, which would confusingly show random products to users, creating a poor user experience. To counter this, one approach might be to monitor both the aggregate variance of predictions and the averages. This can happen for just a subset of the data as well, so we may need to monitor predictions for relevant subsets separately.
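Monitoring the mean and variance of predictions per subset can be sketched as below. The input format, a sequence of (subset name, prediction) pairs, and the choice of subsets such as product category are assumptions.

```python
from collections import defaultdict
from statistics import mean, pvariance


def prediction_stats_by_subset(records):
    """Return {subset: (mean, variance)} for (subset, prediction) pairs."""
    grouped = defaultdict(list)
    for subset, prediction in records:
        grouped[subset].append(prediction)
    return {
        subset: (mean(values), pvariance(values))
        for subset, values in grouped.items()
    }
```

Comparing these per-subset statistics against recent history can surface a problem that affects only one category and would be averaged away in the global numbers.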
Imagine that a few products happen to get very high predictions and everything else gets low predictions. Naively, this might create a setting in which those other products, or new products over time, never get a chance to be shown to users and thus never get click information that would help them rise in the rankings. A small amount of randomization can be helpful in these situations to ensure that the model gets some amount of exploration data, which will allow the model to learn about products that had not previously been shown. This form of exploration data is also useful for monitoring, as it allows us to reality check the model’s predictions and ensure that when it says a product is unlikely to be clicked that this holds true in reality as well.
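One simple form of such randomization, sketched here under assumptions, is to occasionally promote a randomly chosen lower-ranked product to the top slot. The function name, the exploration probability, and the choice to promote only a single item are illustrative and not a description of how any particular system does this.

```python
import random


def rank_with_exploration(scores, explore_probability=0.05, rng=None):
    """Rank product ids by score, occasionally promoting a random one.

    scores maps product id to predicted click probability. With probability
    explore_probability, one product other than the top result is moved to
    the first position so that it can collect click data.
    """
    rng = rng or random.Random()
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) > 1 and rng.random() < explore_probability:
        promoted = rng.choice(ranked[1:])
        ranked.remove(promoted)
        ranked.insert(0, promoted)
    return ranked
```

Logging which impressions were produced by exploration lets us later compare the model's predictions for those products against what users actually did.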
There are clearly many other things that might go wrong -- and fortunately ML Ops folks tend to have strong and active imaginations in this regard. Thinking through scenarios in this way allows us to prepare our systems in a robust way and ensure that any failures are not catastrophic and can be quickly addressed.
In this chapter, we have gone through some of the basics of ML models and how they fit into the overall systems that rely on ML. Of course, we have only begun to scratch the surface. In later chapters, we will go into more depth on several of the topic areas that we touched on.
Here are a few things that we hope you take away:
Model behavior is determined by data, rather than by a formal specification.
If our data changes, our model behavior will change.
Features and training labels are critical inputs to our models. We should take the time to understand and verify every input.
ML models may be black boxes, but we can still monitor and assess their behavior.
Disasters do happen, but their impact can be minimized with careful forethought and planning.
ML Ops folks need to have strong working relationships and agreements with model developers in their organization.
1 Many ML production engineers or SREs do not need to learn the full details of how Neural Networks, Random Forests or Gradient Boosted Decision Trees work (although doing so is not as scary or difficult as it seems at first). Reliability engineers working with these systems do, however, need to learn the systems requirements and typical performance of these systems. For this reason we cover the high level structure here.
2 The evaluation of model quality or performance is a complex topic in its own right that will be dealt with in Chapter 11: Evaluation and Model Quality.
3 OpenAI published a straightforward description of this phenomenon along with helpful references to the source papers at: https://openai.com/blog/deep-double-descent/ .