In this chapter, we will discuss the ensemble model serving pattern. In the ensemble pattern, we combine the output from multiple models before serving a response to the client. Combining responses from multiple sources is needed in many scenarios – for example, using separate models to extract information from the audio and visual streams of a video file, and then combining that information to generate the final inference about the video. We can also combine the output from multiple similar models to make inferences with higher confidence. We will discuss some of these cases in this chapter. We will also explore a dummy end-to-end example of how we can combine multiple models to generate the final response.
At a high level, we are going to cover the following main topics in this chapter:
In this chapter, we will mostly use the same libraries that we have used in previous chapters. You should have Postman or another REST API client installed to be able to send API calls and see the response. All the code for this chapter is provided at this link: https://github.com/PacktPublishing/Machine-Learning-Model-Serving-Patterns-and-Best-Practices/tree/main/Chapter%2010.
If a ModuleNotFoundError appears while trying to import any library, you should install the missing module using the pip3 install <module_name> command.
In this section, we will discuss the ensemble pattern of serving models and the different types of ensembles that can be used to serve a model.
In the ensemble pattern of serving models, more than one model is served together. In an ensemble pattern, an inference decision is made by combining the inferences from all the models in the ensemble.
The final response, Y, from the input, X, will be generated as a combined inference from the models M1, M2, ..., Mn, as shown in the following equation:

Y = f(M1(X), M2(X), ..., Mn(X))

In this equation, Y is the response and f is the combination function that combines the responses from all the models. M1, M2, ..., Mn are different models and X is the input passed to the models.
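The combination function can be sketched generically. In the following sketch, `models` stands in for the models M1 through Mn and `combine` for the combination function; the mean combiner and the dummy models are illustrative assumptions, not part of any real serving system:

```python
# A generic sketch of the ensemble combination: `models` stands in for
# M1..Mn and `combine` for the combination function f.
def ensemble_predict(models, combine, X):
    return combine([model(X) for model in models])

def mean(responses):
    return sum(responses) / len(responses)

dummy_models = [lambda x: x + 1, lambda x: x + 3]  # two dummy "models"
print(ensemble_predict(dummy_models, mean, 10))    # 12.0
```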
We can ensemble multiple models for various scenarios. The first four of these different types are introduced in this article: https://www.anyscale.com/blog/serving-ml-models-in-production-common-patterns. The following scenarios are examples of where the ensemble pattern can be of great use:
To learn more about staged rollouts, please follow this link: https://developerexperience.io/practices/staged-rollout. Mirroring production traffic to a non-production model, while the client response still comes from the production model, is known as traffic shadowing in software engineering.
To learn more about traffic shadowing, please follow this link: https://www.getambassador.io/docs/edge-stack/latest/topics/using/shadowing. This traffic shadowing pattern in a production environment helps to provide nearly zero timeouts of the production application and test the updated model sufficiently before releasing it to production.
In this section, we have introduced you to the ensemble pattern and discussed different kinds of ensemble approaches that can be used. In the next section, we will discuss these approaches in the ensemble pattern along with examples.
In this section, we will discuss different types of ensemble approaches along with examples. We have seen that we can combine models in five different scenarios. The following subsections will discuss them one by one.
In the machine learning (ML) deployment life cycle, updating the model happens regularly. For example, we might have to update a model for route planning if new roads and infrastructure are built or removed. Whenever a model needs to be replaced, it might be risky to replace the current model directly. If for some reason, the new model performs poorly compared to the previous model, then it might cause critical business problems and loss of trust. For example, let’s imagine we have updated a model with a new version tag, V2, that predicts a stock price. The V1 model version was predicting stock prices with an MSE of 10.0. Although during training the V2 model was performing very well, in production, we noticed that the V2 model was giving an MSE of 20.0. Therefore, if we directly deployed model V2 in replacement of V1, we could lose customer trust. By following the ensemble process of updating the model, we can avoid this risk.
Therefore, in this case, we keep both the old model and the new model in the production system. The models perform the following operations for a certain period that we can call the evaluation period:

- Both models make a prediction for every incoming request.
- Only the current (old) model's prediction is returned to the client.
- Both predictions are logged so that they can be compared later.
The results from the new and old models might match in most cases. However, for some inputs, the response might not match. In that case, we have to manually verify which model gave the correct output and then, finally, compute which model showed better accuracy in the differing responses. For example, let’s look at the dummy data shown in Figure 10.1:
| Actual label | Prediction by old model | Prediction by new model |
| --- | --- | --- |
| Class A | Class A | Class A |
| Class B | Class B | Class B |
| Class A | Class A | Class A |
| Class B | Class B | Class B |
| Class A | Class A | Class A |
| Class B | Class B | Class B |
| **Class A** | **Class B** | **Class A** |
| **Class A** | **Class B** | **Class A** |
| **Class A** | **Class B** | **Class A** |
| **Class B** | **Class B** | **Class A** |
Figure 10.1 – Dummy data showing actual labels and predictions from the old model and the new model
In the table in Figure 10.1, the first column shows the actual label, the second column shows the predictions made by the old model, and the third column shows the predictions made by the new model during an evaluation period. The bold rows in the table are the rows where the predictions from the old model and the new model differ. We can see that the old model made only one correct prediction in these four rows while the new model made three correct predictions, so the prediction accuracy of the new model on the differing rows is 3/4 = 75%, and that of the old model is 1/4 = 25%. Therefore, we can decide to use the new model, as its accuracy is satisfactory compared to the old model.
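As a sketch, this accuracy comparison could be computed from an evaluation log such as the following, where the tuples mirror the dummy data in Figure 10.1 and the helper name is hypothetical:

```python
# Hypothetical evaluation log: one (actual, old_prediction, new_prediction)
# tuple per request, mirroring the dummy data in Figure 10.1.
rows = [
    ("A", "A", "A"), ("B", "B", "B"), ("A", "A", "A"), ("B", "B", "B"),
    ("A", "A", "A"), ("B", "B", "B"), ("A", "B", "A"), ("A", "B", "A"),
    ("A", "B", "A"), ("B", "B", "A"),
]

def compare_on_disagreements(rows):
    """Return (old_accuracy, new_accuracy), computed only on the rows
    where the two models disagree."""
    diffs = [(actual, old, new) for actual, old, new in rows if old != new]
    old_correct = sum(1 for actual, old, _ in diffs if old == actual)
    new_correct = sum(1 for actual, _, new in diffs if new == actual)
    return old_correct / len(diffs), new_correct / len(diffs)

print(compare_on_disagreements(rows))  # (0.25, 0.75)
```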
The following code snippet shows a dummy example of using two models as an ensemble while updating:
def load_model(filename):
    print("Loading the model:", filename)

def predict_model_current(X):
    model = load_model("model_update/model_current/model.txt")
    print("Current model is predicting for ", X)
    return "dummy_pred_current"

def predict_model_new(X):
    model = load_model("model_update/model_new/model.txt")
    print("New model is predicting for ", X)
    return "dummy_pred_new"

def predict(evaluation_period, X):
    if evaluation_period:
        pred_current = predict_model_current(X)
        pred_new = predict_model_new(X)
        with open("evaluation_data.csv", "a") as file:
            file.write(f"{pred_current}, {pred_new}\n")
        return pred_current
    else:
        return predict_model_current(X)
In our code, we have not used an actual model; rather, we have stored the dummy text file in the folders meant to save the current and new models. The directory structure of the storage is shown in Figure 10.2. We can see from the figure that we have stored just two text files. You will store two different models here after the training:
Figure 10.2 – Directory structure of storing two parallel models to use as an ensemble during model update
As shown in the code snippet, during the evaluation period, we will get a prediction from both of these models. However, we only return pred_current from the current model to the user. We save both of the predictions to a separate file that will be used for evaluation later on. We write the predictions that we got from the last code snippet to a CSV file using the following code:
with open("evaluation_data.csv", "a") as file:
    file.write(f"{pred_current}, {pred_new}\n")
Then, we can manually move the new model to the current model directory if the evaluation results succeed. Note that in the predict method, we use both models. The first model is used if the condition of the if statement is true; otherwise, we use the other model inside the else block as seen from the predict method code snippet. Predictions from the current model are accessed using the predict_model_current API and predictions from the new model are accessed using the predict_model_new API.
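The manual promotion step could also be scripted along these lines. This is a hedged sketch assuming the directory layout from Figure 10.2; the paths and the helper name are illustrative:

```python
import shutil

# Paths follow the directory layout from Figure 10.2; the helper name
# is illustrative, not part of the chapter's code.
def promote_new_model(current="model_update/model_current/model.txt",
                      new="model_update/model_new/model.txt"):
    # The new model simply overwrites the currently served model file.
    shutil.copyfile(new, current)
```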
This is how ensembles work during a model update. Although during the actual deployment, there will be actual models instead of dummy files and the locations can also be two totally different servers, the concept of model update using the ensemble pattern will remain the same.
In this approach, we aggregate the response from multiple models and send the aggregated response to the users. Aggregation usually happens in the following two ways for regression and classification problems, respectively:

- Averaging: for regression problems, the responses from the models are averaged to produce the final response.
- Majority voting: for classification problems, the class predicted by the highest number of models is selected.
Let’s assume that we have three models, M1, M2, and M3, served in the ensemble pattern to predict the price of a stock. Let’s say the feature set of the stock is X and we need to predict the price with this feature set. The responses from the models are as follows:
Y1 = M1(X)
Y2 = M2(X)
Y3 = M3(X)
Therefore, the final response that will be returned to the user is the following:
Y = (Y1 + Y2 + Y3)/3
Usually, averaging the responses helps provide a more accurate prediction compared to the prediction from a single model. Therefore, the response from the ensemble pattern creates more confidence among the customers.
For example, let’s say the actual price of the stock is $10. The M1, M2, and M3 models made predictions of $8, $12, and $13, respectively.
The average of these three predictions is (8 + 12 + 13)/3 = 11.
We notice that although the predictions made by the models differ by -$2, $2, and $3, respectively, the aggregated response differs by $1, so we got a response that is closer to the actual price compared to the predictions made by the individual models. It’s not necessary for the aggregate response to always perform better than all the individual models, but most of the time, the average prediction will outperform the individual predictions. For example, if the three models make predictions of $11, $10, and $11, respectively, the average response is ~$10.67. Although this is better than the prediction from the first and third models, it is not better than the prediction from the second model.
Besides the aggregation techniques we just defined, there can be many other aggregation techniques for both regression and classification tasks. For example, instead of taking the direct mean, we can use the geometric mean for aggregating the results of multiple models, we can use weighted means to provide different weights to different models, and so on. For classification, we can also use different variants of majority class selection algorithms, weighted majority algorithms, and so on.
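As a sketch of these alternatives, the weighted mean and geometric mean combiners might look as follows. The weights here are made up; in practice they might reflect each model's validation performance:

```python
import math

def weighted_mean(predictions, weights):
    # Weights are illustrative assumptions; they could come from each
    # model's validation performance.
    return sum(p * w for p, w in zip(predictions, weights)) / sum(weights)

def geometric_mean(predictions):
    # Assumes all predictions are positive.
    return math.prod(predictions) ** (1 / len(predictions))

print(weighted_mean([8, 12, 13], [1, 2, 2]))  # 11.6
print(geometric_mean([8, 12, 13]))
```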
In the next sub-section, we will discuss selecting the majority class in the classification problem using the Boyer-Moore algorithm.
For classification problems, we can use the majority voting algorithm to select the class. For example, let’s say we have three classification models: M1, M2, and M3. The models make predictions, as follows, for an input, X:
C1 = M1(X)
C1 = M2(X)
C2 = M3(X)
Now, we can use the majority voting algorithm to find out which class has been predicted by the highest number of models. In the preceding example, the majority class is C1, which is predicted by two out of the three models. One version of the majority algorithm is known as the Boyer-Moore majority voting algorithm, which finds an element that appears more than N/2 times among the N elements of a list or sequence.
For example, if the input array is [1 1 1 3], the Boyer-Moore algorithm will say the majority element is 1, as it appeared more than 4/2 = 2 times. Here, N = 4 is the number of elements in the array.
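A minimal implementation of the Boyer-Moore majority vote might look like this (a sketch; note the verification pass, since the first pass alone can report a false candidate when no majority exists):

```python
def boyer_moore_majority(items):
    """Boyer-Moore majority vote: a single pass with a candidate and a
    counter, then a second pass to verify the candidate truly appears
    more than len(items) / 2 times. Returns None if no majority exists."""
    candidate, count = None, 0
    for item in items:
        if count == 0:
            candidate = item
        count += 1 if item == candidate else -1
    # Verification pass: without it, a false candidate can be reported.
    if items.count(candidate) > len(items) // 2:
        return candidate
    return None

print(boyer_moore_majority([1, 1, 1, 3]))              # 1
print(boyer_moore_majority(["C1", "C1", "C2", "C3"]))  # None
```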
Boyer-Moore
To learn more about the Boyer-Moore algorithm, you can check out the following link: https://www.geeksforgeeks.org/boyer-moore-majority-voting-algorithm/.
In our aggregation strategy, the Boyer-Moore approach will not always work. For example, let’s assume we have six models and they make predictions of [C1 C1 C2 C3 C4 C5]. Here, none of the elements appeared more than 6/2 = 3 times. We still need to provide the prediction as C1, as it appeared the highest number of times. Therefore, we can use a majority voting algorithm that will select the class that has been predicted by the highest number of models. An example of finding the majority class is as follows:
from collections import Counter

x = ['C1', 'C1', 'C1', 'C2', 'C2', 'C3', 'C3']
counts = Counter(x)
print("Counts of different elements", counts)
major_element = counts.most_common(1)[0][0]
print("Major element", major_element)
The output of the preceding program is as follows:
Counts of different elements Counter({'C1': 3, 'C2': 2, 'C3': 2})
Major element C1
In this program, first, we compute the count of different classes in the array and then select the class that has the highest count using the major_element = counts.most_common(1)[0][0] line. Here, 1 is passed as an argument to the most_common(n) method to select the top most common element. This method returns an array of tuples, so we select the first element of the first tuple in the array using the [0][0] indices.
We can serve different models that have been specialized for different problems following the ensemble pattern. For example, we can have a model to detect the names of different fruits. Different fruits have different features and therefore the prediction task of the fruits can be seen as separate problems; we can have separate models to solve each of those problems. We can serve these models together to form a complete fruit detection system.
Based on a feature in the input, we will select the appropriate model. For example, let’s say we want to design an ML system that can detect the class of a fruit and the class of a flower. The features for flower detection and fruit detection will be different. There can be two different models for handling the inference for these two different inputs: one for flowers and the other for fruit. An example of this is shown in Figure 10.3, where we select the model to detect either a flower or fruit based on the input:
Figure 10.3 – Ensemble pattern serving two models with the option to select a particular model
We can aggregate the responses from more than one model for flowers, and more than one model for fruits, which makes serving more complicated. Therefore, instead of a single model for a particular problem, we will have multiple models. The responses will be aggregated before providing an overall response to the client.
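A minimal routing sketch for this model selection step is shown below. The `kind` field and the handler names are assumptions for illustration, not part of the chapter's example:

```python
# Dummy handlers standing in for the two specialized models.
def predict_flower(features):
    return "dummy_flower_class"

def predict_fruit(features):
    return "dummy_fruit_class"

# Route table: an input feature ("kind") selects the appropriate model.
ROUTES = {"flower": predict_flower, "fruit": predict_fruit}

def predict(request):
    """Select the specialized model based on the input type."""
    handler = ROUTES.get(request["kind"])
    if handler is None:
        raise ValueError(f"No model registered for kind={request['kind']!r}")
    return handler(request["features"])

print(predict({"kind": "fruit", "features": [0.2, 0.7]}))  # dummy_fruit_class
```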
ML is acquiring more and more responsibilities as time goes on. We now use ML models to describe pictures and contexts, drive autonomous vehicles, and so on. In many of these scenarios, we might need the responses from multiple models to be combined. For example, let’s say we are describing a painting and we want the following descriptions:
We might create a separate model for each of these cases and then combine the responses to provide an overall summary of the painting.
In this section, we have seen different techniques in the ensemble pattern. In the next section, we will discuss an end-to-end example of serving two regression models together, and then we will conclude the chapter.
In this section, we will create an end-to-end dummy example of serving two regression models together, and then we will combine their responses by averaging them. The models we will use are the following:

- A random forest regressor (RandomForestRegressor from scikit-learn)
- An AdaBoost regressor (AdaBoostRegressor from scikit-learn)
Let’s describe the process step-by-step:
Note
Please keep in mind that the output may be different in your case from the following steps, as you will train the model with some random data generated using the make_regression function.
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
from sklearn.datasets import make_regression
import pickle
X, y = make_regression(n_features=2, random_state=0, shuffle=False, n_samples=20)
model1 = RandomForestRegressor(max_depth=2)
model1.fit(X, y)
print(model1.predict([[0, 0]]))
pickle.dump(model1, open("rfreg.pb", "wb"))
model2 = AdaBoostRegressor(n_estimators=5)
model2.fit(X, y)
pickle.dump(model2, open("adaboostreg.pb", "wb"))
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
import pickle
model1: RandomForestRegressor = pickle.load(open("rfreg.pb", "rb"))
model2: AdaBoostRegressor = pickle.load(open("adaboostreg.pb", "rb"))
def predict_model1(X):
    response = model1.predict(X)
    print("Response from model 1 is", response)
    return response

def predict_model2(X):
    response = model2.predict(X)
    print("Response from model 2 is", response)
    return response

def predict(X):
    response1 = predict_model1(X)
    response2 = predict_model2(X)
    response = (response1 + response2)/2
    print("Final response is ", response)
    return response

predict([[0, 0]])
In this code, we have two methods to get predictions from two different models. Then, we have the predict method, which will be called by the client through an API. We combine the responses from the two models inside this method and print that final response. If we call the predict method with [[0, 0]], we will get the following output in the console:
Response from model 1 is [8.1921427]
Response from model 2 is [-11.31397077]
Final response is [-1.56091404]
As you can see, we get responses from both models and then average them to produce the final response.
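The two-model averaging logic generalizes naturally to any number of regressors. The following sketch wraps the idea in a small class; the `Stub` models stand in for the pickled scikit-learn regressors and are purely illustrative:

```python
import statistics

class AveragingEnsemble:
    """Averages predictions from any number of regressors. Each member
    only needs a predict(X) method (a duck-typed interface that
    scikit-learn regressors also satisfy)."""
    def __init__(self, models):
        self.models = models

    def predict(self, X):
        preds = [m.predict(X) for m in self.models]
        # Average element-wise across models for each input row.
        return [statistics.fmean(col) for col in zip(*preds)]

class Stub:
    """Illustrative stand-in for a trained model: predicts a constant."""
    def __init__(self, value):
        self.value = value

    def predict(self, X):
        return [self.value for _ in X]

ensemble = AveragingEnsemble([Stub(8.0), Stub(-11.0)])
print(ensemble.predict([[0, 0]]))  # [-1.5]
```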
In this section, we have discussed a dummy example to demonstrate how we can serve models in the ensemble pattern. In the next section, we will summarize the chapter and conclude.
In this chapter, we have discussed the ensemble pattern of model serving. We were introduced to the ensemble pattern and different types of approaches to using it.
We have discussed how this pattern can be of use when we need to carefully update a new model, when we need predictions from multiple models to increase the prediction accuracy, when we need an option for multiple models based on different inputs, and when we need to combine the responses from multiple models to produce a final output.
In the next chapter, we will discuss the business logic pattern to serve ML models. We will discuss how while serving an ML model, we might need different business logic, such as user authentication or querying a database.