10

Ensemble Model Serving Pattern

In this chapter, we will discuss the ensemble model serving pattern. In the ensemble pattern, we combine the output from multiple models before serving a response to the client. This combination of responses from multiple sources is needed in many scenarios – for example, extracting information about the audio and the video of a video file using separate models, and then combining that information to generate the final inference about the video. We can also combine the output from multiple similar models to make inferences with higher confidence. We will discuss some of these cases in this chapter. We will also explore a dummy end-to-end example of how we can combine multiple models to generate the final response.

At a high level, we are going to cover the following main topics in this chapter:

  • Introducing the ensemble pattern
  • Using ensemble pattern techniques
  • End-to-end dummy example of serving the model

Technical requirements

In this chapter, we will mostly use the same libraries that we have used in previous chapters. You should have Postman or another REST API client installed to be able to send API calls and see the response. All the code for this chapter is provided at this link: https://github.com/PacktPublishing/Machine-Learning-Model-Serving-Patterns-and-Best-Practices/tree/main/Chapter%2010.

If a ModuleNotFoundError appears while trying to import any library, you should install the module using the pip3 install <module_name> command.

Introducing the ensemble pattern

In this section, we will discuss the ensemble pattern of serving models and the different types of ensembles that can be used to serve a model.

In the ensemble pattern of serving models, more than one model is served together. In an ensemble pattern, an inference decision is made by combining the inferences from all the models in the ensemble.

The final response, Y, from the input, X, will be generated as a combined inference from the models M1, M2, …, Mn, as shown in the following equation:

Y = C(M1(X), M2(X), …, Mn(X))

In this equation, Y is the response and C is the combination function that combines the responses from all the models. M1, M2, …, Mn are the different models and X is the input passed to the models.

We can ensemble multiple models for various scenarios. The first four of these different types are introduced in this article: https://www.anyscale.com/blog/serving-ml-models-in-production-common-patterns. The following scenarios are examples of where the ensemble pattern can be of great use:

  • Model update: When updating a model for a sensitive business scenario where a sudden performance drop cannot be afforded, we can use the ensemble pattern. We can run the old model and the new, updated models together for a certain period. During that period, we will keep providing responses to the customer from the old model and use the responses from the new model to verify its performance compared to the old model. When we are confident about the performance of the new model, we can then replace the old model with the new one. In this case, although the final response is still coming from a single model, more than one model is being used and the responses from the other models are utilized to complete the updated process. This method is widely known as staged rollout in software engineering. In a staged rollout, the updated software is only made available to a small group of users instead of providing the update for all users instantly.

To learn more about staged rollouts, please follow this link: https://developerexperience.io/practices/staged-rollout. Sending a copy of production traffic to a non-production model, while still serving responses to clients from the production model, is known as traffic shadowing in software engineering.

To learn more about traffic shadowing, please follow this link: https://www.getambassador.io/docs/edge-stack/latest/topics/using/shadowing. Using this traffic shadowing pattern in a production environment helps to provide near-zero downtime of the production application while testing the updated model sufficiently before releasing it to production.

  • Aggregation: To provide responses to users with higher confidence, we can aggregate the responses from similar models. For example, for a classification problem, we can take predictions from different models trained using the same training data and then use the majority vote (https://en.wikipedia.org/wiki/Boyer%E2%80%93Moore_majority_vote_algorithm) algorithm to find the majority class predicted by the models. Then, we can provide that majority class as the final prediction response to the client. For regression problems, instead of majority voting, we can take an average of all the predictions from different models and then provide the response to the clients.
  • Model selection: We can serve multiple models and, based on some feature in the input, select a particular prediction model. For example, let’s say we have models for predicting the class of different types of objects, such as flowers or animals. We can have separate models for each of these objects and, based on the input, we can select the appropriate model.
  • Combining different responses: For tasks such as describing objects, we can use multiple models to describe different aspects of the object. For example, let’s say we are describing different features of a house, such as its height and color. We can use multiple models for each of the subanalysis tasks and then combine the responses to provide a full description.
  • Serving degraded response: In this kind of ensemble, two models are usually served in parallel. One model is strong and can provide a more accurate response, but it can time out because it usually involves time-consuming computations. The other model, served in parallel, is simple and can respond very quickly, although its response may be less trustworthy. When a client request arrives, we send the request to both models. If the strong model times out, we return the degraded response from the weaker model. For example, let’s say we have a system for language translation. We might serve a strong model in parallel with a weak model. If the strong model times out when handling a client request, we can still provide a degraded response from the weaker model.
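For illustration, the degraded-response idea can be sketched with two hypothetical prediction functions, strong_predict and weak_predict (these stand in for real models and are not part of the book's code):

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError
import time

def strong_predict(X):
    # Hypothetical accurate but slow model
    time.sleep(2)
    return f"strong prediction for {X}"

def weak_predict(X):
    # Hypothetical fast but less trustworthy model
    return f"weak prediction for {X}"

def predict_with_fallback(X, timeout=0.5):
    with ThreadPoolExecutor(max_workers=2) as pool:
        strong = pool.submit(strong_predict, X)
        weak = pool.submit(weak_predict, X)
        try:
            # Wait for the strong model up to the timeout
            return strong.result(timeout=timeout)
        except TimeoutError:
            # Strong model timed out – serve the degraded response
            return weak.result()

# The strong model exceeds the 0.5 s timeout, so the weak response is served
print(predict_with_fallback("some input"))
```

Raising the timeout above the strong model's latency would make the strong response win instead; the fallback only fires when the deadline is missed.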

In this section, we have introduced you to the ensemble pattern and discussed different kinds of ensemble approaches that can be used. In the next section, we will discuss these approaches in the ensemble pattern along with examples.

Using ensemble pattern techniques

In this section, we will discuss different types of ensemble approaches along with examples. We have seen that we can combine models in five different scenarios. The following subsections will discuss them one by one.

Model update

In the machine learning (ML) deployment life cycle, updating the model happens regularly. For example, we might have to update a model for route planning if new roads and infrastructure are built or removed. Whenever a model needs to be replaced, it might be risky to replace the current model directly. If for some reason, the new model performs poorly compared to the previous model, then it might cause critical business problems and loss of trust. For example, let’s imagine we have updated a model with a new version tag, V2, that predicts a stock price. The V1 model version was predicting stock prices with an MSE of 10.0. Although during training the V2 model was performing very well, in production, we noticed that the V2 model was giving an MSE of 20.0. Therefore, if we directly deployed model V2 in replacement of V1, we could lose customer trust. By following the ensemble process of updating the model, we can avoid this risk.

Therefore, in this case, we keep both the old model and the new model in the production system. The models perform the following operations for a certain period that we can call the evaluation period:

  • Old model: Keeps providing predictions as before
  • New model: Predictions from the new model are compared with the performance of the old model

The results from the new and old models might match in most cases. However, for some inputs, the response might not match. In that case, we have to manually verify which model gave the correct output and then, finally, compute which model showed better accuracy in the differing responses. For example, let’s look at the dummy data shown in Figure 10.1:

Actual label      Prediction by old model      Prediction by new model
Class A           Class A                      Class A
Class B           Class B                      Class B
Class A           Class A                      Class A
Class B           Class B                      Class B
Class A           Class A                      Class A
Class B           Class B                      Class B
Class A           Class B                      Class A
Class A           Class B                      Class A
Class A           Class B                      Class A
Class B           Class B                      Class A

Figure 10.1 – Dummy data showing actual labels and predictions from the old model and the new model

In the table in Figure 10.1, the first column shows the actual label, the second column shows the predictions made by the old model, and the third column shows the predictions made by the new model during an evaluation period. The last four rows of the table are the rows where the predictions from the old model and the new model differ. In these four rows, the old model made only one correct prediction and the new model made three correct predictions. So, on the differing rows, the prediction accuracy of the new model is 3/4 = 75% and the prediction accuracy of the old model is 1/4 = 25%. Therefore, we can decide to use the new model, as its accuracy is satisfactory compared to the old model.
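The accuracy computation above can be reproduced with a few lines of Python, using the four differing rows from Figure 10.1:

```python
# The four rows of Figure 10.1 where the two models disagree:
# (actual label, old model's prediction, new model's prediction)
disagreements = [
    ("Class A", "Class B", "Class A"),
    ("Class A", "Class B", "Class A"),
    ("Class A", "Class B", "Class A"),
    ("Class B", "Class B", "Class A"),
]

old_correct = sum(actual == old for actual, old, _ in disagreements)
new_correct = sum(actual == new for actual, _, new in disagreements)

print(f"Old model: {old_correct}/{len(disagreements)} correct "
      f"({old_correct / len(disagreements):.0%})")  # 1/4 (25%)
print(f"New model: {new_correct}/{len(disagreements)} correct "
      f"({new_correct / len(disagreements):.0%})")  # 3/4 (75%)
```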

The following code snippet shows a dummy example of using two models as an ensemble while updating:

def load_model(filename):
    # Dummy loader – a real system would deserialize a trained model here
    print("Loading the model:", filename)

def predict_model_current(X):
    model = load_model("model_update/model_current/model.txt")
    print("Current model is predicting for ", X)
    return "dummy_pred_current"

def predict_model_new(X):
    model = load_model("model_update/model_new/model.txt")
    print("New model is predicting for ", X)
    return "dummy_pred_new"

def predict(evaluation_period, X):
    if evaluation_period:
        pred_current = predict_model_current(X)
        pred_new = predict_model_new(X)
        # Record both predictions so they can be compared later
        with open("evaluation_data.csv", "a") as file:
            file.write(f"{pred_current}, {pred_new}\n")
        # Only the current model's prediction is returned to the client
        return pred_current
    else:
        return predict_model_current(X)

In our code, we have not used actual models; rather, we have stored dummy text files in the folders meant to hold the current and new models. The directory structure of the storage is shown in Figure 10.2. We can see from the figure that we have stored just two text files. You will store two different models here after training:

Figure 10.2 – Directory structure of storing two parallel models to use as an ensemble during model update

As shown in the code snippet, during the evaluation period, we get a prediction from both of these models. However, we only return pred_current, from the current model, to the user. We save both predictions to a separate file that will be used for evaluation later on. The predictions are written to a CSV file using the following code:

with open("evaluation_data.csv", "a") as file:
    file.write(f"{pred_current}, {pred_new}\n")

Then, if the evaluation results are satisfactory, we can manually move the new model to the current model directory. Note that the predict method uses both models during the evaluation period: when the if condition is true, both models are invoked and their predictions are logged; otherwise, only the current model is used inside the else block. Predictions from the current model are accessed using the predict_model_current function and predictions from the new model are accessed using the predict_model_new function.

This is how ensembles work during a model update. Although in an actual deployment there will be real models instead of dummy files, and the two models may even be hosted on two entirely different servers, the concept of updating a model using the ensemble pattern remains the same.

Aggregation

In this approach, we aggregate the response from multiple models and send the aggregated response to the users. Aggregation usually happens in the following two ways for regression and classification problems, respectively:

  • For regression problems, we take the average of the responses from multiple models and use the average as the prediction. For example, let’s say we have two models to predict the price of a house. One is M1, a RandomForestRegressor model, and the other is M2, an AdaBoostRegressor model. Let’s assume M1 has predicted the price of a house as $100,000 and M2 has predicted the price of the house as $120,000. We aggregate their responses by taking the average and returning ($100,000 + $120,000) / 2 = $110,000 to the clients.
  • For classification problems, we take the majority class that is selected by the models as the prediction. For example, let’s assume five models are classifying handwritten digits. The five models respectively predict [1, 2, 1, 2, 1]. Here, we will take the majority class that has been predicted. We notice that 1 has been predicted by three models and 2 has been predicted by two models. Therefore, we will return 1 as the final predicted class to the client.

Let’s assume that we have three models, M1, M2, and M3, served in the ensemble pattern to predict the price of a stock. Let’s say the feature set of the stock is X and we need to predict the price with this feature set. The responses from the models are as follows:

Y1 = M1(X)
Y2 = M2(X)
Y3 = M3(X)

Therefore, the final response that will be returned to the user is the following:

Y = (Y1 + Y2 + Y3)/3

Usually, averaging the responses helps provide a more accurate prediction compared to the prediction from a single model. Therefore, the response from the ensemble pattern creates more confidence among the customers.

For example, let’s say the actual price of the stock is $10. The M1, M2, and M3 models made predictions of $8, $12, and $13, respectively.

The average of these three predictions is (8 + 12 + 13)/3 = 11.

We notice that although the predictions made by the models differ by -$2, $2, and $3, respectively, the aggregated response differs by $1, so we got a response that is closer to the actual price compared to the predictions made by the individual models. It’s not necessary for the aggregate response to always perform better than all the individual models, but most of the time, the average prediction will outperform the individual predictions. For example, if the three models make predictions of $11, $10, and $11, respectively, the average response is ~$10.67. Although this is better than the prediction from the first and third models, it is not better than the prediction from the second model.
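The stock example above can be checked with a couple of lines of Python:

```python
def aggregate(preds):
    # Average the predictions from the individual models
    return sum(preds) / len(preds)

actual = 10
preds = [8, 12, 13]   # predictions from M1, M2, and M3
avg = aggregate(preds)

print(avg)                           # 11.0 – differs from the actual price by 1
print([p - actual for p in preds])   # [-2, 2, 3] – individual model errors
```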

Besides the aggregation techniques we just defined, there can be many other aggregation techniques for both regression and classification tasks. For example, instead of taking the direct mean, we can use the geometric mean for aggregating the results of multiple models, we can use weighted means to provide different weights to different models, and so on. For classification, we can also use different variants of majority class selection algorithms, weighted majority algorithms, and so on.
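As a sketch of one such variant, a weighted mean gives more influence to models we trust more (the weights here are arbitrary and purely illustrative):

```python
def weighted_mean(preds, weights):
    # Each prediction contributes in proportion to its weight
    return sum(p * w for p, w in zip(preds, weights)) / sum(weights)

# Give the first model twice the weight of the others
print(weighted_mean([8, 12, 13], [2, 1, 1]))  # (16 + 12 + 13) / 4 = 10.25
```

With equal weights, this reduces to the plain average used earlier.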

In the next sub-section, we will discuss selecting the majority class in the classification problem using the Boyer-Moore algorithm.

Selecting the majority class

For classification problems, we can use the majority voting algorithm to select the class. For example, let’s say we have three classification models: M1, M2, and M3. The models make predictions, as follows, for an input, X:

C1 = M1(X)
C1 = M2(X)
C2 = M3(X)

Now, we can use the majority voting algorithm to find out which class has been predicted by the highest number of models. In the preceding example, the majority class is C1, which is predicted by two out of the three models. One version of the majority algorithm is known as the Boyer-Moore majority voting algorithm, which finds an element that appears more than N/2 times among the N elements in a list or sequence, if such an element exists.

For example, if the input array is [1 1 1 3], the Boyer-Moore algorithm will say the majority element is 1, as it appeared more than 4/2 = 2 times. Here, N = 4 is the number of elements in the array.

Boyer-Moore

To learn more about the Boyer-Moore algorithm, you can check out the following link: https://www.geeksforgeeks.org/boyer-moore-majority-voting-algorithm/.
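A minimal implementation of the Boyer-Moore majority voting algorithm looks as follows (this sketch returns None when no element appears more than N/2 times):

```python
def boyer_moore_majority(seq):
    # First pass: find a candidate using the voting trick
    candidate, count = None, 0
    for item in seq:
        if count == 0:
            candidate, count = item, 1
        elif item == candidate:
            count += 1
        else:
            count -= 1
    # Second pass: verify the candidate is a true majority
    if seq.count(candidate) > len(seq) // 2:
        return candidate
    return None

print(boyer_moore_majority([1, 1, 1, 3]))                          # 1
print(boyer_moore_majority(['C1', 'C1', 'C2', 'C3', 'C4', 'C5']))  # None
```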

In our aggregation strategy, the Boyer-Moore approach will not always work. For example, let’s assume we have six models and they make predictions of [C1 C1 C2 C3 C4 C5]. Here, none of the elements appeared more than 6/2 = 3 times. We still need to provide the prediction as C1, as it appeared the highest number of times. Therefore, we can use a majority voting algorithm that will select the class that has been predicted by the highest number of models. An example of finding the majority class is as follows:

from collections import Counter
x = ['C1', 'C1', 'C1', 'C2', 'C2', 'C3', 'C3']
counts = Counter(x)
print("Counts of different elements", counts)
major_element = counts.most_common(1)[0][0]
print("Major element", major_element)

The output of the preceding program is as follows:

Counts of different elements Counter({'C1': 3, 'C2': 2, 'C3': 2})
Major element C1

In this program, first, we compute the count of different classes in the array and then select the class that has the highest count using the major_element = counts.most_common(1)[0][0] line. Here, 1 is passed as an argument to the most_common(n) method to select the top most common element. This method returns an array of tuples, so we select the first element of the first tuple in the array using the [0][0] indices.

Model selection

We can serve different models that have been specialized for different problems following the ensemble pattern. For example, we can have a model to detect the names of different fruits. Different fruits have different features and therefore the prediction task of the fruits can be seen as separate problems; we can have separate models to solve each of those problems. We can serve these models together to form a complete fruit detection system.

Based on a feature in the input, we will select the appropriate model. For example, let’s say we want to design an ML system that can detect the class of a fruit and the class of a flower. The features for flower detection and fruit detection will be different. There can be two different models for handling the inference for these two different inputs: one for flowers and the other for fruit. An example of this is shown in Figure 10.3, where we select the model to detect either a flower or fruit based on the input:

Figure 10.3 – Ensemble pattern serving two models with the option to select a particular model

We can aggregate the responses from more than one model for flowers, and more than one model for fruits, which makes serving more complicated. Therefore, instead of a single model for a particular problem, we will have multiple models. The responses will be aggregated before providing an overall response to the client.
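A simple way to implement this selection is to dispatch each request to the right model based on an input feature. The following sketch uses hypothetical predict_flower and predict_fruit functions in place of real models:

```python
def predict_flower(features):
    # Hypothetical flower classification model
    return "rose"

def predict_fruit(features):
    # Hypothetical fruit classification model
    return "apple"

# Map an input feature to the appropriate served model
MODELS = {"flower": predict_flower, "fruit": predict_fruit}

def predict(request):
    model = MODELS[request["object_type"]]
    return model(request["features"])

print(predict({"object_type": "fruit", "features": [0.2, 0.7]}))  # apple
```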

Combining responses

ML is acquiring more and more responsibilities as time goes on. We now use ML models to describe pictures and contexts, drive autonomous vehicles, and so on. In many of these scenarios, we might need the responses from multiple models to be combined. For example, let’s say we are describing a painting and we want the following descriptions:

  • Detect what colors are used in the picture
  • Find out the painter’s name from the signature
  • Detect different objects in the picture

We might create a separate model for each of these cases and then combine the responses to provide an overall summary of the painting.
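This combination step can be sketched as follows, with three hypothetical functions standing in for the color, signature, and object detection models:

```python
def detect_colors(image):
    # Hypothetical model detecting the colors used in the picture
    return ["blue", "gold"]

def read_signature(image):
    # Hypothetical model reading the painter's name from the signature
    return "unknown painter"

def detect_objects(image):
    # Hypothetical model detecting objects in the picture
    return ["tree", "river"]

def describe_painting(image):
    # Combine the three responses into a single summary for the client
    return {
        "colors": detect_colors(image),
        "painter": read_signature(image),
        "objects": detect_objects(image),
    }

print(describe_painting("painting.jpg"))
```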

In this section, we have seen different techniques in the ensemble pattern. In the next section, we will discuss an end-to-end example of serving two regression models together, and then we will conclude the chapter.

End-to-end dummy example of serving the model

In this section, we will create an end-to-end dummy example of serving two regression models together and combining their responses by averaging them. The models we will use are the following:

  • The RandomForestRegressor model
  • The AdaBoostRegressor model

Let’s describe the process step-by-step:

Note

Please keep in mind that the output may be different in your case from the following steps, as you will train the model with some random data generated using the make_regression function.

  1. First, let’s create the two models with some dummy data and save the models using pickle. The following code snippet creates the models and saves the trained models:
    from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
    from sklearn.datasets import make_regression
    import pickle
    X, y = make_regression(n_features=2, random_state=0, shuffle=False, n_samples=20)
    model1 = RandomForestRegressor(max_depth=2)
    model1.fit(X, y)
    print(model1.predict([[0, 0]]))
    pickle.dump(model1, open("rfreg.pb", "wb"))
    model2 = AdaBoostRegressor(n_estimators=5)
    model2.fit(X, y)
    pickle.dump(model2, open("adaboostreg.pb", "wb"))
  2. Now, we create another file that will be used to handle client requests. We will not create a server using Flask, as we have already shown how to do so in earlier chapters. Feel free to make a Flask API for this step. We will use the following code snippet to create a dummy serving:
    from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor
    import pickle
    model1: RandomForestRegressor = pickle.load(open("rfreg.pb", "rb"))
    model2: AdaBoostRegressor = pickle.load(open("adaboostreg.pb", "rb"))
    def predict_model1(X):
         response = model1.predict(X)
         print("Response from model 1 is", response)
         return response
    def predict_model2(X):
         response = model2.predict(X)
         print("Response from model 2 is", response)
         return response
    def predict(X):
         response1 = predict_model1(X)
         response2 = predict_model2(X)
         # Average the two responses to form the final ensemble response
         response = (response1 + response2)/2
         print("Final response is ", response)
         return response
    predict([[0, 0]])

In this code, we have two methods to get predictions from two different models. Then, we have the predict method, which will be called by the client through an API. We combine the responses from the two models inside this method and print that final response. If we call the predict method with [[0, 0]], we will get the following output in the console:

Response from model 1 is [8.1921427]
Response from model 2 is [-11.31397077]
Final response is  [-1.56091404]

As you can see, we get responses from both models and then average them to provide the final response.

In this section, we have discussed a dummy example to demonstrate how we can serve models in the ensemble pattern. In the next section, we will summarize the chapter and conclude.

Summary

In this chapter, we have discussed the ensemble pattern of model serving. We were introduced to the ensemble pattern and different types of approaches to using it.

We have discussed how this pattern can be of use when we need to carefully update a new model, when we need predictions from multiple models to increase the prediction accuracy, when we need an option for multiple models based on different inputs, and when we need to combine the responses from multiple models to produce a final output.

In the next chapter, we will discuss the business logic pattern for serving ML models. We will discuss how, while serving an ML model, we might need different business logic, such as user authentication or querying a database.
