Chapter 4. System Design for Recommending

Now that we have a foundational understanding of how recommendation systems work, let’s take a closer look at what elements are needed and how to design a system that is capable of serving recommendations at industrial scale. Industrial scale in our context will primarily refer to Reasonable Scale (cf. “ML and MLOps at a Reasonable Scale”), i.e., production applications for companies with tens to hundreds of engineers working on the product, not thousands.

In theory, a recommendation system is a collection of math formulae that can take historical data about user-item interactions and return probability estimates for a user-item pair’s affinity. In practice, a recommendation system is 5, 10, maybe 20 software systems, communicating in real time, working with limited information, restricted item availability, and perpetually out-of-sample behavior, all to ensure the user sees something.

This section is heavily influenced by the writing of Karl Higley, Even Oldridge, and Eugene Yan (cf. “System Design for Recommendations and Search” and “Recommender Systems, Not Just Recommender Models”).

Online vs Offline

Figure 4-1. Real Time vs. Batch

Machine learning systems consist of the stuff that you do in advance and the stuff that you do on the fly. This division, between online and offline, is a practical consideration about what information is necessary to perform tasks of various types. To observe and learn large-scale patterns, a system needs access to lots of data; this is the offline component. Performing inference, however, requires only the trained model and the relevant input data; this is the online component, and it is why so many ML system architectures are structured around this split.

Terms you’ll frequently encounter for the two sides of the online-offline paradigm are batch and real-time. A batch process is one that does not require user input, often has a longer expected completion time, and has all the necessary data available simultaneously. Batch processes often include things like training a model on historical data, augmenting one dataset with an additional collection of features, or performing computationally expensive data transformations. Another characteristic of batch processes is that they work with the full relevant dataset, not only an instance of the data sliced by time or otherwise.

A real-time process is one carried out at the time of a request; said differently, it is evaluated during the inference process. Examples include providing a recommendation upon page load, updating the next episode after the user finishes the last, and re-ranking recommendations after one has been marked not interesting. Real-time processes are often resource-constrained because of the need for rapidity, but like many things in this domain, as the world’s computational resources expand, we change the definition of resource-constrained.
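To make the division concrete, here is a minimal sketch of how the two sides typically meet. The popularity-count “model,” file path, and function names are invented for illustration, not a prescribed API: a batch job trains on the full interaction history and persists an artifact, while a per-request handler only loads and applies that artifact.

```python
import pickle
from collections import Counter

MODEL_PATH = "popularity_model.pkl"

# --- Offline (batch): runs on a schedule, sees the full interaction history ---
def train_offline(interactions):
    """Toy 'model': global item-popularity counts over all historical data."""
    counts = Counter(item for _user, item in interactions)
    with open(MODEL_PATH, "wb") as f:
        pickle.dump(counts, f)  # persist the artifact for the online side

# --- Online (real time): runs per request, touches only the saved artifact ---
def recommend(user_id, k=2):
    with open(MODEL_PATH, "rb") as f:
        counts = pickle.load(f)  # in practice, loaded once at startup
    return [item for item, _n in counts.most_common(k)]

train_offline([("u1", "a"), ("u2", "a"), ("u2", "b"), ("u3", "c")])
print(recommend("u1"))  # ['a', 'b'] -- the most popular items
```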

Let’s return to the previously introduced components (collector, ranker, server) and consider their roles in the offline and online paradigms.

Collector

The collector’s role is to know what is in the collection of things that may be recommended, and the necessary features or attributes of those things.

Offline Collector

The offline collector has access to, and is responsible for, the largest datasets. Understanding all user-item interactions, user similarities, item similarities, and feature stores for users and items, and building indices for nearest-neighbor lookup, all fall under the purview of the offline collector. The offline collector needs to be able to access the relevant data extremely fast, and sometimes in large batches. To achieve this, it often implements sublinear search functions or specifically tuned indexing structures, and it may also leverage distributed compute for these transformations.
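As a rough illustration, the sketch below builds a brute-force cosine-similarity lookup over toy item embeddings. A real offline collector would swap this exact search for a sublinear approximate-nearest-neighbor index (e.g., HNSW), but the interface it prepares for the online side is the same.

```python
import numpy as np

# Toy item embeddings: one row per item. A real system would hold millions
# of rows behind an ANN index rather than the exact search below.
rng = np.random.default_rng(0)
item_embeddings = rng.normal(size=(1000, 64)).astype(np.float32)

# Normalize once, offline, so a dot product equals cosine similarity.
index = item_embeddings / np.linalg.norm(item_embeddings, axis=1, keepdims=True)

def nearest_neighbors(query_vec, k=10):
    """Exact top-k by cosine similarity; stands in for an ANN index lookup."""
    q = query_vec / np.linalg.norm(query_vec)
    scores = index @ q
    top = np.argpartition(-scores, k)[:k]  # unordered top-k in O(n)
    return top[np.argsort(-scores[top])]   # order the k winners by score

print(nearest_neighbors(rng.normal(size=64), k=5))
```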

It’s important to remember that the offline collector not only needs access to and knowledge of these datasets, but is also responsible for writing the necessary downstream datasets to be used in real time.

Online Collector

The online collector uses the information indexed and prepared by the offline collector to provide real-time access to the parts of this data necessary for inference. This includes things like searching for nearest neighbors, augmenting an observation with features from a feature store, or knowledge of the full inventory catalog. The online collector will also need to handle recent user behavior; this will become especially important when we see sequential recommenders.

One additional role the online collector may take on is encoding a request. In the context of a search recommender, we wish to take the query and encode it into the search space via an embedding model. For contextual recommenders, we likewise need to encode the context into the latent space via an embedding model.

Note

One popular subcomponent of the collector’s work is an embedding step (cf. Machine Learning Design Patterns by Valliappa Lakshmanan, Sara Robinson, and Michael Munn). On the offline side, the embedding step involves both training the embedding model and constructing the latent space for later use. On the online side, the embedding transformation needs to embed a query into the right space. In this way, the embedding model serves as a transformation that you include as part of your model architecture.
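A minimal sketch of this pattern follows, with a stand-in encode function in place of a trained embedding model: offline, the whole catalog is embedded once into the latent space; online, the incoming query is embedded with the same model and matched against that space.

```python
import hashlib
import numpy as np

def encode(text):
    """Stand-in embedding model; a real one would be a trained network.
    Deterministic per input so the offline and online sides agree."""
    seed = int(hashlib.sha256(text.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).normal(size=64)
    return v / np.linalg.norm(v)

# Offline: embed the entire catalog once and store the latent space.
catalog = ["red dress", "blue jeans", "running shoes"]
latent_space = np.stack([encode(item) for item in catalog])

# Online: embed the incoming query with the *same* model, then search.
query_vec = encode("navy denim")
scores = latent_space @ query_vec
print(catalog[int(np.argmax(scores))])  # nearest catalog item in this toy space
```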

Ranker

The ranker’s role is to take the collection provided by the collector and order some or all of its items according to a model for the context and user. The ranker itself comprises two components: filtering and scoring.

Filtering can be thought of as the coarse inclusion and exclusion of items appropriate for recommendation. It is usually characterized by rapidly cutting away a large number of potential recommendations that we definitely don’t wish to show. A trivial example is not recommending items we know the user has already chosen in the past.
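That trivial example might look like the following sketch, with an in-memory set standing in for however the seen-item history is actually stored:

```python
def filter_seen(candidates, seen_item_ids):
    """Coarse filtering: drop anything the user has already chosen."""
    seen = set(seen_item_ids)
    return [c for c in candidates if c not in seen]

print(filter_seen(["a", "b", "c", "d"], seen_item_ids=["b", "d"]))  # ['a', 'c']
```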

Scoring is the more traditional understanding of ranking: creating an ordering of potential recommendations with respect to the chosen objective function.

Offline Ranker

The offline ranker’s goal is to facilitate filtering and scoring. What differentiates it from the online ranker is how it runs validation, and how its output can be used to build fast data structures that the online ranker can utilize. Additionally, the offline ranker can integrate with a human review process for human-in-the-loop machine learning.

An important technology that will be discussed later is the Bloom filter. A Bloom filter allows the offline ranker to do work in batch so that filtering in real time can happen much faster. An oversimplification of this process would be to use a few features of the request to quickly select among subsets of all possible candidates. If this step can be made fast in terms of computational complexity (striving for something less than quadratic in the number of candidates), then the downstream, more complex algorithms can be made much more performant.
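To make the idea concrete, here is a minimal, hand-rolled Bloom filter sketch; a production system would use a tuned library implementation with carefully sized bit arrays. The offline side adds everything a user has already seen, and the online side filters candidates with a handful of hash lookups per item, regardless of how large the seen set grows.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: set membership with possible false positives
    but no false negatives, using a bit list and k hash functions."""

    def __init__(self, size=10_000, num_hashes=4):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        return all(self.bits[pos] for pos in self._positions(item))

# Offline: record everything the user has already seen.
seen = BloomFilter()
for item_id in ["item_1", "item_7", "item_42"]:
    seen.add(item_id)

# Online: cheap per-item filtering of candidates.
candidates = ["item_1", "item_2", "item_42", "item_99"]
print([c for c in candidates if c not in seen])
# ['item_2', 'item_99'] (false positives are possible in general)
```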

Alongside the filtering step is the scoring step. In the offline component, scoring means training the model that learns how to rank items. As we will see later, learning to rank items to perform best with respect to the objective function is at the heart of recommendation models. Training these models, and preparing the relevant aspects of their output, is part of the batch responsibility of the ranker.
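As a taste of what such training can optimize, here is a sketch of a pairwise (BPR-style) ranking loss, one common learning-to-rank objective rather than any specific model from this book; the scores are toy numbers standing in for a model’s output.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_loss(pos_scores, neg_scores):
    """Pairwise ranking loss: push the scores of items the user chose
    above the scores of sampled items the user did not choose."""
    return -np.mean(np.log(sigmoid(pos_scores - neg_scores)))

pos = np.array([2.1, 0.3, 1.5])   # model scores for observed interactions
neg = np.array([0.4, 0.9, -0.2])  # scores for sampled negative items
print(bpr_loss(pos, neg))          # shrinks as positives outscore negatives
```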

Online Ranker

The online ranker gets a lot of praise but really utilizes the hard work of the other components. The online ranker first does filtering, utilizing the filtering infrastructure built offline, for example an index lookup or a Bloom filter application. After filtering, the number of candidate recommendations has been tamed, and we can move on to the most infamous of the tasks: ranking recommendations.

In the online ranking phase, a feature store is usually accessed to take the candidates and embellish them with the necessary details, and then a scoring and ranking model is applied. Scoring or ranking may happen in several independent dimensions and then be collated into one final ranking. In the multi-objective paradigm, you may have several of these ranks associated with the list of candidates returned by a ranker.
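A minimal sketch of that collation step, with made-up objective names and weights: each objective scores every candidate independently, and a weighted sum produces the single final ranking.

```python
import numpy as np

# Hypothetical per-candidate scores along independent objectives.
scores = {
    "relevance":  np.array([0.9, 0.6, 0.8, 0.4]),
    "freshness":  np.array([0.2, 0.9, 0.5, 0.8]),
    "popularity": np.array([0.7, 0.3, 0.9, 0.6]),
}
weights = {"relevance": 0.6, "freshness": 0.2, "popularity": 0.2}

# Collate the independent dimensions into one final score, then sort.
final = sum(w * scores[name] for name, w in weights.items())
ranking = np.argsort(-final)  # candidate indices, best first
print(ranking, final[ranking])
```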

Server

The server’s role is to take the ordered subset provided by the ranker, ensure that the necessary data schema is satisfied–including essential business logic–and return the requested number of recommendations.

Offline Server

The offline server is responsible for the high-level alignment of the hard requirements for recommendations returned from the system. In addition to establishing and enforcing schema, this can include more nuanced requirements like “never return this pair of pants when also recommending this top”. Often waved off as “business logic”, the offline server is responsible for creating efficient ways to impose top-level priorities on the returned recommendations.
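As an illustration of enforcing a rule like the pants-and-top example, here is a sketch; the item IDs and the rule representation are invented for the example.

```python
# Hypothetical hard rules established offline: pairs that must never be
# recommended together, stored as frozensets for order-free lookup.
FORBIDDEN_PAIRS = {frozenset({"pants_123", "top_456"})}

def enforce_pair_rules(ranked_items):
    """Keep items in ranked order, dropping any item that would complete
    a forbidden pair with an item already accepted."""
    accepted = []
    for item in ranked_items:
        if any(frozenset({item, kept}) in FORBIDDEN_PAIRS for kept in accepted):
            continue
        accepted.append(item)
    return accepted

print(enforce_pair_rules(["pants_123", "shirt_9", "top_456"]))
# ['pants_123', 'shirt_9'] -- top_456 dropped because pants_123 was kept
```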

An additional responsibility of the offline server is handling things like experimentation. There’s a good chance that at some point you’ll want to run online experiments to test out all the amazing recommendation systems you build with this book. The offline server is where you’ll implement the logic necessary to make experimentation decisions and provide their implications in a way the online server can use in real time.
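One common, simple mechanism for such decisions is deterministic hash-based bucketing, sketched below with hypothetical experiment and variant names; it gives every user a stable assignment without storing any state.

```python
import hashlib

def assign_variant(user_id, experiment="ranker_v2",
                   variants=("control", "treatment")):
    """Deterministic bucketing: the same user always lands in the same arm,
    decided offline and applied online with no lookup table."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

print(assign_variant("user_42"))  # stable across calls and processes
```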

Online Server

The online server takes the rules, requirements, and configurations established offline and applies them to the ranked recommendations. A simple example is diversification rules; as we will see later, diversification of recommendations can have a significant impact on the quality of a user’s experience. The online server can read the diversification requirements from the offline server and apply them to the ranked list to return the expected number of diverse recommendations.
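As a sketch of one simple diversification scheme (a greedy re-rank over a toy similarity matrix; real systems often use more principled approaches such as maximal marginal relevance):

```python
import numpy as np

def diversify(candidates, sim, k=3, max_sim=0.8):
    """Greedy re-rank: walk the ranked list, skipping any candidate that is
    too similar (>= max_sim) to one already selected."""
    selected = []
    for c in candidates:
        if all(sim[c, s] < max_sim for s in selected):
            selected.append(c)
        if len(selected) == k:
            break
    return selected

# Toy pairwise similarity matrix for 5 ranked candidates (0 ranked first).
sim = np.array([
    [1.0, 0.9, 0.2, 0.5, 0.1],
    [0.9, 1.0, 0.3, 0.7, 0.2],
    [0.2, 0.3, 1.0, 0.4, 0.9],
    [0.5, 0.7, 0.4, 1.0, 0.3],
    [0.1, 0.2, 0.9, 0.3, 1.0],
])
print(diversify([0, 1, 2, 3, 4], sim, k=3))  # [0, 2, 3]
```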

Summary

It’s important to remember that the online server is the endpoint other systems get a response from. While it’s usually where the message comes from, many of the most complicated components in the system are upstream. Be careful to instrument this system so that, when responses are slow, each component is observable enough that you can identify where the performance degradation is coming from.

Now that we’ve established the framework and understand the functions of the core components, we’ll next discuss the broader aspects of ML systems and the kinds of technologies involved.

In the next section, we’ll get our hands dirty with the aforementioned components and see how one might implement the key aspects. We’ll wrap up by putting it all together into a production-scale recommender using only the content of each item. Let’s go!
