Chapter 18. What’s Next for Recs?

We find ourselves in a transitional time for recommendation systems. This is quite normal for the field, as it is in many segments of the tech industry. A field so closely aligned with business objectives, and so capable of creating business value, tends to search constantly for any opportunity to advance.

In this chapter, we’ll give you a brief introduction to some modern views of where recommendation systems are going. An important thing to consider is that recommendation systems as a science spread both depth-first and breadth-first simultaneously. Looking at the most cutting-edge research in the field means seeing deep optimization in areas that have been under study for decades, alongside areas that seem like pure fantasy for now.

We’ve chosen three areas to focus on in this final chapter. The first we’ve seen a bit of throughout this text: multi-modal recommendations. This area is increasingly important as users turn to platforms to do more things. Recall that multi-modal recommendations are when a user is represented by several latent vectors simultaneously.

Next up are graph-based recommenders. We saw co-occurrence models, which are the simplest graph-based recommendation models, but the idea goes much deeper! Graph Neural Networks are becoming an incredibly powerful mechanism for encoding relations between entities and utilizing these representations, making them useful for recommendations.

Finally, we’ll turn our attention to Large Language Models and Generative AI. During the writing of this book, LLMs have gone from something that a small subset of ML experts understood to something mentioned on HBO comedy broadcasts. While there is a rush to find relevant applications of LLMs to recommendation systems, there are already ways in which the industry has confidence in applying these tools. Also exciting, however, is the application of recommendation systems to LLM apps.

Let’s see what’s coming next!

Multi-modal recommendations

Multi-modal recommenders allow for the concession that users contain multitudes; that is, a single representation of a user’s preferences may not capture the entire story. Consider shopping on a large everything-e-commerce website; you may be:

  1. a dog owner, who frequently needs items for their dog

  2. a parent, who is always updating the closet for the growing baby

  3. a hobbyist race-car driver, who buys the pieces necessary to drive their car on a track

  4. a LEGO investor, who keeps hundreds of sealed boxes of Star Wars sets hidden away in the closet

The methods you’ve learned throughout this book should do well at providing you recommendations for all of the above. However, you may notice in this list a few areas that are conflicting:

  1. if your child is very young, why do you buy LEGO already? Also, doesn’t your dog chew on them?

  2. if your garage is full of LEGO sets, where do you keep all these car parts?

  3. where do you put your dog in that 2-seater Mazdaspeed MX-5 Miata?

You can probably think of other cases where some aspects of what you buy just don’t match up well with others. This leads to a problem of multi-modality: there are several places in the latent space of your interests that coalesce into modes or medoids, not only one.

Let’s return to some of our geometric discussions from before: if you are using nearest neighbors to a user vector, then which of the medoids will take on the most importance?

The way we approach this problem is by multi-modality, or providing several vectors associated to a single user. While a naive approach to scaling to consider all the modes for a user would be to simply increase the dimensionality of the model on the item side (to create more areas in which different types of items can be embedded disjointly), this presents serious challenges at scale in terms of training and memory concerns.

One of the first significant works in this area is by one of this book’s authors, and introduces an extension to matrix factorization to deal with this. The goal is to build multiple latent factors simultaneously, as we did in our other matrix factorization methods, with each factor hopefully taking on the representation of one of the user’s interests.

This is achieved by constructing a tensor, rather than a user-item factorization matrix, in which the third dimension indexes the latent factors for distinct interests. The factorization is generalized to the tensor case, and the WSABIE loss you saw earlier is used to train.
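To make the idea of several interest vectors per user concrete, here is a minimal sketch of scoring at inference time. It assumes the score of an item is the maximum dot product across the user's interest vectors, which is one common choice rather than something this chapter specifies. The embeddings, item names, and function names are all made up for illustration.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def multi_interest_score(user_interests, item_vec):
    """Score an item by its best match across all of a user's interest vectors.
    (Assumption: max over interests; other pooling rules are possible.)"""
    return max(dot(u, item_vec) for u in user_interests)

# Hypothetical 2-d embeddings: one interest near "dog gear", one near "car parts".
user_interests = [[1.0, 0.0], [0.0, 1.0]]
items = {
    "dog_leash": [0.9, 0.1],
    "brake_pads": [0.1, 0.9],
    "generic_gift_card": [0.5, 0.5],
}

# A single averaged user vector sits between both interests...
avg_user = [sum(col) / len(user_interests) for col in zip(*user_interests)]
single_scores = {name: dot(avg_user, vec) for name, vec in items.items()}

# ...while per-interest scoring lets each mode speak for itself.
multi_scores = {name: multi_interest_score(user_interests, vec)
                for name, vec in items.items()}
```

With the averaged vector, the leash, the brake pads, and the gift card all receive essentially the same score, so the user's two distinct interests are blurred together. With per-interest scoring, the leash and brake pads each score higher than the generic item.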

Building on this work, several years later Pinterest released PinnerSage as we saw in the last chapter. This modifies some of the assumptions of the above paper, by not assuming a known number of representations for each user. Additionally, this paper uses graph-based feature representations which we’ll talk more about in the next section. Finally, the last important modification that this method uses is via clustering – it attempts to build the modes via clustering in item space.

The basic approach is to:

  1. fix item embeddings (they call these pins)

  2. cluster user interactions (unsupervised and unspecified in cardinality)

  3. build cluster representations as the medoid of the cluster embeddings

  4. retrieval is based on medoid anchored approximate nearest neighbors search.
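The four steps above can be sketched in a few lines. This is a toy illustration and not the PinnerSage implementation: the greedy threshold clustering is a simple stand-in for whatever clustering a production system would use, brute-force search stands in for approximate nearest neighbors, and every name and embedding value is hypothetical.

```python
import math

def dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cluster_by_threshold(item_ids, emb, threshold):
    """Greedy clustering: join the first cluster whose seed item is within
    `threshold`, else start a new cluster. The cluster count is not fixed."""
    clusters = []
    for item in item_ids:
        for cluster in clusters:
            if dist(emb[item], emb[cluster[0]]) <= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])
    return clusters

def medoid(cluster, emb):
    """The member with the smallest total distance to the other members."""
    return min(cluster, key=lambda i: sum(dist(emb[i], emb[j]) for j in cluster))

def retrieve(anchor, emb, exclude, k):
    """Brute-force nearest neighbors to a medoid (stand-in for ANN search)."""
    candidates = [i for i in emb if i not in exclude]
    return sorted(candidates, key=lambda i: dist(emb[i], emb[anchor]))[:k]

# 1. Fixed item embeddings (values are made up).
emb = {
    "leash": [0.0, 0.0], "dog_bed": [0.1, 0.0], "chew_toy": [0.0, 0.1],
    "kibble": [0.05, 0.05],
    "brake_pads": [5.0, 5.0], "helmet": [5.1, 5.0], "tires": [5.0, 5.2],
}
history = ["leash", "dog_bed", "chew_toy", "brake_pads", "helmet"]

# 2. Cluster the user's interactions; 3. represent each cluster by its medoid.
clusters = cluster_by_threshold(history, emb, threshold=1.0)
medoids = [medoid(c, emb) for c in clusters]

# 4. Retrieve unseen items around each medoid.
recs = {m: retrieve(m, emb, exclude=set(history), k=1) for m in medoids}
```

Because each medoid is an actual item the user interacted with, it can anchor retrieval directly. An averaged centroid, by contrast, may land in an empty region of the embedding space.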

This paper is still considered to be near state of the art for large scale multi-modal recommenders. Some systems take another approach to allow users to more directly modify their “mode”, by selecting the theme of what they’re looking for, while others hope to learn it from a sequence of interactions.

Next up, we’ll look at how higher order relationships between items or users can be explicitly specified.

Graph-based recommenders

Graph Neural Networks (GNNs) are a class of neural networks that use the structural information of data to build deeper representations of that data. They’ve proven especially useful when dealing with relational or networked data.

One moment of disambiguation before we continue: graphs in the sense that we will use them here refer to collections of nodes and edges. These are purely mathematical concepts, but generally one can think of nodes as the objects of interest and edges as the relationships between them. These mathematical objects are useful for distilling down the core of what is necessary for the kind of representation you wish to build. While the objects may seem very simple, there are a variety of ways that we can add just the right amount of complexity to capture more nuance.

In the simplest setups, each node on the graph represents an item or user, and each edge represents a relationship; e.g., a user’s interaction with an item. However, user-to-user and item-to-item networks are extremely powerful extensions as well. Our co-occurrence models are simple graph networks; however, we did not learn a representation from them, and instead used them as our models directly.

Let’s consider a few examples of adding additional structure to a graph to encode ideas:

  1. Directionality, or an ordering on an edge’s vertices, can be added to indicate that the relationship has a strict direction in which one node acts on the other; e.g., a user reads a book, but not the other way around.

  2. Edge labels or more generally, Edge decorations, can be added to communicate features about the relationships; for example two users share account credentials and one of the users is identified as a child.

  3. Multi-edges can allow for relationships to have higher multiplicity, or allow for the same two entities to have multiple relationships; in a graph of outfits with nodes as clothing items, each edge can be another clothing item which makes the other two go well together.

  4. A step further up the level of abstraction may add Hyper-edges which are edges that connect multiple nodes simultaneously; for video scenes, you may do object detection of various classes, and your graph may have nodes for those classes, but understanding not only which pairs of object classes appear but what higher-order combinations appear can be identified with hyper-edges.
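As a rough illustration of how these four kinds of structure might be held in code, the following sketch uses plain dataclasses. The class and field names are hypothetical; a real project would more likely use a graph library.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Edge:
    source: str           # direction: source acts on target
    target: str
    label: str = ""       # edge decoration, e.g. the relationship type
    features: tuple = ()  # further decorations as (key, value) pairs

@dataclass(frozen=True)
class HyperEdge:
    nodes: frozenset      # connects any number of nodes at once
    label: str = ""

@dataclass
class Graph:
    nodes: set = field(default_factory=set)
    edges: list = field(default_factory=list)        # a list permits multi-edges
    hyper_edges: list = field(default_factory=list)

    def add_edge(self, edge):
        self.nodes.update({edge.source, edge.target})
        self.edges.append(edge)

    def add_hyper_edge(self, hyper_edge):
        self.nodes.update(hyper_edge.nodes)
        self.hyper_edges.append(hyper_edge)

g = Graph()
# Directed: the user reads the book, not the other way around.
g.add_edge(Edge("user_1", "book_7", label="reads"))
# Decorated: two users share credentials, and one is flagged as a child.
g.add_edge(Edge("user_1", "user_2", label="shares_account",
                features=(("child", "user_2"),)))
# Multi-edges: two different items each tie the same pair of clothes together.
g.add_edge(Edge("jeans", "blazer", label="white_tee"))
g.add_edge(Edge("jeans", "blazer", label="loafers"))
# Hyper-edge: three object classes detected together in one video scene.
g.add_hyper_edge(HyperEdge(frozenset({"person", "dog", "frisbee"}), label="scene_12"))
```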

Let’s explore the basics of GNNs and how their representations are a bit different.

Neural Message Passing

In GNNs, the objects of interest are assigned to the nodes of our graph. Usually, the main objective is to build powerful representations of the nodes, edges, or both via their relationships.

The fundamental difference between GNNs and traditional neural networks is that during training, we explicitly use operators that transfer data between node representations “along the edges”. This is called message passing. Let’s start with an example to prime the basic idea.

Let nodes represent users, with persona features such as demographics and onboarding survey answers. Let edges be the social network graph: are they friends? And let’s add some decoration to the edges, such as the number of DMs exchanged between them on the platform. If we are a social media company that wants to introduce ad-shopping to our platform, we may start with those persona features, but we’d ideally like to use something about this network of communication. In theory, people who communicate and share content with each other a lot may have similar taste. To capture this, we introduce a concept called a message function, which allows features to be sent from node to node. The message function uses features from each node, and the edge between them, written mathematically as:

$m_{ij}^{(k)} = \phi\left(h_i^{(k)}, h_j^{(k)}, e_{ij}\right)$

for $h_i^{(k)}$ the features at node $i$ and $h_j^{(k)}$ the features at node $j$. The features of the edge are $e_{ij}$, and $\phi$ is some differentiable function. Note that the superscript $(k)$ refers to the layer, as is normal in back-prop notation. Some simple examples include:

  1. $m_{ij}^{(k)} = h_i^{(k)}$, which means “take the features from a neighbor node”

  2. $m_{ij}^{(k)} = \frac{h_i^{(k)}}{c_{ij}}$, which means “average by the number of edges between $i$ and $j$”, where $c_{ij}$ is that edge count

There are many very powerful message passing schemes that use learning from other areas of Machine Learning – like adding an attention mechanism on node features – but this book won’t dive deep into this theory.

The next function we’ll introduce is the aggregation function, which takes the collection of messages as input and aggregates them. The most common choices are very natural:

  1. concatenate all the messages

  2. sum all the messages

  3. average all the messages

  4. take the max of the messages

Finally, we will use the output of the aggregation as part of our update function, which takes node features, aggregated message functions, and then applies additional transformations. If you’ve been wondering “where does this model learn anything?”, the answer is in the update function. The update function usually has a weight matrix associated to it, so as you train this neural network you are learning the weights in the update function. The simplest examples of update functions are to multiply a weight matrix by the vectorized output of your aggregation, and then apply an activation function per vector.
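A single round of message, aggregate, and update can be written out by hand. The sketch below uses the simplest message function (copy the sender's features), mean aggregation, and an update that applies a weight matrix to the concatenation of a node's own features and its aggregated messages, followed by a ReLU. The weight matrix is fixed here purely for illustration, whereas in training it would be learned. The function names and data are assumptions.

```python
def message(h_sender, h_receiver, edge_feats):
    """m_ij = h_i: pass the sender's features along the edge unchanged."""
    return list(h_sender)

def aggregate_mean(messages, dim):
    """Average incoming messages elementwise; zeros if there are none."""
    if not messages:
        return [0.0] * dim
    return [sum(col) / len(messages) for col in zip(*messages)]

def update(h_node, aggregated, weights):
    """ReLU(W @ concat(h_node, aggregated)); W is what training would learn."""
    x = h_node + aggregated
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

def message_passing_layer(h, edges, weights):
    """One round of message -> aggregate -> update over directed edges (i, j, feats)."""
    dim = len(next(iter(h.values())))
    inbox = {node: [] for node in h}
    for i, j, feats in edges:
        inbox[j].append(message(h[i], h[j], feats))
    return {node: update(h[node], aggregate_mean(inbox[node], dim), weights)
            for node in h}

# Node features and directed edges (sender, receiver, edge features); all made up.
h = {"ann": [1.0, 0.0], "bob": [0.0, 1.0], "cat": [1.0, 1.0]}
edges = [("bob", "ann", {"dms": 12}), ("cat", "ann", {"dms": 3}),
         ("ann", "bob", {"dms": 12})]
# A fixed 2x4 weight matrix: adds own features to the aggregated messages.
weights = [[1.0, 0.0, 1.0, 0.0],
           [0.0, 1.0, 0.0, 1.0]]
h_next = message_passing_layer(h, edges, weights)
```

Stacking several such layers lets information travel further than one hop, since each round mixes in features from neighbors of neighbors.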

This chain of message-pass, aggregate, update is the core of GNNs, and encompasses a broad capability. They’ve been useful for ML tasks of every kind, including recommendations. Let’s see some direct applications to recommendation systems.

Applications

Let’s revisit some of the high-level ideas that GNNs may touch in the recsys space.

Modeling User-Item Interactions

In other methods we’ve seen, such as matrix factorization, the interactions between users and items are considered, but the complex network among users or items is not exploited. In contrast, GNNs can capture the complex connections in the user-item interaction graph and then use the structure of this graph to make more accurate recommendations.

Thinking back to our message passing, it allowed us to “spread” the information of some nodes (in this case user and items) to their neighbors. An analogy for this would be that as a user interacts more and more with items with specific features, some of those features are imbued onto the user. This may sound similar to latent features, because it is! These are ultimately helping the network build a latent representation from the messages that pass features from items to user. This can be even more powerful than other latent embedding methods, because you explicitly define what the structural relationships are and how they communicate these features.

Feature Learning

GNNs can learn more expressive feature representations of nodes (users or items) in a graph by aggregating feature information from their neighbors, leveraging the connections between nodes. These learned features can provide rich information about users’ preferences or items’ characteristics, which can greatly enhance the performance of recommendation systems.

Previously we talked about how users’ representations can learn from the items they interact with, but items can learn from one another as well. Similar to how item-item collaborative filtering allows items to pick up latent features from shared users, GNNs allow us to add potentially many other direct relationships between items.

Cold-start Problem

Recall our cold-start problem where it’s hard to provide recommendations for new users or items because of the lack of historical interactions. By using the features of nodes and the structure of the graph, GNNs can learn the embeddings for new users or items, potentially alleviating the cold-start problem.

In some of our graphical representations of our user graph, the edges need not only exist between users with lots of prior recommendations. It’s possible to use other user actions to bootstrap some early edges. Structural edges like “share a physical location” or “invited by the same user” or “answers onboarding questions similarly” can be enough to quickly bootstrap several user-user edges, which allow us to warm start recommendations for them.
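One very simple way to picture this warm start is as a single aggregation step with no learned update: give the new user the mean embedding of the existing users they share structural edges with. The sketch below assumes exactly that, and the edge types, names, and values are hypothetical.

```python
def warm_start_embedding(new_user, structural_edges, embeddings):
    """Mean embedding of already-embedded users linked to `new_user`.
    Edges are undirected (user_a, user_b, edge_type) triples."""
    neighbors = []
    for a, b, _edge_type in structural_edges:
        if a == new_user and b in embeddings:
            neighbors.append(b)
        elif b == new_user and a in embeddings:
            neighbors.append(a)
    if not neighbors:
        return None  # still fully cold: no structural signal yet
    dim = len(embeddings[neighbors[0]])
    return [sum(embeddings[n][d] for n in neighbors) / len(neighbors)
            for d in range(dim)]

embeddings = {"ann": [1.0, 0.0], "bob": [0.0, 1.0], "cat": [4.0, 4.0]}
structural_edges = [
    ("dee", "ann", "invited_by_same_user"),
    ("bob", "dee", "shares_physical_location"),
]
dee_vec = warm_start_embedding("dee", structural_edges, embeddings)
```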

Context-Aware Recommendations

GNNs can incorporate contextual information into the recommendation process. For example, in a session-based recommendation, a GNN can model the sequence of items a user has interacted with in a session as a graph, where each item is a node and the sequential order forms edges. The GNN can then learn the dynamic and complex transitions among items to make context-aware recommendations.
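Building the session graph itself is straightforward. In this minimal sketch, which uses made-up item names, nodes are the distinct items in the session and each directed edge counts how often one item was followed immediately by another.

```python
from collections import Counter

def session_graph(session):
    """Nodes are distinct items; directed edges count consecutive transitions."""
    nodes = list(dict.fromkeys(session))        # de-duplicate, keep first-seen order
    edges = Counter(zip(session, session[1:]))  # (from_item, to_item) -> count
    return nodes, edges

session = ["tent", "sleeping_bag", "tent", "sleeping_bag", "stove"]
nodes, edges = session_graph(session)
```

The edge counts can then serve as edge features for the message function.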

These high level ideas should point to the opportunity in graph encoding for recommender problems, but let’s look at two specific applications.

Random Walks

Random walks in Graph Neural Networks (GNNs) refer to methods that use random walks on the user-item interaction graph to learn effective node (i.e., user or item) embeddings. The embeddings are then used to make recommendations. In the context of graphs, a random walk is an iterative process of starting on a particular node, and then stochastically moving to another connected node via a randomized choice.

One popular random walk-based algorithm for network embedding is DeepWalk, which has been adapted and extended in many ways for various tasks, including recommendation systems.

Here’s how a random walk GNN approach might work in a recommendation context:

  1. Random Walks Generation Start by performing random walks on the interaction graph. Starting from each node, you make a series of random steps to other connected nodes. This results in a set of paths or “walks” that represent the relationships between different nodes.

  2. Node Embeddings The sequences of nodes generated by the random walks are treated similarly to sentences in a corpus of text, and each node is treated like a word. Word2Vec or similar language modeling techniques are then used to learn embeddings for the nodes (i.e., vector representations), such that nodes appearing in similar contexts (i.e., in the same walks) have similar embeddings.

  3. Recommendations Once you have learned node embeddings, you can use them to make recommendations. For a given user, you might recommend items that are “close” to that user in the embedding space, according to some distance metric. This can use all of the techniques we’ve previously developed for recommendations from latent space representations.
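The first two steps can be sketched without any ML library. The code below generates uniform random walks and then extracts the (center, context) pairs that a Word2Vec-style skip-gram model would be trained on. The embedding training itself is omitted, and the graph, function names, and hyperparameter values are illustrative assumptions.

```python
import random

def random_walk(adj, start, length, rng):
    """Walk up to `length` nodes, choosing a uniformly random neighbor each step."""
    walk = [start]
    while len(walk) < length:
        neighbors = adj[walk[-1]]
        if not neighbors:
            break
        walk.append(rng.choice(neighbors))
    return walk

def generate_walks(adj, walks_per_node, length, seed=0):
    """Start `walks_per_node` walks from every node, in shuffled order."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        nodes = list(adj)
        rng.shuffle(nodes)
        for node in nodes:
            walks.append(random_walk(adj, node, length, rng))
    return walks

def skipgram_pairs(walks, window):
    """(center, context) pairs: the training examples a skip-gram model consumes."""
    pairs = []
    for walk in walks:
        for i, center in enumerate(walk):
            lo, hi = max(0, i - window), min(len(walk), i + window + 1)
            pairs.extend((center, walk[j]) for j in range(lo, hi) if j != i)
    return pairs

# Undirected user-item interaction graph as adjacency lists (made-up data).
adj = {
    "u1": ["i1", "i2"], "u2": ["i2", "i3"],
    "i1": ["u1"], "i2": ["u1", "u2"], "i3": ["u2"],
}
walks = generate_walks(adj, walks_per_node=2, length=5)
pairs = skipgram_pairs(walks, window=2)
```

Nodes that often land in the same window end up with many shared pairs, which is what pushes their learned embeddings together.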

This approach has some nice properties:

  • It can capture the high-order connections in the graph. Each random walk can explore a part of the graph that’s not directly connected to the starting node.

  • It can help with the sparsity problem in recommender systems because it uses the structure of the graph to learn representations, which requires less interaction data.

  • It naturally attempts to handle cold-start issues. For new users or items with few interactions, their embeddings can be learned from connected nodes.

Nevertheless, there are some challenges with this approach. Random walks can be computationally expensive on large graphs, and it might be difficult to choose appropriate hyperparameters, such as the length of the random walks. Also, this approach may not work as well for dynamic graphs where interactions change over time, since it doesn’t inherently consider temporal information.

This method implicitly assumes that the nodes are homogeneous, so co-embedding them via connections is very natural. While it was not an explicit requirement, the type of sequence embeddings DeepWalk builds tends to structurally assume this. Let’s break this rule to accommodate learning between heterogeneous types in our next architecture example, metapaths.

Metapath and Heterogeneity

Metapath was introduced to improve explainable recommendations and integrate the ideas of knowledge graphs with GNNs.

A metapath is a path in a heterogeneous network (or graph) that connects different types of nodes via different types of relationships. Heterogeneous networks contain different types of nodes and edges, representing various types of objects and interactions. Beyond simply users and items, the node types can be “carts of items” or “viewing sessions” or “channel used for purchase”.

Metapaths can be used in Graph Neural Networks for handling heterogeneous information networks (HINs). These networks provide a more comprehensive representation of the real world. When used in a GNN, a metapath provides a scheme for how information should be aggregated and propagated through the network. It defines the type of paths to be considered when pooling information from a node’s neighborhood.

For example, in a recommender system, you might have a heterogeneous network with users, movies, and genres as node types, and “watches” and “belongs to” as edge types. A metapath could be defined as “User - watches → Movie - belongs to → Genre - belongs to → Movie - watches → User”. This metapath represents a way of connecting two users through the movies they watch and the genres of those movies.
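To see what following this metapath means in practice, here is a small sketch that finds the users reachable from a starting user along User → Movie → Genre → Movie → User. The two dictionaries stand in for the “watches” and “belongs to” edge types, and all of the data is invented.

```python
# Edge type "watches": user -> movies. Edge type "belongs to": movie -> genres.
watches = {"ann": {"alien", "heat"}, "bob": {"the_thing"}, "cat": {"amelie"}}
belongs_to = {
    "alien": {"horror", "scifi"}, "the_thing": {"horror"},
    "heat": {"crime"}, "amelie": {"romance"},
}

def metapath_neighbors(user, watches, belongs_to):
    """Users reachable via User -> Movie -> Genre -> Movie -> User."""
    genres = {g for movie in watches[user] for g in belongs_to[movie]}
    movies = {m for m, movie_genres in belongs_to.items() if movie_genres & genres}
    return {u for u, seen in watches.items() if u != user and seen & movies}

ann_neighbors = metapath_neighbors("ann", watches, belongs_to)
cat_neighbors = metapath_neighbors("cat", watches, belongs_to)
```

Here the first user and the second user share no movie, yet the metapath connects them through a shared genre. A metapath-based GNN would pool messages over exactly these neighbors.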

A popular method that utilizes metapaths is the Heterogeneous Graph Neural Network (Hetero-GNN) and its variants. These models leverage the metapath concept to capture the rich semantics in HINs, enhancing the learning of node representations.

Metapath-based models have shown promising results in various applications, as they allow you to explicitly encode much more abstract relationships into the message passing mechanisms mentioned above.

If higher-order modeling is your thing, buckle up for the last concept we’ll cover in this book: state of the art, and full of high-level abstractions. Language-model-backed agents are at the absolute cutting edge of ML modeling.

LLM Applications

All of the superlatives for LLMs have been used up. For that reason, we’ll just say: Large Language Models are powerful, and have a surprisingly large number of applications.

LLMs are very general models that allow users to interact with them via natural language. Fundamentally, these models are generative (they write text) and auto-regressive (what they write is determined by what came before). Because LLMs can speak conversationally, they’ve been branded as general artificial Agents. It’s very natural to then ask, “can an Agent recommend things for me?” Let’s start by examining how you may use an LLM to make recommendations.

LLM Recommenders

Natural language is a wonderful interface for asking for recommendations. If I want a coworker’s recommendation for lunch, maybe I’ll show up at their desk and say nothing, hoping they’ll remember their latent knowledge of my preferences, identify the time-of-day context, recall the availability of restaurants based on day of week, and keep in mind that yesterday I had a pastrami sandwich.

More effective would be to simply ask, “any suggestions for lunch?”

Like my astute coworker, models may be more effective at providing recommendations if you simply ask them to. This also adds the capability of defining more precisely what kind of recommendation I want. A popular application of LLMs is to ask them for recipes that use a set of ingredients. Thinking through this in the context of the kind of recommenders we’ve built, there are some hurdles to building a recommender of this kind. It probably needs some user modeling, but it’s very dependent on the items specified. This means that there’s a very low signal for each combination of specified items.

An LLM on the other hand, is quite effective at the auto-regressive nature of this task: given a few ingredients, what’s most likely to be included next in the context of a recipe. By generating several items like this, a ranking model can augment this to provide a realistic recommender.

LLM Training

Large generative language models of the type that have exploded in popularity are trained in three stages:

  1. Pre-training for completion

  2. Supervised Fine-Tuning for dialogue

  3. Reinforcement Learning from Human Feedback.

Note that sometimes the latter two steps are combined into what is called Instruct. For an exceptionally deep dive into this topic, see the original InstructGPT paper.

Let’s recall that text-completion tasks are equivalent to training the model to predict the correct word in a sequence after seeing the K previous ones. This may remind you of GloVe from our first Putting it all Together chapter, or how we discussed sequential recommenders.

Next up is Fine-Tuning for dialogue; this step is necessary to teach the model that the “next word or phrase” should sometimes be a response, instead of an extension of the original statement.

During this stage, the data used for this training is in the form of demonstration data, i.e. pairs of statements and responses. Some examples include:

  1. a request, and then a response to that request

  2. a statement, and then a translation of that statement

  3. a long text, and then a summarization of that text.

For recommendations, you can imagine that the first is highly relevant to the task we hope the model to demonstrate.

Finally, we move to the RLHF stage; the goal here is to learn a reward function which we can later use to further optimize our LLM. However, the reward model itself needs to be trained. Interestingly for recommendation systems enthusiasts like yourself, they do this via a ranking dataset.

A large number of tuples, similar to the demonstration data above, provide statements and responses, although instead of only one response, there are several. The responses are ranked by a human labeler, and then for each pair of superior and inferior responses $(x, \text{sup}, \text{inf})$, we evaluate:

  1. $r_{\text{sup}} = \Theta(x, \text{sup})$, the reward model’s score for the superior response

  2. $r_{\text{inf}} = \Theta(x, \text{inf})$, the reward model’s score for the inferior response

The final loss is computed as $-\log\left(\sigma(r_{\text{sup}} - r_{\text{inf}})\right)$, where $\sigma$ is the sigmoid function.
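The pairwise loss is easy to compute directly. In the sketch below, the reward scores are hard-coded stand-ins for what a reward model would output, and the loss is the negative log-sigmoid of the score difference, written in a numerically stable form. The helper names are hypothetical.

```python
import math
from itertools import combinations

def pairwise_reward_loss(r_sup, r_inf):
    """-log(sigmoid(r_sup - r_inf)), computed stably as softplus(-(r_sup - r_inf))."""
    diff = r_sup - r_inf
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))

def ranking_loss(ranked_responses, reward):
    """Mean pairwise loss over every (superior, inferior) pair in a human ranking."""
    # combinations() preserves list order, so the better response comes first.
    pairs = list(combinations(ranked_responses, 2))
    return sum(pairwise_reward_loss(reward[s], reward[i]) for s, i in pairs) / len(pairs)

# Hypothetical reward-model scores for three responses to the same prompt.
reward = {"resp_a": 2.0, "resp_b": 0.5, "resp_c": -1.0}
agree = ranking_loss(["resp_a", "resp_b", "resp_c"], reward)     # labeler matches scores
disagree = ranking_loss(["resp_c", "resp_b", "resp_a"], reward)  # labeler prefers reverse
```

The loss is small when the reward model already scores the preferred response higher, and it grows as the model's scores disagree with the human ranking.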

Equipped with this reward function, the model may be fine-tuned directly against it.

OpenAI summarizes this approach via the following diagram:

INSTRUCT GPT IMAGE: https://cdn.openai.com/instruction-following/draft-20220126f/methods.svg

From this brief overview, we can see that these LLMs are trained to respond to requests – something well suited for a recommender. Let’s see how to augment this.

Instruct Tuning for Recs

In the previous discussion of instruct pairs, we saw that ultimately the training was learning a rank comparison between two responses. This kind of training should feel quite familiar. In the recent TALLRec paper, the authors use a similar setup to teach user preferences to the model.

As the paper mentions, one collects historical interaction items into two groups based on their ratings: user likes and user dislikes. They collect this information into natural language prompts to format a final “Rec Input”:

  1. “User Preference: [item_1, ..., item_n]”

  2. “User Unpreference: [item_1, ..., item_n]”

  3. “Will the user enjoy [item_{n+1}]?”
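Assembling such a prompt from a rating history might look like the following sketch. The exact wording of each line, the “User Unpreference” label for the disliked group, the rating threshold, and the “Yes.”/“No.” target are assumptions made for illustration, not a reproduction of the paper's code.

```python
def build_rec_input(liked, disliked, target):
    """Format liked items, disliked items, and the target item as a text prompt."""
    return "\n".join([
        f"User Preference: {', '.join(liked)}",
        f"User Unpreference: {', '.join(disliked)}",
        f"Will the user enjoy {target}?",
    ])

def build_training_example(history, target, target_rating, like_threshold=3):
    """`history` is a list of (item, rating); ratings above the threshold are likes."""
    liked = [item for item, rating in history if rating > like_threshold]
    disliked = [item for item, rating in history if rating <= like_threshold]
    label = "Yes." if target_rating > like_threshold else "No."
    return {"input": build_rec_input(liked, disliked, target), "output": label}

history = [("item_1", 5), ("item_2", 2), ("item_3", 4)]
example = build_training_example(history, target="item_4", target_rating=5)
```

Each example is then an (input, output) pair of the same shape as the demonstration data used for supervised fine-tuning.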

These follow the same training pattern as InstructGPT from above. The authors achieve dramatically better performance on recommender problems than an LLM without this tuning; however, those untuned models should be considered baselines, as recommendation is not their target task.

LLM Rankers

So far in this chapter, we’ve thought of the LLM as a recommender in totality, but instead, the LLM can be used as simply the ranker. The most trivial approach to this is to simply prompt the LLM with the relevant features of a user, and a list of items, and ask it to suggest the best options.

While naive, variants on this approach have seen somewhat surprising results in very generic settings: “the user wants to watch a scary movie tonight, and isn’t sure which will be the best if he doesn’t like gore: movie-1, movie-2, etc…​”. But we can do better.

Ultimately, as with LTR approaches, we can think in terms of pointwise, pairwise, and listwise ranking. If we wish to use an LLM for pointwise ranking, then we should constrain our prompting and responses to a setting in which these models may be useful. Take, for example, a recommender for scientific papers: a user may wish to write what they’re working on, and have the LLM helpfully suggest relevant papers. While this is a traditional search problem, it is a setting in which our modern tools can bring a lot of utility: LLMs are effective at summarizing and semantic matching, which means that semantically similar results may be found in a large corpus, and then the agent can synthesize those results into a cogent response. The biggest challenge here is hallucination, or suggesting papers that may not exist.

Pairwise and listwise can be thought of similarly: distill the reference data into a shape where the unique capabilities of these LLMs can provide significant assistance.
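For the listwise case, one hedge against hallucination is to accept only answers that match the candidates you supplied. The sketch below builds a listwise prompt and parses a reply, with a hard-coded string standing in for the model's response. No real LLM API is called, and the function names are hypothetical.

```python
import re

def build_listwise_prompt(user_context, candidates):
    numbered = "\n".join(f"{i}. {title}" for i, title in enumerate(candidates, start=1))
    return (
        f"{user_context}\n"
        "Rank these options from best to worst. "
        "Reply with one title per line, using only titles from this list:\n"
        f"{numbered}"
    )

def parse_ranking(response_text, candidates):
    """Keep titles that exactly match a candidate; drop invented or repeated ones."""
    allowed = set(candidates)
    ranking = []
    for line in response_text.splitlines():
        title = re.sub(r"^\s*\d+[.)]\s*", "", line).strip()  # strip "1. " prefixes
        if title in allowed and title not in ranking:
            ranking.append(title)
    return ranking

candidates = ["movie-1", "movie-2", "movie-3"]
prompt = build_listwise_prompt(
    "The user wants a scary movie tonight and doesn't like gore.", candidates
)
# Stand-in for a model reply; "movie-9" was never offered, so it is filtered out.
fake_response = "1. movie-2\n2. movie-9\n3. movie-1\n4. movie-2"
ranking = parse_ranking(fake_response, candidates)
```

Candidates the model omits can be appended afterward in their original retrieval order, so the ranker never shrinks the slate.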

While we’re near the topic of search and retrieval, it’s important to mention one of the ways in which recommendation can help LLM applications: retrieval augmentation.

Recs 4 AI

We’ve seen how Large Language Models can be used to generate recommendations, but how do recommenders improve LLM applications? LLM agents are extremely general in their capabilities, but lack specificity on many tasks. If you ask an agent, “which of the books I read this year were written by non-western authors”, the agent has no chance of success. Fundamentally, this is because the general pretrained models have no idea what books you’ve read this year. To solve this, you’ll want to leverage retrieval augmentation, i.e., “providing relevant information to the model from an existing data store”. The data store may be a SQL database, a lookup table, or a vector database, but ultimately the important component here is that, from your request, you’re able to find and then provide relevant information to an Agent.

One assumption we’ve made here, is that your request is interpretable by your retrieval system. In the above example, you’d like the system to automatically understand the “which of the books I read this year” as an information retrieval task equivalent to something like

SELECT * FROM read_books
WHERE EXTRACT(YEAR FROM finished_date) = EXTRACT(YEAR FROM CURRENT_DATE)

Here I’ve just made up a SQL database, but you can imagine a schema to satisfy this request. Converting from the request to this SQL is now yet another task you need to model; maybe it’s the job of another Agent request.
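As a runnable sketch of this retrieval step, the following uses Python's built-in sqlite3 module with an in-memory table. The schema, the placeholder rows, and the prompt wording are all invented for illustration, and SQLite's strftime is used to extract the year because date functions differ across SQL dialects.

```python
import sqlite3
from datetime import date

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE read_books (title TEXT, author TEXT, finished_date TEXT)")

this_year = date.today().year
conn.executemany(
    "INSERT INTO read_books VALUES (?, ?, ?)",
    [
        ("Book A", "Author A", f"{this_year}-02-14"),
        ("Book B", "Author B", f"{this_year - 1}-11-30"),
    ],
)

# Retrieve only the books finished in the current year.
rows = conn.execute(
    "SELECT title, author FROM read_books WHERE strftime('%Y', finished_date) = ?",
    (str(this_year),),
).fetchall()

# Hand the retrieved rows to the agent as context alongside the original question.
context = "\n".join(f"- {title} by {author}" for title, author in rows)
prompt = (
    f"Books the user finished this year:\n{context}\n\n"
    "Question: which of the books I read this year were written by non-western authors?"
)
```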

In other contexts, you actually want a full-scale recommender to help with the retrieval. If you want users to ask an agent for a movie tonight, while continuing to use your deep understanding of their taste, you may first filter the potential movies by the user’s preferences, and then only send the movies your recommender model thinks are great for them. The Agent can then service the text request from a subset of movies that are already determined to be great.
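In code, this is a two-stage flow: score and shortlist with the recommender, then pass only the shortlist to the agent. The sketch below uses a dot-product score over made-up embeddings as a stand-in for a real recommender, and it stops at building the prompt rather than calling any model.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def shortlist_for_user(user_vec, item_vecs, k):
    """Top-k items by dot-product score (stand-in for a full recommender)."""
    ranked = sorted(item_vecs, key=lambda title: dot(user_vec, item_vecs[title]),
                    reverse=True)
    return ranked[:k]

def build_agent_prompt(request, shortlist):
    options = "\n".join(f"- {title}" for title in shortlist)
    return (
        "Answer the request using only these pre-selected movies:\n"
        f"{options}\n"
        f"Request: {request}"
    )

user_vec = [0.9, 0.1]
item_vecs = {
    "movie-1": [1.0, 0.0], "movie-2": [0.0, 1.0],
    "movie-3": [0.7, 0.3], "movie-4": [0.2, 0.8],
}
shortlist = shortlist_for_user(user_vec, item_vecs, k=2)
agent_prompt = build_agent_prompt("Something fun for tonight?", shortlist)
```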

The intersection of LLMs and recommendation systems is going to dominate much of the conversation in recommendation systems for a while. There’s a lot of low-hanging fruit in bringing the knowledge of recommender systems to this new industry. As Eugene Yan wrote recently:

I think the key challenge, and solution, is getting them [LLMs] the right information at the right time. Having a well-organized document store can help. And by using a hybrid of keyword and semantic search, we can accurately retrieve the context that LLMs need.

Eugene Yan

Summary

The future of recommendation systems is bright, but things will continue to get more complicated. One of the major changes over the last 5 years has been an incredible shift to GPU-based training, and the architectures which can use these GPUs. This is the primary motivation for why this book favors JAX over TensorFlow or Torch.

The methods in this chapter embrace bigger models, more interconnections, and potentially inference on a scale that’s hard to house in most organizations. Ultimately, recommendation problems will always be solved via:

  1. careful problem framing

  2. deeply relevant representations of users and items

  3. thoughtful loss functions which encode the nuances of the task

  4. and great data collection.
