Chapter 15. Bias in Recommendation Systems

We’ve spent much time in this book dissecting how to improve our recommendations, making them more personalized and relevant to an individual user. Along the way you’ve learned that latent relationships between users and user-personas encode important information about shared preferences. Unfortunately, there is a serious downside to all of this: bias.

For the purposes of our discussion, we’ll talk about the two most important kinds of bias for recommendation systems:

  1. overly redundant or self-similar sets of recommendations

  2. stereotypes learned by AI systems.

First, we’ll delve into the crucial element of diversity in recommendation outputs. As critical as it is for a recommendation system to offer relevant choices to users, it is essential to ensure there is a variety of recommendations. Diversity not only safeguards against overspecialization, but it also promotes novel and serendipitous discoveries, enriching the overall user experience.

The balance between relevance and diversity is delicate and can be very tricky. It challenges the algorithm to go beyond merely echoing the past behavior of users and encourages an exploration of new territories, hopefully providing a more holistically positive experience with the content.

This kind of bias is primarily a technical challenge – how does one simultaneously satisfy the two objectives of diverse recommendations and highly relevant ones?

Then, we’ll consider the intrinsic and extrinsic biases in recommendation systems as an often-unintended yet significant consequence of both the underlying algorithms and the data they learn from. Systemic biases in data collection or algorithmic design can result in prejudiced outputs, leading to ethical and fairness issues. Moreover, they may create echo chambers or filter bubbles, curtailing users’ exposure to a broader range of content and inadvertently reinforcing pre-existing beliefs.

At the end of this chapter we will discuss the risks and provide resources to learn more about them. We are not experts in AI fairness and bias, but it is important that all Machine Learning practitioners understand and seriously consider these topics. We aim to provide an introduction and signposts.

Diversification of recommendations

Our first investment in fighting bias is to explicitly target more diversity in our recommendation outputs. We’ll briefly cover two of the many goals you may pursue.

Intra-list diversity attempts to diversify the items within a single recommendation list, ensuring a mix of different types of items. The idea is to minimize similarity between the recommended items to reduce over-specialization and encourage exploration. High intra-list diversity within a set of recommendations increases a user’s exposure to many things they may like; however, the recs for any particular interest will be shallower, reducing recall.
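As a concrete illustration, here is a minimal sketch of how intra-list diversity might be computed from item embeddings, defined here as one minus the average pairwise cosine similarity of the recommended items. The function name and the cosine-based definition are our own choices for illustration; other dissimilarity measures work just as well.

```python
import numpy as np

def intra_list_diversity(item_embeddings: np.ndarray) -> float:
    """Intra-list diversity = 1 - mean pairwise cosine similarity.

    item_embeddings: array of shape (k, d), one row per recommended item.
    """
    # Normalize rows so dot products become cosine similarities.
    norms = np.linalg.norm(item_embeddings, axis=1, keepdims=True)
    unit = item_embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T                      # (k, k) cosine similarity matrix
    k = sims.shape[0]
    # Average over the off-diagonal entries only.
    off_diag = sims[~np.eye(k, dtype=bool)]
    return float(1.0 - off_diag.mean())

# Example: a list of near-duplicate items scores lower than a varied list.
recs = np.random.randn(5, 32)
print(intra_list_diversity(recs))
```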

Serendipitous recommendations are those which are both surprising and interesting to the user. These are often items the user might not have discovered independently, or that are generally far less popular in the system. Serendipity can be introduced into the recommendation process by injecting non-obvious or unexpected choices – even if they have a relatively lower affinity score with the user – to improve overall serendipity. In an ideal world, these serendipitous choices have high affinity relative to other items of their popularity, so they’re the “best of the outside choices”.

Improving diversity

Now that we have our measures of diversity, we can explicitly attempt to improve them. Importantly, by adding diversity metrics as one of our objectives, we will be potentially sacrificing performance on things like Recall or NDCG. It can be useful to think of this as a Pareto problem, or to impose a lower-bound on ranking metric performance that you’ll accept in pursuit of diversity.

Note

A Pareto problem is a problem in which you have two priorities that often trade off with one another. In many areas of Machine Learning, and applied mathematics more generally, you have outcomes that have a natural tension. Diversity is an important example of a Pareto problem in recommendation systems, but not the only one. In the last chapter we briefly saw Global Optimization, which is an extreme case of tradeoffs.

One simple approach to improve with respect to our diversity metrics is via re-ranking – a post-processing step where the initially retrieved recommendation list is reordered to enhance diversity. Various algorithms for re-ranking consider not just the relevance scores but also the dissimilarity among the items in the recommendation list. Re-ranking is a strategy that can operationalize any external loss function, so using it for diversity is a straightforward approach.
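One common way to operationalize this is a greedy, MMR-style re-ranker that repeatedly picks the item with the best trade-off between relevance and similarity to what has already been selected. The sketch below assumes you already have relevance scores and item embeddings for the retrieved candidates; the trade-off weight `lambda_` and the function name are illustrative choices, not a fixed recipe.

```python
import numpy as np

def greedy_diverse_rerank(scores, embeddings, k=10, lambda_=0.7):
    """Greedily re-rank: balance relevance against similarity to already-picked items."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / np.clip(norms, 1e-12, None)
    sims = unit @ unit.T                      # pairwise cosine similarity
    candidates = list(range(len(scores)))
    selected = []
    while candidates and len(selected) < k:
        def mmr_value(i):
            # Penalize an item by its worst-case similarity to the current list.
            max_sim = max(sims[i, j] for j in selected) if selected else 0.0
            return lambda_ * scores[i] - (1 - lambda_) * max_sim
        best = max(candidates, key=mmr_value)
        selected.append(best)
        candidates.remove(best)
    return selected
```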

Another strategy is to break out of the closed loop of recommendation feedback that we discussed in our section on Propensity Weighting. As in multi-armed bandit problems, an explore-exploit tradeoff can choose between exploiting what the system knows the user will like and exploring less certain options that may yield higher rewards. This tradeoff can be used in recommendation systems to ensure diversity by occasionally choosing to explore and recommend less obvious choices. To implement a system like this, one can use affinity as a reward estimate and propensity as an exploitation measure.
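As a toy illustration of that tradeoff, here is a minimal epsilon-greedy sketch: most of the time we exploit the highest-affinity item, and occasionally we explore by sampling inversely to propensity, so under-exposed items get a chance. The epsilon value and the inverse-propensity sampling rule are assumptions for illustration, not a prescribed policy.

```python
import numpy as np

def epsilon_greedy_pick(affinity, propensity, epsilon=0.1, rng=None):
    """Pick one item index: exploit top affinity, or explore low-propensity items."""
    rng = rng if rng is not None else np.random.default_rng()
    if rng.random() < epsilon:
        # Explore: sample inversely proportional to propensity,
        # favoring items the system rarely shows.
        weights = 1.0 / (np.asarray(propensity) + 1e-6)
        probs = weights / weights.sum()
        return int(rng.choice(len(affinity), p=probs))
    # Exploit: recommend the item with the highest estimated affinity.
    return int(np.argmax(affinity))
```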

Instead of using these posterior strategies, an alternative is to incorporate diversity as an objective in the learning process, or to include a diversity regularization term in the loss function. A multi-objective loss that includes pairwise similarity as a regularizer can help train the model to learn diverse sets of recs. We previously saw that kinds of regularization can coach the training process to minimize certain behaviors. One regularization term that can be used explicitly is similarity amongst recommendations; the dot product of each embedding vector in the recommendations with one another can approximate this self-similarity. Let R = (r₁, r₂, ..., rₖ) be the list of embeddings for the recommendations, and consider R as a matrix with each row a recommendation’s embedding. Calculating R’s Gramian, RRᵀ, yields all of our dot-product similarity calculations, and thus we can regularize by this term with appropriate hyperparameter weighting. Note that this differs from our previous Gramian regularization because we’re only considering the recommendations for an individual query in this case.
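A minimal sketch of such a regularizer, written with PyTorch and assuming the model produces an embedding per recommended item for a single query; the function name and the way the penalty is weighted into the loss are illustrative, not the only way to do it.

```python
import torch

def gramian_diversity_penalty(rec_embeddings: torch.Tensor) -> torch.Tensor:
    """Penalize self-similarity among the recommendations for one query.

    rec_embeddings: tensor of shape (k, d), one row per recommended item.
    Returns the mean of the off-diagonal Gramian entries (dot-product similarities).
    """
    gram = rec_embeddings @ rec_embeddings.T          # (k, k) Gramian
    k = gram.shape[0]
    off_diag = gram[~torch.eye(k, dtype=torch.bool)]  # drop self-similarities
    return off_diag.mean()

# Inside a training step, add the penalty to the main ranking loss:
# loss = ranking_loss + diversity_weight * gramian_diversity_penalty(rec_embeddings)
```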

Finally, one can use rankings from multiple domains to boost recommendation diversity. By integrating various ranking measures, the recommendation system can suggest items from outside the user’s “mode”, thus broadening the range of recommendations. There is a vibrant discipline around “multi-modal” recommendations, with the PinnerSage paper from Pinterest being a particularly impressive implementation. A common observation in the multi-modal recs literature is that a single-query retrieval step returns too many recommendations near the user’s query vector, which forces self-similarity in the retrieved list. Multi-modality forces multiple query vectors to be used for each request, building diversity in from the start.

Let’s look at another perspective on item self-similarity, and think of how the pairwise relationships between items can be used to this end.

Diversity as a portfolio optimization problem

Portfolio optimization, a concept borrowed from finance, can be an effective approach to enhance diversity in recommendation systems. The goal here is to create a “portfolio” of recommended items that balances the two key parameters: relevance and diversity.

At its heart, portfolio optimization is about balancing return (in our case, relevance) against risk (in our case, self-similarity, the opposite of diversity). Here’s a basic approach for applying it to recommendation systems:

  1. Formulate an item representation such that the distance in the space is a good measure of similarity. This is in line with our previous discussions about what makes a good latent space.

  2. Calculate pairwise distances between items. This can be done using whatever distance metric your latent space is enriched with. It is important to calculate these pairwise distances across all items retrieved and under consideration for return. Note that how you aggregate these distributions of distances can be nuanced.

  3. Evaluate affinity for the retrieved set. Note that calibrated affinity scores will perform better as they provide a more realistic estimate of return.

  4. Solve the optimization problem. Solving the problem will yield a weight for each item that balances the trade-off between relevance and diversity. Items with higher weights are more valuable in terms of both diversity and relevance, and they should be prioritized in the recommendation list. Mathematically:

Maximize:  wᵀr − λ · wᵀCw

Here, w is a vector representing the weights (i.e., the proportion of each item in the recommendation list), r is the relevance score vector, C is the covariance matrix (which captures the diversity), and λ is a parameter to balance relevance and diversity. The constraint here is that the sum of the weights equals 1.

Remember, the hyperparameter λ trades off between relevance and diversity. This makes it a critical part of this process and may require experimentation or tuning based on the specific needs of your system and its users. This would be straightforward via hyperparameter optimization in one of many packages such as Weights and Biases.
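To make this concrete, here is a minimal sketch of the optimization using SciPy, assuming you already have a relevance vector and an item covariance (or similarity) matrix for the retrieved set. The variable names, the non-negativity bounds, and the choice of solver are our own assumptions, not a prescription.

```python
import numpy as np
from scipy.optimize import minimize

def portfolio_weights(relevance: np.ndarray, cov: np.ndarray, lam: float = 0.5):
    """Maximize w.r - lam * w.C.w subject to non-negative weights summing to 1."""
    k = len(relevance)

    def objective(w):
        # Negate because scipy minimizes.
        return -(w @ relevance - lam * w @ cov @ w)

    constraints = [{"type": "eq", "fun": lambda w: w.sum() - 1.0}]
    bounds = [(0.0, 1.0)] * k
    w0 = np.full(k, 1.0 / k)                  # start from uniform weights
    result = minimize(objective, w0, bounds=bounds, constraints=constraints)
    return result.x

# Items with larger weights are prioritized in the final recommendation list.
```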

Multi-objective functions

Another related approach to diversity is to rank based on a multi-objective loss. Instead of the ranking stage using purely personalization affinity, introducing a second (or more!) ranking term can dramatically improve diversity.

The simplest approach here is something similar to what we learned in the last chapter: hard ranking. A business rule that may apply to diversity is limiting each item category to at most one item. This is the simplest case of multi-objective ranking because sorting by a categorical column and selecting the top item in each group achieves explicit diversity with respect to that covariate; see the short sketch below. Then let’s move on to something more subtle.
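A minimal sketch of that hard rule with pandas, assuming a candidate DataFrame with `category` and `affinity` columns; the column names and toy data are illustrative.

```python
import pandas as pd

candidates = pd.DataFrame({
    "item_id":  [1, 2, 3, 4, 5],
    "category": ["jeans", "jeans", "shirt", "shirt", "shoes"],
    "affinity": [0.91, 0.88, 0.72, 0.65, 0.80],
})

# Keep only the highest-affinity item per category, then rank the survivors.
diverse = (
    candidates.sort_values("affinity", ascending=False)
              .groupby("category", as_index=False)
              .head(1)
              .sort_values("affinity", ascending=False)
)
print(diverse)
```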

In Stitching together spaces for query-based recommendations, one of this book’s authors worked with co-author Ian Horn to implement a multi-objective recommendation system that balanced personalization and relevance for an image-retrieval problem.

The goal was to provide personalized recommendations for clothing that were similar to clothes in an image the user uploaded. This means there are two latent spaces:

  1. The latent space of personalized clothes to a user

  2. The latent space of images of clothing

To solve this problem, we first had to make a decision: what was more important for relevance, personalization or image similarity? Because the product was centered around a photo-upload experience, we chose image similarity. However, we had another thing to consider: each uploaded image contained several pieces of clothing. As is popular in computer vision, we first segmented the uploaded image into several items, and then treated each item as its own query (which we called anchor items). This meant our image-similarity retrieval was “multi-modal”, as we searched with several different query vectors. After we gathered them all, we had to make one final ranking. That ranking was a multi-objective ranking over image similarity and personalization; the scoring function we optimized was:

sᵢ = α × (1 − dᵢ) + (1 − α) × aᵢ

for α a hyperparameter that represents the weighting, dᵢ the image distance, and aᵢ the personalization affinity. We learned α experimentally. The last step was to impose some hard ranking to ensure one recommendation came from each anchor item.
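A minimal sketch of that blended score, assuming the image distances have already been scaled to [0, 1]; the names `alpha`, `image_dist`, and `affinity` are illustrative.

```python
import numpy as np

def blended_score(image_dist: np.ndarray, affinity: np.ndarray, alpha: float = 0.6) -> np.ndarray:
    """s_i = alpha * (1 - d_i) + (1 - alpha) * a_i.

    image_dist: distances to the anchor item, assumed scaled to [0, 1].
    affinity:   personalization scores for the same candidates.
    """
    return alpha * (1.0 - image_dist) + (1.0 - alpha) * affinity
```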

So let’s sum this up:

  1. We used two latent spaces with distances to provide rankings

  2. We did multi-modal retrieval via image segmentation

  3. We retrieved using only one of the rankings

  4. Our final ranking was multi-objective with hard ranking utilizing all of our latent spaces, and business logic

This allowed our recommendations to be diverse in the sense that they achieved relevance in several different areas of the query, corresponding to different items.

Predicate pushdown

You may be happy and comfortable in applying these metrics during serving – after all, that’s the title for this part of the book – but before we move on from this topic, we should talk about an edge case which can have quite disastrous consequences.

Sometimes, when you impose the hard rules from the last chapter, the diversity expectations discussed earlier in this chapter, and a little multi-objective ranking, what you arrive at is… no recommendations. You started by retrieving K items, but after filtering down to sufficiently diverse combinations that also satisfy business rules, there’s simply nothing left. You might say:

“I’ll just retrieve more items, let’s crank up K!” But this has some serious issues: it can really increase latency, it can depress match quality, and it can throw off your ranking model, which is tuned to lower-cardinality sets. A common experience, especially with diversity, is that different modes for the retrieval have vastly different match scores. To take an example from our Fashion Recommender world: all jeans might be a better match than any shirt we have, but if you’re looking for diverse categories of clothes to recommend, no matter how big the K, you’ll potentially be missing out on shirts.

One solution to this problem is called predicate pushdown. Predicate pushdown is an optimization technique used in databases, specifically in the context of data retrieval. The main idea of predicate pushdown is to filter data as early as possible in the data retrieval process, to reduce the amount of data that needs to be processed later in the query execution plan.

For traditional databases, you see predicate pushdown applied to things like “apply my query’s WHERE clause in the database to cut down on I/O”. The database may achieve this by explicitly pulling the relevant columns to check the WHERE clause first, getting the row_ids that pass, and only then executing the rest of the query.

How does this help us in our case? The simple idea is that if your vector store also has features for the vectors, you can include the feature comparisons as part of retrieval. Let’s take an overly simplistic example: assume your items have a categorical feature called color, and for good, diverse recs you want a nice set of at least 3 different colors in your 5 recommendations. To achieve this, you can do a top-K search across each of the different colors in your store (the downside is that your retrieval is C times as large, where C is the number of colors that exist), and then do ranking and diversity on the union of these sets. This has a much higher likelihood of surviving your diversity rule in the eventual recommendations. This is great! We expect that latency is relatively low in retrieval, so this tax of extra retrievals isn’t bad if we know where to look.
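A minimal sketch of that pattern, using a hypothetical `vector_store.search(query, k, filter=...)` interface as a stand-in for whatever filtered search your store offers; the interface, names, and filter syntax are assumptions, not a real API.

```python
def retrieve_with_color_pushdown(vector_store, query_vec, colors, k_per_color=5):
    """Run one filtered top-K search per color and union the results.

    vector_store.search is a hypothetical filtered-ANN call; substitute your
    store's equivalent (e.g., metadata filtering at query time).
    """
    candidates = {}
    for color in colors:
        hits = vector_store.search(query_vec, k=k_per_color, filter={"color": color})
        for item_id, score in hits:
            # Keep the best score if an item somehow appears in multiple searches.
            candidates[item_id] = max(score, candidates.get(item_id, float("-inf")))
    # Downstream ranking and diversity rules operate on this union.
    return sorted(candidates.items(), key=lambda kv: kv[1], reverse=True)
```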

This can be applied on quite complicated predicates if your vector store is set up well for the kinds of things you wish to impose.

Fairness

Fairness in Machine Learning in general is a particularly nuanced subject. We will invite you to consider some other more robust references. These topics are important, and are ill-served by short summaries.

Nudging

Fairness does not need to be only “equal probabilities for all outcomes”; it can be fairness with respect to a specific covariate. “Nudging” via a recommender, i.e. recommending things to emphasize certain behavior or buying patterns, can increase fairness. Consider the work by Karlijn Dinnissen and Christine Bauer from Spotify, which uses nudging to improve gender representation in music recommendations.

Filter Bubbles

Filter Bubbles are a downside of extreme collaborative filtering: a group of users begin liking similar recommendations, the system learns that they should receive similar recommendations, and the feedback loop perpetuates this. For a deep look into not only the concept but also mitigation strategies, consider Mitigating the Filter Bubble While Maintaining Relevance.

High risk

Not all applications of AI are equal in risk. Some domains are particularly harmful when AI systems are poorly guardrailed. For a general overview of the most high-risk circumstances and mitigation, consult Machine Learning for High-Risk Applications.

Trustworthiness

Explainable models are a very popular mitigation strategy for risky applications of AI. While explainability does not solve the problem, it frequently provides a path toward identification and resolution. For a deep dive on this, Practicing Trustworthy Machine Learning provides tools and techniques.

Fairness in recs

Because recommendation systems are so obviously susceptible to issues of AI fairness, much has been written on the topic. Each of the major social media giants has employed teams working in AI Safety. One particular highlight is the Twitter Responsible AI team led by Rumman Chowdhury. You can read about their work here.

Summary

While these techniques provide pathways to enhance diversity, it’s important to remember to strike a balance between diversity and relevance. The exact method or combination of methods used may vary depending on the specific use case, the available data, the intricacies of the user base, and the kind of feedback you’re collecting. As you implement recommendation systems, think about which aspects matter most in your diversity problem.
