Chapter 11. Semantic and personalized search

This chapter covers

  • Making search personalized for individual users
  • Matching documents based on meaning rather than just words
  • Implementing recommendation as a generalization of search

You’re at the end of a long journey. You’ve learned to use search technology to build relevant search applications. But you’re still just scratching the surface. In this final chapter, we look toward the horizon to explore some of the more novel—and experimental—ways to improve your users’ search experience. In particular, we cover two related techniques that can provide better relevance:

  • Personalized search provides search results customized to a user’s particular tastes using knowledge about that user. User information can be gleaned from users’ previous interactions as well as anything they tell us directly.
  • Concept search ranks documents based on concepts extracted from text, not just words. Concept search relies on deep knowledge of the search domain, including jargon and the relations between concepts in that domain.

When used in tandem, these techniques give the search solution an understanding of users’ personal needs as well as the ideas latent in the content.

Building a good personalized search or concept search requires a considerable amount of work. You should treat the methods in this chapter as a conceptual starting point. But please realize that these methods bear some technical risks; they can be difficult to implement, and you often won’t know how these methods will perform until after they’ve been implemented. Nevertheless, for established search applications, these methods are worth careful investigation because of the profound benefits that they may offer. In the discussion that follows, we present several ideas for implementing personalized and concept search. In both cases, we start with relatively simple methods and then outline more sophisticated approaches using machine learning.

In the process of laying out personalized search, we introduce recommendations. You can provide users with personalized content recommendations even before they’ve made a search. In addition, you’ll see that a search engine can be a powerful platform for building a recommendation system. Figure 11.1 shows recommendations side-by-side with search, implemented by a relevance engineer.

Figure 11.1. By incorporating knowledge about the content and the user, search can be extended to tasks such as personalized search and recommendations.

11.1. Personalizing search based on user profiles

Until now, we’ve defined relevance in terms of how well a search result matches a user’s immediate information need. But over time, as you get to know users better, you should be able to incorporate their tastes and preferences into the search application itself. This is known as personalized search.

Throughout this book, we’ve emphasized that, at its core, a search engine is a sophisticated token-matching and document-ranking system. We discussed techniques to ensure that search matches and ranks documents to reflect your notion of relevance. As we move on to personalized search, the fundamental nature of a search engine doesn’t change. No special magic or hidden feature makes personalization possible. Search is still about crafting useful signals, and modeling them with the search engine through queries and analysis. The main difference is that instead of drawing information exclusively from documents, you’ll look to the users themselves as a new source of information.

With this in mind, we turn our attention to the first method of building personalized search: profile-based personalization. With this method, you track knowledge of individual users with profiles. At query time, you refer to the user profile and use its information to boost documents that correspond to the user’s tastes. Figure 11.2 demonstrates profile-based personalization using our previous Star Trek examples.

Figure 11.2. Adding personalization with user profile data, including demographics and preferences

11.1.1. Gathering user profile information

But how do you gather information for your user profiles? Well, if you’re fortunate enough to have an engaged user base, you can create a profile page and wait for users to tell you about themselves. Be sure to provide incentives for users to fill out their profiles. For socially oriented sites, make the profiles public so your users can project their personality through the profile. Allow users to describe themselves in free text. Let them tag their profiles with categories that interest them. For private profiles, help users understand how creating a profile can provide a more personalized experience. For instance, you can directly ask users about their preferences and indicate that this will influence the behavior of the application. You can incentivize profile building with functionality; for example, by letting users bookmark items that they like, or share items with friends.

If you lack a profile page, you can still gather profile information from user interactions. By observing search behavior, underlying themes will reveal themselves over time. Perhaps a user has historically preferred certain brands. Maybe a user’s choices indicate an interest in a particular domain such as photography or video games. By watching a user’s interactions, you might identify demographics such as age, gender, and income level. The way that users filter searches often reveals how they make purchasing decisions. For instance, if users narrow search results by product reviews or price, you’ve learned something about that user’s priorities. All of this information can be used to tune a user’s search experience.

11.1.2. Tying profile information back to the search index

As you gather user profile information, consider how this can be used in a search solution. In some cases, the connection is easy to find. For instance, if the user shows an affinity toward a particular brand, you can subtly boost that brand. And if a user often looks at reviews, boost items with several positive reviews.

But be careful how you do this boosting, because you can create an unexpected feedback loop that can damage search relevance. For instance, if the user buys Acme Co. products more than once, you should probably boost Acme’s presence in that user’s search results. But if that boost is overwhelming, the search page might be flooded with only Acme products—and less relevant Acme products at that. In future searches, users will show an increased interaction with Acme products, not because they like them, but because your boost makes Acme products so much more prevalent than other products. To make the situation worse, these interactions may look like an increased preference for Acme products and drive further boosting. With Acme products everywhere, you might soon find that customers quit using your search application altogether. Therefore, it’s best to alter the boosting for particular product categories only when you have definitive evidence of a preference for that category—such as a purchase rather than just a product page view.

Sometimes you’ll need to add new signals to the index in order to match information from the user profile. If you know that a user prefers less expensive, “higher value” products, you can’t simply boost all items below $20. A $20 blender is a great value, whereas a $20 can of beans is quite pricey. Instead, you should associate documents and users with some sort of general value rating scale (“cheap” to “boutique”).

Demographic information such as age and gender can be another good set of information to pull into search. Let’s say a profile indicates that the user is a young adult and male. If you know which products sell better in this demographic, give them a boost in the search results. To accomplish this, include a field in the indexed documents listing demographic groups with a high affinity to this item. The task of annotating this field with demographic information likely falls to the content curator. The information itself will probably come from marketing research.

With sufficiently heavy traffic, another source for demographic data is your search logs. Count the number of sales that occur within various demographics. The next time you reindex, add this information to the demographics field. Once this data is in the index, personalizing search is as easy as boosting using the current user’s demographic data.
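
To make this concrete, here’s a minimal sketch of such a demographic boost, using the same function_score pattern that appears later in this chapter (listing 11.2). The index name, the demographic tokens, and the 7.x-style Python Elasticsearch client call are illustrative assumptions, not a prescribed schema:

from elasticsearch import Elasticsearch

es = Elasticsearch()  # assumes a local cluster; 7.x-style client

def demographic_search(query_text, user_demographics):
    # Boost (rather than filter) items tagged with any demographic
    # group that matches the current user.
    body = {
        "query": {
            "function_score": {
                "query": {
                    "multi_match": {
                        "query": query_text,
                        "fields": ["title^3", "description"]}},
                "functions": [{
                    "filter": {"terms": {"demographics": user_demographics}},
                    "weight": 1.1}]}}}
    return es.search(index="products", body=body)

results = demographic_search("blender", ["male_18_24"])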

11.2. Personalizing search based on user behavior

In the previous section, we showed that you can learn about users by observing their behavior in the application. In this section, we take this notion to an extreme with collaborative filtering. This technique uses historical information about user-item interactions (views, ratings, purchases, and so forth) to find items that naturally clump together. For instance, collaborative filtering provides an algorithmic way to state that “users who purchase Barbie dolls will likely also be interested in girls’ dresses.” You can incorporate this information into search for an even more personalized search experience. We call this behavior-based personalization. In this section, you’ll walk through a basic collaborative filtering example and see how to incorporate it into search.

11.2.1. Introducing collaborative filtering

For behavior-based personalization, you narrow your focus. Rather than considering user demographics, search histories, and profiles, you focus solely on user-item interactions. In this section, you’ll look specifically at user-item purchases. In principle, interactions can be anything: item views, saves, ratings, shares, and so forth. Given the data set of user-item interactions, you’ll use collaborative filtering to reveal hidden relationships among users and the items.

Collaborative filtering comes in many forms, from simple counting-based methods (which we introduce in a moment) to highly sophisticated matrix decomposition techniques (which are outside of the scope of this book). But no matter the technique, the input to collaborative filtering and the output from collaborative filtering follow the same pattern.

As shown in figure 11.3, the input is a matrix representing the users’ interactions with items in the index. Each row corresponds to a user, and each column corresponds to an item. The values in the matrix represent user interactions. For the simplest case, the values of the matrix represent whether an interaction has taken place. For instance, the user has viewed or purchased a particular item. In the more general case, the values of this matrix can represent how positive or negative the user-item interactions are. The values can represent a user’s ratings of products purchased in the past, for example.

Figure 11.3. No matter the method used, collaborative filtering typically takes a user-to-item matrix and returns a model to quickly find user-to-item or item-to-item affinity.

Collaborative filtering outputs a model that can find which items are most closely associated with a given user or item. So, given a source item such as apple, the collaborative filtering model might return a list of items, such as banana, orange, or grape, for which apple has a high affinity. Additionally, each item includes an affinity score. Consider the output banana:132, orange:32, grape:11. Here banana has a relatively high affinity for apple, and grape a low affinity.

11.2.2. Basic collaborative filtering using co-occurrence counting

To better understand how collaborative filtering works, let’s look at a simple example using a co-occurrence counting approach. The following algorithm is a bit naïve; we intend it to be introductory and don’t recommend that you implement it in a production system. Nevertheless, it builds up a basic understanding of collaborative filtering, and it removes the feeling that collaborative filtering is somehow magic. As you’ll see, many machine-learning algorithms are based on simple ideas, such as counting the number of times that items are purchased together.

Jumping into the example, let’s say that you work for an e-commerce website and you have a log of all items purchased across all users. Table 11.1 shows a sample.

Table 11.1. Log tabulating users’ purchases

Date                  User       Item
2015-01-24 15:01:29   Allison    Tunisia Sadie dress
2015-01-26 05:13:58   Christina  Gordon Monk stiletto
2015-02-18 10:28:37   David      Ravelli aluminum tripod
2015-03-17 14:29:23   Frank      Nikon digital camera
2015-03-26 18:11:01   Christina  Georgette blouse
2015-04-06 21:50:18   David      Canon 24 mm lens
2015-04-15 10:21:44   Frank      Canon 24 mm lens
2015-04-15 21:53:25   Brenda     Tunisia Sadie dress
2015-07-26 08:08:25   Elise      Nikon digital camera
2015-08-25 20:29:44   Elise      Georgette blouse
2015-09-18 06:40:11   Allison    Georgette blouse
2015-10-15 17:29:32   Brenda     Gordon Monk stiletto
2015-12-15 18:51:19   David      Nikon digital camera
2015-12-20 22:07:16   Elise      Ravelli aluminum tripod

The first thing you must do is group all purchases according to user, as shown in table 11.2.

Table 11.2. The first step for determining item co-occurrence is grouping items by user. A dot (•) indicates a purchase.

             Tunisia    Gordon     Georgette  Nikon      Canon    Ravelli
             Sadie      Monk       blouse     digital    24 mm    aluminum
             dress      stiletto              camera     lens     tripod
Allison        •          -          •          -          -        -
Brenda         •          •          -          -          -        -
Christina      -          •          •          -          -        -
David          -          -          -          •          •        •
Elise          -          -          •          •          -        •
Frank          -          -          -          •          •        -

It’s in this next step where all the “magic” happens. For any given item A, you count the number of times that the purchase of item A co-occurs with a purchase of any other item B by the same user. (Here, the term co-occurs doesn’t imply that the purchases were made at the same time, but that they were made by the same user.) You perform this calculation for every pair of items encountered in the purchase history. After collecting all the co-occurrence counts, you have a measure of the affinity between any two items A and B.
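
There’s no specialized machinery behind this counting; a short Python sketch covers it. Here the purchase log of table 11.1 is hardcoded for illustration:

from collections import defaultdict
from itertools import combinations

purchases = [
    ("Allison", "Tunisia Sadie dress"), ("Christina", "Gordon Monk stiletto"),
    ("David", "Ravelli aluminum tripod"), ("Frank", "Nikon digital camera"),
    ("Christina", "Georgette blouse"), ("David", "Canon 24 mm lens"),
    ("Frank", "Canon 24 mm lens"), ("Brenda", "Tunisia Sadie dress"),
    ("Elise", "Nikon digital camera"), ("Elise", "Georgette blouse"),
    ("Allison", "Georgette blouse"), ("Brenda", "Gordon Monk stiletto"),
    ("David", "Nikon digital camera"), ("Elise", "Ravelli aluminum tripod"),
]

# Step 1: group purchased items by user (table 11.2).
items_by_user = defaultdict(set)
for user, item in purchases:
    items_by_user[user].add(item)

# Step 2: for every pair of items bought by the same user, increment
# that pair's co-occurrence count (table 11.3).
cooccurrence = defaultdict(int)
for items in items_by_user.values():
    for a, b in combinations(sorted(items), 2):
        cooccurrence[(a, b)] += 1

print(cooccurrence[("Canon 24 mm lens", "Nikon digital camera")])  # 2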

As a specific example based on the information in table 11.2, consider the relationship between the Canon 24 mm lens and other items in the index. You can see that only one individual, David, has purchased both the Canon lens and the Ravelli tripod; therefore, these items receive a co-occurrence count of 1. But two individuals, David and Frank, purchased both the Canon lens and the Nikon camera. The co-occurrence count for this pair of items is 2. And finally, no user has purchased both the Canon lens and the Tunisia Sadie dress. Therefore, this co-occurrence count is 0. After performing this calculation for every pair of items in the index, you arrive at the matrix of results displayed in table 11.3.

Table 11.3. Item co-occurrence counts for every item in the purchase history

                          Tunisia    Gordon     Georgette  Nikon      Canon    Ravelli
                          Sadie      Monk       blouse     digital    24 mm    aluminum
                          dress      stiletto              camera     lens     tripod
Tunisia Sadie dress          -          1          1          0          0        0
Gordon Monk stiletto         1          -          1          0          0        0
Georgette blouse             1          1          -          1          0        1
Nikon digital camera         0          0          1          -          2        2
Canon 24 mm lens             0          0          0          2          -        1
Ravelli aluminum tripod      0          0          1          2          1        -

These values indicate the strength of associations between every pair of items. Notice that in this example, as expected, fashion items co-occur more highly with other fashion items. Similarly, photography items co-occur more highly with other photography items. In some instances, fashion items and photography items co-occur. This also is to be expected, because a few users are interested in both fashion and photography at the same time.

Item-to-item affinities can be directly used for item-based recommendations. The data shown in table 11.3 can be saved to a key-value store. Then, when a user visits the details page for the Ravelli aluminum tripod, you look up this item in the key-value store, pull back an ordered set of the corresponding high-affinity items (the Nikon digital camera and the Canon 24 mm lens) and present these items to the user as recommendations. As shown in figure 11.4, this is what Amazon does when it shows you its version of related item recommendations.

Figure 11.4. Item-to-item affinities can be used to make “related item” recommendations. When the user lands on the page for a Frigidaire microwave, you can display items with high affinity to the microwave in a panel similar to these recommendations from Amazon.

Taking the analysis one step further, you can find the affinity between users and the products in your catalog. To do this, refer to the user-item purchases in table 11.2 and, for every purchase made by that user, collect the corresponding item-to-item affinity rows and add them together. For instance, Allison bought the Tunisia Sadie dress and the Georgette blouse. Table 11.4 shows the corresponding rows from the co-occurrence matrix along with the sum of those rows.

Table 11.4. User-to-item affinities can be generated by adding together rows of the item-to-item matrix that correspond to a user’s purchases.

                                        Tunisia    Gordon     Georgette  Nikon      Canon    Ravelli
                                        Sadie      Monk       blouse     digital    24 mm    aluminum
                                        dress      stiletto              camera     lens     tripod
Allison purchases Tunisia Sadie dress      -          1          1          0          0        0
Allison purchases Georgette blouse         1          1          -          1          0        1
Summation                                  1          2          1          1          0        1

After you perform this summation for every user you’re interested in, you end up with a matrix like that shown in table 11.5. The values represent each user’s affinity to every item in the catalog.

Table 11.5. Complete user-to-item affinity matrix

             Tunisia    Gordon     Georgette  Nikon      Canon    Ravelli
             Sadie      Monk       blouse     digital    24 mm    aluminum
             dress      stiletto              camera     lens     tripod
Allison        1          2          1          1          0        1
Brenda         1          1          2          0          0        0
Christina      2          1          1          1          0        1
David          0          0          2          4          3        3
Elise          1          1          2          3          3        3
Frank          0          0          1          2          2        3
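
Continuing the earlier co-occurrence sketch, the summation is a few lines of Python; the helper just makes the pair lookups symmetric:

def cooccurrence_count(a, b):
    # The pair counts were stored with the two item names in sorted order.
    if a == b:
        return 0
    key = (a, b) if a < b else (b, a)
    return cooccurrence[key]

all_items = sorted({item for _, item in purchases})

# Sum the co-occurrence rows for each item the user purchased
# (tables 11.4 and 11.5).
user_affinity = {
    user: {item: sum(cooccurrence_count(bought, item) for bought in owned)
           for item in all_items}
    for user, owned in items_by_user.items()}

print(user_affinity["Allison"]["Gordon Monk stiletto"])  # 2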

Because you started with item purchases, it isn’t usually meaningful to track user affinities toward items that they’ve already purchased. It also isn’t meaningful to keep track of products that users have 0 affinity toward; why would you recommend users something that you don’t think they care about? So let’s remove these values and have another look at the remaining user-item affinity data, shown in table 11.6.

Table 11.6. User-to-item affinity matrix with purchased items and zero-affinity items removed

             Tunisia    Gordon     Georgette  Nikon      Canon    Ravelli
             Sadie      Monk       blouse     digital    24 mm    aluminum
             dress      stiletto              camera     lens     tripod
Allison        -          2          -          1          -        1
Brenda         -          -          2          -          -        -
Christina      2          -          -          1          -        1
David          -          -          2          -          -        -
Elise          1          1          -          -          3        -
Frank          -          -          1          -          -        3

With the clutter removed, it’s easy to see that collaborative filtering works well. Just as in the previous item-to-item case, this information can be used directly for personalized recommendations. If only there were some way to incorporate this into your search application! Don’t worry; you’ll get there soon.

Looking at the data, you can see that fashion shoppers have highest affinity toward fashion items, and that photography shoppers have highest affinity toward photography items. But because one of the users, Elise, has interests in both photography and fashion, crossover recommendations exist between fashion and photography. Because of this, David will probably be confused when he gets a recommendation for the Georgette blouse. Fortunately, as the input data becomes richer (more items and more purchases per item), crossover recommendations such as this will become less prominent, and the user-item affinities will be dominated by the statistically significant co-occurrences.

Furthermore, in richer data sets, when unusual crossovers like this do exist, they’re often fortuitous because they point out a latent relationship among the catalog items. For instance, about the only thing that Mentos has in common with Diet Coke is that they’re both food (sort of). But toward the end of 2005, when the Mentos + Diet Coke experiment went viral on the internet, it became highly likely that these two items would show a spike in purchasing co-occurrence. This highlights the fact that collaborative filtering can identify connections that wouldn’t be obvious by looking only at the textual content of the documents.

As alluded to earlier, finding affinities in this way is a fairly naïve approach. For example, you haven’t normalized for products that are extremely popular. Consider socks. No matter whether you’re interested in fashion, photography, or any other field you can think of, you still regularly purchase socks. Therefore, the co-occurrence count between socks and every item in the index will be very large; everybody will be recommended socks. To resolve this issue, you’d need to divide the co-occurrence values by a notion of popularity for each pair of items.
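
One possible remedy, continuing the earlier sketch, is a cosine-style normalization that divides each raw count by the geometric mean of the two items’ popularities. This is just one simple option among many:

from collections import Counter
from math import sqrt

# How often each item was purchased overall.
popularity = Counter(item for _, item in purchases)

def normalized_affinity(a, b):
    # Ubiquitous items (socks!) have huge popularity, which damps
    # their otherwise inflated co-occurrence counts.
    return cooccurrence_count(a, b) / sqrt(popularity[a] * popularity[b])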

Co-occurrence-based collaborative filtering isn’t the only option for generating item-to-item or user-to-item affinities. If you’re considering building your own recommendations, make sure to review the various matrix-based collaborative-filtering methods such as truncated singular-value decomposition, non-negative matrix factorization, and alternating least squares (made famous in the Netflix movie recommendation challenge). These methods are less intuitive than the simple co-occurrence counting method presented here, and they tend to be more challenging to implement. But they often provide better results, because they employ a more holistic understanding of item-user relationships. To dive deeper into recommendation systems, we recommend Practical Recommender Systems by Kim Falk (Manning, 2016). And no matter the method you choose, keep in mind that the end result is a model that lets you quickly find the item-to-item or user-to-item affinities. This understanding is important as we explain how collaborative filtering results can be used in the context of search.

11.2.3. Tying user behavior information back to the search index

In the previous section, we demonstrated how to build a simple recommendation system. But we’re supposed to be talking about personalized search! In this section, we return to search and explain how the output of collaborative filtering can be used to build a more personalized search experience. We also point out some pitfalls to be aware of.

You can pull collaborative filtering information into search in several ways. The three strategies demonstrated here are related in that they take a standard, text-only search and incorporate collaborative filtering as a multiplicative boost. Here’s how it works: Consider an example base query in which you take the user’s query “Summer Dress” and search across two fields, title and description, as shown in the following listing.

Listing 11.1. Base query
{ "query": {
    "multi_match": {
      "query": "summer dress",
      "fields": ["title^3", "description"]}}}

Given this base query, you incorporate collaborative filtering by applying a multiplicative boost using a function_score query, as shown next.

Listing 11.2. A multiplicative boost can be used to incorporate collaborative filtering
{  "query": {
        "function_score": {
            "query": {
                "multi_match": {
                    "query": "summer dress",
                    "fields": ["title^3", "description"]}},
            "functions": [{
                "filter": { COLLAB_FILTER },
                "weight": 1.1}]}}}

In this simple implementation, the documents that get the collaborative filtering boost are determined by the contents of your COLLAB_FILTER filter (which we discuss in a moment). Notice that this filter doesn’t filter out any documents from the result set. Instead, documents matching this filter are given a multiplicative boost of 1.1, as indicated by the weight parameter. The query of listing 11.2 returns the same documents as the query of listing 11.1, but any documents also matching the COLLAB_FILTER get a 10% boost over the score of the base query. This subtly affects the ordering of the search results so that users making the query will see results that are more aligned with their previous behavior. This is the goal of personalized search.

Query-time personalization

With the basic structure in place, we can discuss the three strategies for incorporating collaborative filtering, each of which corresponds to a different COLLAB_FILTER and indexing strategy. For now, assume that the output of our collaborative filtering process is a set of user-to-item affinities—things like “Elise likes Tunisia Sadie dresses, Gordon Monk stilettos, and Canon 24 mm lenses.” But because we’re talking to machines here, Elise is user381634, the Tunisia Sadie dress is item4816, the Gordon Monk stiletto is item3326, and the Canon 24 mm lens is item9432. Further, assume that you have similar information for all users and all the products in your catalog.

Given this data set, the most straightforward approach for incorporating collaborative filtering is as follows: Start by storing collaborative filtering data in a key-value store (outside the search engine). Let’s say that Elise, user381634, has high-affinity items: item4816, item3326, and item9432. Now, next time Elise uses the search engine, the first step is to retrieve her high-affinity items from the data store, and then referring again to listing 11.2, replace COLLAB_FILTER with a filter to directly boost her high-affinity items by ID:

COLLAB_FILTER = {
  "terms": {
    "id": ["item4816", "item3326", "item9432"]
  }
}

Then any item matching Elise’s high-affinity items will be driven further toward the top of the search results.
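
Pulling the pieces together, a sketch of this strategy might look like the following. A plain dict stands in for the external key-value store, and the IDs are the illustrative ones from above:

# Stand-in for the external key-value store of user-to-item affinities.
high_affinity_items = {
    "user381634": ["item4816", "item3326", "item9432"],  # Elise
}

def personalized_query(query_text, user_id):
    # Splice the user's high-affinity item IDs into the boosting
    # filter of listing 11.2.
    collab_filter = {"terms": {"id": high_affinity_items.get(user_id, [])}}
    return {
        "query": {
            "function_score": {
                "query": {
                    "multi_match": {
                        "query": query_text,
                        "fields": ["title^3", "description"]}},
                "functions": [{
                    "filter": collab_filter,
                    "weight": 1.1}]}}}

body = personalized_query("summer dress", "user381634")  # ready to send to the search engine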

Although this is the most obvious approach, it might become computationally expensive at query time. In the preceding example, Elise has only three high-affinity items. In a more realistic implementation, a user could have hundreds or potentially thousands of high-affinity items. And at some point, having too many terms ORed together like this makes for slow queries. But you might be surprised at the extent to which this approach will scale. For instance, consider that Lucene does quite well with geo search. But as we discussed in chapter 4, under the hood geo search is implemented by ORing together many, possibly hundreds, of terms that represent a geographic area. Besides, in many personalized search applications, you won’t need thousands of high-affinity items anyway; a user’s searches will often tend toward a relatively small domain of interest. A few hundred high-affinity items in this domain will likely provide users with a noticeably personalized search experience.

Index-time personalization

If your application can’t afford the performance hit at query time, the next approach places the burden on the index size. A benefit of this technique is that there’s no need for an external key-value store, because you’ll save collaborative filtering information directly to the index.

To do this, you add a new field to the documents being indexed named users_who_might_like. As the name indicates, this field contains a list of all users who might like a given item. For example, when you index the Gordon Monk stiletto, you include all the typical information that you need for search: title, description, price, and so forth. But this time you also include a users_who_might_like field, which is a list of all users showing a high affinity to this item. Referring to table 11.6, you can see that both Allison (user121212) and Elise (user989898) have a high affinity to this item. In this case, the users_who_might_like field will be the list user121212, user989898.
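
In sketch form, indexing such a document might look like this (again a 7.x-style Python Elasticsearch client call; the index name and field values are illustrative):

from elasticsearch import Elasticsearch

es = Elasticsearch()
es.index(index="products", id="item3326", body={
    "title": "Gordon Monk stiletto",
    "description": "...",  # the usual searchable content
    # Users with a high affinity to this item, per table 11.6.
    "users_who_might_like": ["user121212", "user989898"],
})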

After all documents are indexed with their corresponding users_who_might_like field, the rest is easy. At query time, when Allison (user121212) makes a search, you issue her query along with a simple boosting filter:

COLLAB_FILTER = {
  "term": {
    "users_who_might_like": "user121212"
  }
}

Again, this returns the same results as the base query, but any document that includes Allison as a “user who might like” gets a 10% boost—pushing it toward the top of the search results. You can see that this query is much easier on the search engine at query time, because there’s only one extra term. But with this method, you must be watchful of the index size. As a frame of reference, in a modest Lucene index of 500,000 one-page English text documents, you might expect something like 1 million unique terms in the index. With this approach, each unique user ID represents another term in the index. So you should expect this approach to scale well for hundreds of thousands of users. But if you have millions of users, your index may outgrow its servers. Fortunately, this method scales well horizontally. You can create shards that represent a portion of your customers. If you have millions of users, you probably have the resources for this.

Index- and query-time personalization

Our final approach splits the difference between the two preceding approaches. Previously, you used user-to-item affinities, but for this last approach you’ll assume that the output of the collaborative filtering is a set of item-to-item affinities like those shown in table 11.3. In this new approach, the search engine calculates user-to-item affinity implicitly at the time of the query.

The setup for this approach is more involved than in the other approaches. You’ll again require a key-value store to look up user-related information. But this time rather than storing user-to-item affinities, you store the users’ most recent purchases. You also add a new field, related_items, to the index. As the name suggests, this field will contain a list of IDs for high-affinity items. At query time when Frank makes a search, you first pull his recent purchases—a Nikon digital camera (item1234) and a Canon 24 mm lens (item9432)—and then you issue his query along with the following boosting filter:

COLLAB_FILTER = {
  "terms": {
    "related_items": ["item1234", "item9432"]
  }
}

You query using a list of IDs just as in the first method, but it’s a much shorter list. And as in the second method, you’re required to index an extra field with a list of IDs, but unless you have millions of items, this will also be much less information than in the previous method.

As mentioned at the opening of this section, the preceding methods are just a few of the many ways that collaborative filtering can be incorporated into search. And you can improve these methods in many ways. For instance, you probably noticed that we didn’t mention the affinity values in any of these solutions. Instead we lumped items into two groups: high-affinity items (matching the COLLAB_FILTER filter) and lower-affinity items. You can modify these methods to take the individual affinity values into account, but this requires creativity involving payloads and scripting or possibly even a custom search-engine plugin.

11.3. Basic methods for building concept search

Personalized search is just one of the many possible directions to explore outside the more standard approaches presented in the previous chapters. Another interesting extension of search is concept search. Before reading this book, you probably thought of search as the process of finding documents that match user-supplied keywords and filters. Hopefully, you’ve come to realize that a good search application works to infer the user’s intent and provide documents that carry the information that the user seeks. Concept search takes this notion to an extreme.

The goal of concept search is to augment a search application so that it in some sense understands the meaning of the user’s query. And because of this understanding, the documents returned from a concept search may not match any of the user’s search keywords, but will nevertheless contain meaningful information that the user is looking for. To borrow a phrase coined by Google, the goal of concept search is to allow users to “search for things, not strings.”

Perhaps an example will help bring home the need for concept search. Consider a search application for medical journals. Using a typical string-based approach, a search for “Heart Attack” would fall short of the ideal. Why? Because medical literature uses various words for heart attack, such as myocardial infarction, cardiac arrest, coronary thrombosis, and many more. Plenty of articles about heart attacks won’t mention heart attack at all. Concept search provides the user with an augmented search experience by bringing back documents that talk about heart attack even if they happen to not contain that specific phrase.

Still—and we can’t emphasize this enough—a search engine at its core is a sophisticated token-matching and document-scoring system. The crux of concept search isn’t magic; it involves augmenting queries and documents to take advantage of new relevance signals that increase search recall. By carefully balancing these new concept signals, you can ensure that search results retain a high level of precision. In this section, we cover several human-driven methods for augmenting your search application to take on a more conceptual nature.

11.3.1. Building concept signals

Initially, you may reach toward human-powered document tagging to implement concept search. You can create a field that will serve as a dumping ground for terms and phrases that answer the question “What is this document about?” This field will be the home of your new concept signal.

With this approach, when users of the medical journal application search for “Heart Attack” but miss an important article, you add the phrase heart attack to your concept field. Thereafter the document will be a match. This approach also helps fine-tune a document’s score. For instance, let’s say you have an important article about heart attacks. It may even contain the phrase heart attack. Unfortunately, it shows up on the second page of search results. Rather than attempt to solve the problem globally, add the phrase heart attack to the concept field (maybe even add it multiple times). This nudges the score for that document just a little bit higher whenever the user searches for “Heart Attack.”

But be forewarned that human curation can be challenging and resource consuming. Accurate tagging requires extensive domain expertise and rigorous consistency. For example, should the heart attack query also be tagged with heart? In the domain of your users, does acute heart attack differ from heart attack? Should a document receive both tags? Only trained, domain-aware content curators can make these fine-grained distinctions. Tagging also takes a lot of human effort. It may require deep reading of the content, which may not scale to cover realistic data sets.

One way to reduce the curation workload is by looking to your users as a possible way to crowd-source the concept signal. Do you recall the conversation about thrashing behavior in chapter 9? When thrashing, an unsatisfied search user quickly moves from one search to another, indicating that the search results don’t match their intent. Imagine that a user searches for “Myocardial Infarction,” spends about 20 seconds on the results page, and then makes a new search for “Cardiac Arrest.” It’s obvious that this user isn’t finding what they are looking for.

But often these users do finally find a relevant search result. Once there, it’s as if the user tells you, “Hey, remember all that other stuff I was searching for? This is what I meant!” Imagine that in our example, the user still doesn’t find anything interesting in the cardiac arrest results and submits a third query—this time for “Heart Attack.” Upon seeing the results, the user clicks the second document in the result set and then doesn’t return to search. This user implicitly tells us that the phrases heart attack, cardiac arrest, and myocardial infarction are somehow related. Therefore, why not take the thrashing search terms and add them to the concept field for the document that finally satisfies the user’s information need? This way, the next time someone follows the same path as our thrashing user, they will more likely find what they need in their first set of search results.

Again, the main goal of the new concept field is to increase search recall. But you should be careful to not ignore the impact made upon precision. In the preceding example, if the user initially searches for “Cardiac Arrest” but then changes to “Gall Bladder,” your concept signal may become noisy. Make sure to properly balance the concept signal with the existing signals. If the concept field is human curated, it’s more likely to be high quality than a user-generated field and should be more strongly represented in the relevance score.
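
As a sketch, that balance can be expressed directly in the base query by giving the concept field its own boost. The concepts field name and the weights here are illustrative assumptions; the exact boosts require tuning:

body = {
    "query": {
        "multi_match": {
            "query": "heart attack",
            # A curated concept field can carry more weight than raw
            # description text, but less than an exact title match.
            "fields": ["title^3", "concepts^2", "description"]}}}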

11.3.2. Augmenting content with synonyms

Synonym analysis is another useful way to inject deeper conceptual understanding into search. The first time you open the synonym file and add an entry such as the following, you’re building out concept search:

TV, T.V., television

When the user asks the search application for a “TV,” the search application answers back, “I have documents with TV—but I bet you’re also interested in these other documents that have words like T.V. and television, right?”

Initially, synonym augmentation of the documents takes place somewhat manually. Content curators may hand-generate an extensive synonym list customized to the jargon of a field. In larger domains—medicine is again a good example—it may be possible to repurpose a publicly available taxonomy such as the Medical Subject Headings (MeSH) for use during synonym analysis.

One thing to think about when using synonyms is whether to encode hierarchical structuring with the synonyms. For instance, in the preceding simple case with television, all the synonym entries are semantically on the same level; they truly are synonyms. But, as discussed in chapter 4, it’s common to use synonyms to encode a notion of specificity into the indexed terms. For instance, consider the following synonym entries:

marigold => yellow, bright_color
canary => yellow, bright_color
yellow => bright_color

This encodes a hierarchy for yellow things (and you can imagine what a much larger synonym set for all colors would look like). When used correctly, synonyms like this will serve to expand a user’s narrow query, for instance “Canary,” to a broader notion: yellow. This improves recall, allowing the user to find items that use more-general terminology.

As always, recall and precision must be balanced. Fortunately, the natural TF × IDF scoring works in this case. Namely, after synonym analysis has been applied, specific terms such as marigold will occur much less often in the index than terms like yellow and bright_color. Therefore, when a user searches for a specific term such as “Marigold,” both marigold and yellow documents will be returned, but marigold documents will be scored above the more general yellow documents.
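
Concretely, hierarchical synonyms like these might be wired up as an Elasticsearch synonym token filter. In this sketch each rule also keeps the original term, so that specific queries still match their specific documents, in keeping with the scoring argument above; all names are illustrative:

settings = {
    "settings": {
        "analysis": {
            "filter": {
                "color_hierarchy": {
                    "type": "synonym",
                    "synonyms": [
                        "marigold => marigold, yellow, bright_color",
                        "canary => canary, yellow, bright_color",
                        "yellow => yellow, bright_color"]}},
            "analyzer": {
                "color_aware": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "color_hierarchy"]}}}}}
# Applied at index creation, for example:
# es.indices.create(index="products", body=settings)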

As a final note, synonym augmentation and concept fields are complementary approaches. Synonym analysis usually dumps the synonyms back in the same field as the source text that was analyzed. But if you’re concerned that your synonyms are noisy, it might be a good idea to stick them into a separate field so that they can be given a lower weight. Another good combination of concept fields and synonym augmentation occurs in document tagging. In this scenario, you can greatly reduce the burden placed on the people doing the tagging by having them apply only the most specific tags. Then hierarchically structured synonyms can be used to automatically augment the documents with less specific tags.

11.4. Building concept search using machine learning

In the earlier sections, we introduced personalized search using simple methods and then moved on to more sophisticated, machine-learning approaches. Here we follow this same route, moving from human-powered concept management to a machine-learning approach that we call content augmentation.

Just as before, the goal is to include a new content signal into the documents to improve search recall. In particular, you’ll use machine learning to automatically generate pseudo-content to be added back into the indexed document. This content won’t be new paragraphs of human-readable text. Rather, the pseudo-content will be a dump of terms that aren’t in the original document but that, in some statistically justifiable sense, should be present because they pertain to the concepts in that document.

To generate the new pseudo-content, you algorithmically model the statistical relationship between words based on the documents that contain them. For instance, consider a medical journal article that contains the word cardiac. There’s a high probability that the same article will also contain words like heart, arrest, surgery, and circulatory; these words are related to the cardiac topic. But it’s unlikely that the same article will contain the words clown, banana, pajamas, and Spock; these words have little in common with the cardiac topic. By looking at the co-occurrence of words, you can begin to understand how they’re interrelated. And once you have a good model of these relationships, you can take any given document and generate a set of words that in some sense should be in that document.

Let’s look at an extremely simplified example. Consider the small set of documents displayed in listing 11.3. Each document is a sentence ostensibly about dogs or cats. If you put these documents through analysis, you can split out the tokens, normalize them (by lowercasing and stemming), and filter out the common stop words. The end result can be represented in matrix form, as shown in table 11.7. Here a dot (•) indicates that the term (represented in that column) has occurred one or more times in the document (represented in that row).

Table 11.7. Matrix representation of the words and the documents that contain them. (Stop words have been removed and plural words have been stemmed. Additionally, the columns are arranged to draw attention to statistically clustered words.)

        dog    happy    friendly    cat    fluffy    sly
doc1     •       •         -         -       -        -
doc2     •       •         •         -       -        -
doc3     •       -         -         -       -        -
doc4     -       -         -         •       -        •
doc5     -       -         •         •       •        -
doc6     -       -         -         •       -        •

Listing 11.3. Simple documents illustrating term co-occurrence
doc1: The dog is happy.
doc2: A friendly dog is a happy dog.
doc3: He is a dog.
doc4: Cats are sly.
doc5: Fluffy cats are friendly.
doc6: The sly cat is sly.

You may notice that this term-document matrix resembles table 11.2’s user-items matrix. This is no coincidence. The problem of identifying related items based on user interactions is nearly identical to the problem of identifying related terms based on document co-occurrence. Whereas the earlier personalization example revealed natural clustering among fashion items and photography items, table 11.7 reveals a clustering among dog terms and another clustering among cat terms. In principle, you could even use the same co-occurrence counting method to identify closely related terms. But in practice, more-sophisticated methods are used such as latent semantic analysis, latent Dirichlet allocation, or the recently popular Word2vec algorithm. Though these methods are well beyond the scope of this book, the models generated from these algorithms allow you to “recommend” pseudo-content for documents in much the same way that section 11.2.2 showed how to recommend products based on user-to-item and item-to-item affinity.
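
To give a taste of what this looks like in practice, here’s a sketch that trains Word2vec on the toy documents of listing 11.3 using the gensim library. The hyperparameters are illustrative, and a real corpus would need to be far larger for the similarities to be meaningful:

from gensim.models import Word2Vec

# The analyzed tokens of listing 11.3 (stopped and stemmed).
docs = [
    ["dog", "happy"],
    ["friendly", "dog", "happy", "dog"],
    ["dog"],
    ["cat", "sly"],
    ["fluffy", "cat", "friendly"],
    ["sly", "cat", "sly"],
]
model = Word2Vec(sentences=docs, vector_size=16, window=3,
                 min_count=1, epochs=200, seed=42)

# Nearest terms to "cat" become candidate pseudo-content for
# documents about cats.
print(model.wv.most_similar("cat", topn=3))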

After automatically generated pseudo-content is indexed with each document, queries can match documents that didn’t originally contain the user’s search terms. Nevertheless, these documents will be strongly associated with the keywords based on statistical term co-occurrence. In the preceding example, a search for “Cat” may return a document that talks about a sly fluffy animal even if that document doesn’t contain the word cat.

11.4.1. The importance of phrases in concept search

We haven’t discussed an important component of concept search: phrases. Often phrases carry meaning that’s more specific than the terms that compose the phrase. Case in point: software developer. A software developer isn’t software, but a person who develops software. Additionally, there are many types of developers, and a land developer, for example, has nothing to do with software development.

Therefore, prior to content augmentation, it’s useful to first identify statistically significant phrases within the text. These phrases can be added to the columns of the term-document matrix so that during content augmentation these conceptually precise phrases will be included in the newly generated content.

Collocation extraction is a common technique for identifying statistically significant phrases. The text of the documents is split into n-grams (commonly bigrams). Then statistical analysis is used to determine which n-grams occur frequently enough to be considered statistically significant. As is often the case, this analysis is a glorified counting algorithm. For instance, if your document set contains 1,000 occurrences of the term developer, and if 25% of the time the term developer is preceded by the term software, the bigram software developer should probably be marked as a significant phrase.
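
Here’s a minimal counting sketch along those lines. The significance test (a minimum count plus a ratio against the rarer term’s frequency) is one simple choice among many; real collocation extraction usually applies a proper statistical test:

from collections import Counter

def significant_bigrams(token_lists, threshold=0.25, min_count=2):
    """Flag bigrams whose count is a large fraction of the rarer term's count."""
    unigrams, bigrams = Counter(), Counter()
    for tokens in token_lists:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return [pair for pair, count in bigrams.items()
            if count >= min_count
            and count / min(unigrams[pair[0]], unigrams[pair[1]]) >= threshold]

print(significant_bigrams([
    ["software", "developer", "writes", "software"],
    ["software", "developer", "tests", "software"]]))
# [('software', 'developer')]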

11.5. The personalized search—concept search connection

As we indicated earlier, a strong relationship exists between personalization and concept search. Both rely on crafting new signals to improve precision and recall. Both use similar machine-learning approaches. But the relationship goes even deeper, because these methods can be used together to improve relevance even further.

Consider the cold-start problem. Let’s say that you’re trying to build personalized search based on collaborative filtering. What happens when a new item is introduced to your catalog? Recall that collaborative filtering methods depend on user interactions. Because no one has ever interacted with the new item, no personalization happens. You can’t recommend a new item based on behavioral patterns that don’t exist; this is known as the cold-start problem.

This raises an important question, though. You do at least have some information for the item: its textual content. Can this be used to generate personalized recommendations? Yes! To do this, you incorporate aspects of concept search into your personalization strategy. With concept search, you augment documents with a broader, conceptual understanding of the text. When pulling concept search into personalization, you must instead augment your user profiles to track the concepts that they’re interested in. And in this case, just as in the preceding section, concepts are the important words and phrases that a user has shown high affinity toward.

There are plenty of ways to determine content that holds high affinity to your users. One way is to turn back to machine learning and somehow infer user-to-term affinities based on the user’s interactions with documents and the text of those documents. But there’s no need to get overly sophisticated; users are constantly feeding us high-affinity terminology in the searches that they make. And if you’re fortunate enough to have highly engaged users, you may even be able to directly ask what types of content they’re interested in.

Now, reversing the perspective, consider how the behavioral information used with collaborative filtering can also be used to augment concept search. In section 11.2, you saw how collaborative filtering could establish a relationship among camera equipment and a different relationship among fashion items. These relationships were established based solely on user behavior; the items’ textual content played no role. Nevertheless, as this example demonstrates, behaviorally related items are often conceptually related as well. Therefore, when the textual content associated with an item is weak, behavioral information can be employed to help users find what they’re looking for.

11.6. Recommendation as a generalization of search

Throughout this book, we’ve covered the ins and outs of search. We’ve pulled open the search engine, explained the inner workings, and built techniques for producing a highly relevant search application. We discussed business concerns, describing how to shape a culture that makes search relevance a central issue. In this chapter, we’ve pointed to ways to imbue search with an almost spooky ability to understand the user’s intent.

But here, at the end of this last chapter, we expose a new challenge to everything we’ve written to this point:

Maybe search is not the application you should be building. Maybe you should be building recommendations.

Consider what happens if there’s no explicit search in a personalized, concept-based search application. What if the search box is left empty and filters left unchecked? Can the application still be put to use? Yes! Even without immediate input from the user, the application has a significant amount of context that can be used to make rich recommendations. For instance, if the user looks at an item detail page, then the application can use methods discussed in this chapter to recommend related items. If the user interacted with the application in the past, the recommendations can incorporate the user’s behavioral and conceptual information so that the recommendations will be personalized. So you see, recommendation is something that can still exist without an explicit query from the user. As we’ll show in the following paragraphs, it may even be useful to think of search as a subset of recommendation.

Let’s dig further. Think about the analogues to search and recommendation that exist in real life. When viewed in its worst possible light, a basic search application can be like a gum-chewing, disinterested, teenage store clerk. You say, “I need a shirt,” and the clerk points to a wall of shirts. There are hundreds of shirts—a mixture of all styles, sizes, and prices imaginable. It’s too much to process, so you attempt to filter the search: “Yeah, but do you have anything in size M?” The clerk (still smacking gum) glances up and points to the bottom rack. There are still a lot of shirts to choose from—a mixture of various styles and prices—but you need a new shirt, so you walk over and start looking through the shirts in your size.

Even though we’ve couched this story in terms of a search for a new shirt, the store clerk is effectively making recommendations to the customer. They’re just not particularly good recommendations. The clerk ignores the other personalization and conceptual context clues that could help direct customers to just the item they’re looking for.

Continuing with the analogy, let’s replace the gum-chewing store clerk with your own personal fashion consultant. This time you walk into the store and say to the fashion consultant, “I need a shirt,” and the fashion consultant takes you directly to the shelf with shirts that match your size. The fashion consultant is a well-studied expert, keenly aware of the types of clothing in style and how to pair clothing items to make a good outfit. She’s also keenly aware of your personal style. The fashion consultant busies herself looking through the rack, pulling out the shirts that are a good match, and arranging them for you to look through yourself. Then she looks up and asks, “Oh, what price range are we looking in today?”—extra context. Upon hearing your response, she plucks out a couple of the overpriced shirts and hangs them back on the rack. Finally, she helps you look through the remaining items.

Now you’re getting somewhere. The attentive and highly educated fashion consultant doesn’t leave you wandering aimlessly to search for a shirt by yourself, but works with you to provide recommendations that take into account information about both you and the fashion domain. And you know what? After helping you pick out that shirt, the fashion consultant takes you over to the hat rack beside the register and says, “Check out this hat. This is a perfect match for you and would look great when you’re wearing that new shirt.” And she’s right; it’s a cool hat! This is the epitome of recommendation, because, even without making an explicit search, the fashion consultant is ready to provide feedback based on whatever information is at hand.

11.6.1. Replacing search with recommendation

As the preceding story illustrates, recommendation could be seen as an overarching and unifying concept—which happens to include the notion of search. Here’s a formal definition:

Recommendation is the ability to provide users with the best items available based on the best information at hand.

The most interesting part of this definition is the word information. Here, information comes in three flavors: information about the users, about the items in the catalog, and about the current context of recommendation:

  • User information —As users interact with the application, you can identify patterns in their behavior and learn about their interests and tastes. Particularly engaged users might even be willing to directly tell us about their interests.
  • Item information —To make good recommendations, it’s important to be familiar with the items in the catalog. At a minimum, the items need to have useful textual content to match on. Items also need good metadata for boosting and filtering. In more advanced recommendation systems, you should also take advantage of the overall user behavior that gives you new information about how items in the catalog are interrelated.
  • Recommendation context —To provide users with the best recommendations possible, you must consider their current context. Are they looking at an item details page? Then you should make recommendations for related items in case they aren’t sold on this one. Is the user getting ready to check out? Then let’s recommend popular, low-cost items. Are you sending out an email newsletter? Then let’s show the users some highly personalized recommendations and see if you can bring them back to the site.

You might notice that search is barely mentioned in this discussion. What gives? Is it just ... gone? Quite the opposite! Search is still present; it’s just another one of the possible contexts for recommendation. And as a matter of fact, search is the most important context, because it represents users telling you exactly what they’re looking for right now. When a user makes a search, you have access to the richest information possible and should therefore be able to make better-informed recommendations. To pick back up with the fashion consultant example, search is the point where you tell the consultant, “You know what? Today I’m looking for a Hawaiian shirt.” And the consultant recommends a shirt that not only matches the current search context (it’s Hawaiian), but also matches your established personal preferences.

11.7. Best wishes on your search relevance journey

We’ve finally come to the close of this book. Before you leave, consider how far you’ve come:

  • Chapter 1 helped familiarize you with the problem space and the challenges that you’ll likely encounter as you work to improve your own search relevance issues.
  • Chapter 2 laid the foundation for technical discussions in the book by explaining how search technology works—inside and out.
  • Chapter 3 introduced debugging tools useful for isolating a wide range of relevance problems.
  • Chapter 4 described how text is processed in order to extract the features most useful in search.
  • Chapters 5 and 6 discussed how textual features are used to build higher-level relevance signals and the various ways that these signals can be combined.
  • Chapter 7 explained how functions and boosting are used to further tune relevance and to shape search results to achieve business goals.
  • Chapter 8 revealed that relevance is more than just tuning parameters; it’s also about helping users understand and refine the information being made available to them.
  • Chapter 9 provided an end-to-end relevance case study that combines the lessons of the previous chapters and outlines a systematic approach to designing relevant search applications.
  • Chapter 10 described how to shift the organizational culture to focus on relevance.
  • Chapter 11, this chapter, broadened the horizons of search to include personalized search, concept search, and recommendations.

After covering all these topics, we’re sure that you’ll find—and are probably already finding—that each search application has its own set of relevance challenges. But after reading through this book, you should find yourself much better equipped to meet and overcome these challenges.

11.8. Summary

  • Personalized search provides a customized experience for individual users and allows high-affinity items to be boosted toward the top of search results.
  • Build basic search personalization by creating user profiles that track preference and demographic information. Make sure that this information is represented in the search index.
  • Use collaborative filtering to create personalized search based on user behavior.
  • Concept search provides users not only with documents that match their search terms but also with documents that match the meaning of their search.
  • Concept search requires adding new content to the index or to the users’ queries in order to improve recall. It’s important to carefully balance the new concept signals in order to retain precision.
  • Use personalized search and concept search together to complement and augment one another.
  • Personalized and concept search without an explicit user query is effectively the same thing as a recommendation.
  • Consider recommendation as a generalization of search.