Chapter 7

User Preferences and Recommendations

Pervasive applications assist human users in performing various tasks. In particular, these applications should become smart in the sense that, even when users do not explicitly specify their needs, the applications can learn from past interactions and adapt their future behavior to provide customized services. To achieve this goal, pervasive applications learn user preferences from user interactions and make recommendations based on the learned preferences.

A recommendation is a way to help people find information that suits their needs and is widely used by many online services to suggest products that customers might like to buy. For instance, Amazon uses an item-to-item collaborative filtering (CF) algorithm as its underlying recommender system [1]. Google uses CF to generate personalized recommendations for Google News users [2].

Previous recommender systems can be categorized into two types: content-based approaches [3] and collaborative filtering [4]. Content-based recommendation offers a user items that are similar to his past favorites. CF can be item-to-item (people who buy X also buy Y), user-based (recommended by friends or by users with similar interests), or a hybrid of both. Unlike content-based approaches, CF approaches make predictions from large-scale item–user matrices. Generally, CF can be classified into memory-based and model-based approaches. Memory-based schemes achieve high accuracy by exploiting similarity functions between items and users [5–8]. Model-based CF approaches [1,9,10] first cluster all items or users into classes via machine learning algorithms and then use these classes for prediction. Recently, CF approaches based on matrix factorization have been shown to be very effective in the Netflix Prize competition [11,12].

This chapter is organized as follows. Section 7.1 discusses an RSS (really simple syndication or rich site summary) reader application that uses content-based recommendations. Then, we describe a CF algorithm that makes recommendations based on similar users and items in Section 7.2. Finally, Section 7.3 illustrates a top-K recommendation problem for social networks, where user preferences are computed as latent features, and a Monte Carlo simulation is utilized for making recommendations.

7.1  Content-Based Recommendation in an RSS Reader

This section discusses content-based recommendation in an RSS reader, where recommendations are made according to user preferences captured during the use of the system. RSS is a technology developed so that a site can create feeds through which subscribed users receive the most recent updates. An RSS feed is an XML document containing information, such as a title, full or summary text, publishing date, authorship, and URL, about new postings from a specific Web site. We call a posting in an RSS feed an item, which can be a piece of news, a blog, a tweet, or even a picture. To help users manage and read RSS feeds, many RSS readers [13] have been developed that repeatedly fetch all RSS feeds and automatically check for updates.

Currently, most RSS readers sort new items by publishing date, which makes it hard for users to find the information they would like to read. The problem is that users may subscribe to a number of RSS feeds that together produce thousands of new items each day; as a result, it is difficult for users to find the most interesting items. The InterSynd system [14] recommends new RSS feeds (not new items) to users based on their neighbors' subscriptions. NectaRSS [15] is another RSS reader that builds a user's profile from his browsing history and recommends items close to that profile, an approach based on text similarity. CRESDUP [3] also recommends advertisements for RSS feeds using text similarity. Our work differs from NectaRSS and CRESDUP in that we study a number of different features for RSS recommendation, as well as the effectiveness of different features and feature combinations.

7.1.1  Data Collection

We have designed and implemented a Web-based RSS reader that uses YUI [16] to provide a user-friendly Ajax-based interface. Figure 7.1 shows the user interface of our RSS reader. Users can add an RSS feed by providing the URL of the feed and then subscribing to it. In the all-items view, users see a ranked list of items from all RSS feeds, with recommendations made by the system displayed at the top of the page. After reading an item, users can click a smile icon below the text of the item to mark it as a favorite; in this way, we know which items users like. To find new items, our system fetches all RSS feeds every 30 minutes, so it is unlikely to miss an item. Thirty minutes is a very short time period; Google Reader checks for updates about once per hour [17].


Figure 7.1 User interface of our RSS reader.

When new items are downloaded, our RSS reader ranks them according to the user's preferences and shows the ranked results. The idea is that users see the items they are interested in at the top positions. Previous research [3,15] has primarily adopted text similarity for recommendations, but we also consider other features, such as a user's past preference for a feed (favorite fraction), the update frequency of a feed, and PostRank [18] values. We study the effectiveness of recommendations for different features and feature combinations. The experimental results show that favorite fraction is the single most important feature, and that the combination of text similarity and favorite fraction performs better than other combinations.

We invited 20 volunteers to use our Web-based RSS reader. Each time a user logs in, the reader creates a session and randomly chooses a method to rank all the new postings. The ranking methods include ordering by a single feature (publishing date, text similarity, and PostRank) and ordering by feature combinations. We only tell the volunteers that the system can recommend items for them, but they are unaware of any specific ranking method used in the session. In this way, they will not have bias toward any of the ranking methods. Each time a user marks an item as read or as his favorite, that item’s information will be saved so that we can easily recalculate the ranking using other methods in the future.

In total, the system recorded 102 distinct sessions. Users read 6073 items using our RSS reader and marked 418 items as their favorites (about 6.9%). In sessions in which users read at least 100 items, the average number of favorite items was 7.8 per session.

7.1.2  Recommendation Features

We consider a number of features in our recommendation algorithms.

■  Similarity. This feature represents the similarity between an item and the preference of a user. We follow the traditional information retrieval technique of using a vector space model to represent a text item or a user preference. Then, the similarity is calculated as the cosine value between two vectors.

■  Document vector. The document vector of an item in an RSS feed is constructed with a TF-IDF weighting method [19]; that is,

$\mathbf{w}_d = (w_{t_1}, w_{t_2}, \ldots, w_{t_m})$

(7.1)

where $\mathbf{w}_d$ denotes the document vector for document $d$, and $w_{t_i}$ denotes the TF-IDF value of term $t_i$ in document $d$. Stop words are removed from the term vocabulary. Term frequency (TF) is calculated using maximum TF normalization (NTF):

$\mathrm{ntf}_{t,d} = \alpha + (1 - \alpha)\,\frac{\mathrm{tf}_{t,d}}{\mathrm{tf}_{\max}(d)}$

(7.2)

where $\mathrm{tf}_{t,d}$ is the number of occurrences of term $t$ in document $d$, and $\mathrm{tf}_{\max}(d)$ is the maximum term frequency in document $d$. NTF is chosen over TF for the following reason: the length of an RSS item varies widely, which yields large TF values for longer documents and gives an unfair bias toward them. NTF instead scales down TF values by the maximum TF in the document and introduces the parameter $\alpha$ for smoothing; thus, it is more effective for RSS feeds. The constant $\alpha$ is set to 0.5 as suggested in one study [19]. (A code sketch of these feature computations follows this feature list.)

■  User preference. A user’s preference is constructed from items that have been marked as favorites by a certain user. We only consider the recent N favorite items of the user because a user’s interests may vary over time. Specifically, a user’s preference is defined as the centroid of his favorite items:

$P_u = \frac{1}{N}\sum_{d \in F_u(N)} \mathbf{w}_d$

(7.3)

where $P_u$ denotes the profile of user $u$, and $F_u(N)$ denotes the most recent $N$ items that user $u$ has marked as favorites. We set $N$ to 100 so that we have enough samples from the user while keeping the centroid computation efficient. The similarity between a user profile and a new item (Sim) is calculated as their cosine similarity:

$\mathrm{Sim}(P_u, \mathbf{w}_d) = \frac{P_u \cdot \mathbf{w}_d}{|P_u|\,|\mathbf{w}_d|}$

(7.4)

An item with a high Sim is more relevant to the user profile, and the user is more likely to be interested in reading it; conversely, a low Sim means the item is irrelevant. Sim ranges from 0 to 1.

■  Favorite fraction. The favorite fraction (FF) feature measures the degree that a user may like an RSS feed. Specifically, the FF value for user u on feed i is defined as:

$FF(i,u) = \frac{|S_{\mathrm{fav}}(i,u)|}{|S_{\mathrm{read}}(i,u)|}$

(7.5)

where $|S_{\mathrm{fav}}(i,u)|$ denotes the total number of items in feed $i$ that user $u$ has marked as favorites, and $|S_{\mathrm{read}}(i,u)|$ denotes the total number of items in feed $i$ that user $u$ has read. FF ranges from 0 to 1 and reflects the affinity of a user for a specific feed. With a high FF value, the user will probably like new postings from the feed. For instance, a person may prefer The New York Times to the Los Angeles Times but subscribe to RSS feeds from both. With the FF feature, we can capture such a user preference across different RSS feeds.

■  Inverse update frequency. Update frequency measures the productivity of a feed. This is a feature solely related to RSS content providers. For many feeds, low productivity often leads to high quality. For example, the news RSS feeds from Sina.com publish hundreds of new items per day. It is unlikely that a user will be interested in all these new items. On the other hand, a blogger may only write a few blogs each day, and these blogs are of high quality to the user. Thus, we design an inverse update frequency (IUF) feature to capture the productivity of RSS feeds, which is defined as follows:

$IUF(i) = 1 - \frac{n(i)}{n_{\max}}$

(7.6)

where $n_{\max}$ denotes the maximum number of updates per day over all feeds, and $n(i)$ is the number of new items per day of feed $i$.

■  PostRank. The PostRank feature is obtained from PostRank.com [18], an online RSS reader. For each new RSS item, PostRank.com assigns a PostRank value (PR) according to the information about the item retrieved from other Web sites, such as the number of diggs and comments at Digg.com, how many people have bookmarked the item, whether or not the item is mentioned in Twitter or FriendFeed, and the number of clicks made by users using the PostRank Reader. A higher PR value means more people are talking about the item or more people like it. Thus, items with high PR are more likely to be of good quality or important. The original PR value ranges from 0 to 10, and we normalize it to be from 0 to 1.
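To make the feature definitions concrete, the following minimal Python sketch computes NTF-weighted document vectors, the cosine similarity of Equation 7.4, the user-profile centroid of Equation 7.3, and the FF and IUF scores. It is an illustration under our own naming (e.g., the precomputed idf dictionary), not the original system's implementation.

import math
from collections import Counter

ALPHA = 0.5  # smoothing constant alpha from Equation 7.2

def ntf_idf_vector(terms, idf):
    """Document vector with NTF x IDF weights (Equations 7.1 and 7.2)."""
    tf = Counter(terms)
    tf_max = max(tf.values())
    return {t: (ALPHA + (1 - ALPHA) * n / tf_max) * idf.get(t, 0.0)
            for t, n in tf.items()}

def cosine(p, q):
    """Cosine similarity of two sparse vectors (Equation 7.4)."""
    dot = sum(w * q.get(t, 0.0) for t, w in p.items())
    norm = (math.sqrt(sum(w * w for w in p.values()))
            * math.sqrt(sum(w * w for w in q.values())))
    return dot / norm if norm > 0 else 0.0

def user_profile(favorite_vectors):
    """Centroid of the user's latest N favorite items (Equation 7.3)."""
    centroid = Counter()
    for vec in favorite_vectors:
        for t, w in vec.items():
            centroid[t] += w / len(favorite_vectors)
    return dict(centroid)

def favorite_fraction(n_fav, n_read):
    """FF(i, u): favorites over reads for a feed (Equation 7.5)."""
    return n_fav / n_read if n_read > 0 else 0.0

def inverse_update_frequency(n_per_day, n_max):
    """IUF(i) = 1 - n(i) / n_max (Equation 7.6)."""
    return 1.0 - n_per_day / n_max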

7.1.3  Feature Combinations

The aforementioned features—text similarity, update frequency, favorite fraction, and PostRank—all have their merits as well as some shortcomings:

■  Text similarity may work well when a new item is close to the user preference—items that the user has marked as favorites before. However, similarity does not work well with new topics or new events that the user has never seen before or in cases where user interests have shifted.

■  FF always recommends items from feeds that users have read and liked in the past, but other feeds that the user has not read may be ignored.

■  High IUF or PR probably indicates items of high quality. But these two features have nothing to do with a user’s preferences. In some cases, users may have little interest in some items with high IUF or high PR scores.

We consider ways of combining these features for better recommendations because these features present different aspects of a new item. Specifically, we adopt a linear combination of two features for ranking different items:

$Rank = (1 - k)\,f_1 + k\,f_2$

(7.7)

where Rank is the final ranking score for an item; $f_1$ and $f_2$ can be any of the feature values Sim, IUF, FF, and PR; and $k$ is a constant. For instance, we could set $k$ to 0.3, $f_1$ to Sim, and $f_2$ to FF.
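A possible implementation of this ranking step is sketched below; the item records and their feature values are hypothetical.

def rank_score(f1, f2, k=0.3):
    """Linear combination of two normalized feature values (Equation 7.7)."""
    return (1 - k) * f1 + k * f2

# Hypothetical items with precomputed Sim and FF values.
items = [
    {"id": 1, "sim": 0.42, "ff": 0.30},
    {"id": 2, "sim": 0.15, "ff": 0.65},
    {"id": 3, "sim": 0.60, "ff": 0.10},
]
# Rank new items by the combined score, highest first.
items.sort(key=lambda it: rank_score(it["sim"], it["ff"]), reverse=True)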

7.1.4  Performance

We assume that (1) users will like a new item regardless of its position, and (2) users focus on reading while using the RSS reader. Thus, when a user marks certain items as favorites, we can expect that the user would like the same items under other ranking methods. In other words, we can re-rank items with different ranking algorithms and evaluate their performance on previously recorded session data. To make sure that a user was concentrating on a session, we only perform re-ranking on sessions with more than 100 items. We use the following two metrics in our evaluation:

■  Average favorite. For sessions of more than 100 items, we re-rank all items in each session with different algorithms, and then we divide the first 100 items into five equal-sized bins—each bin consisting of 20 items. The average favorite (AF) of bin j for a specific ranking algorithm is defined as:

$AF_j = \frac{\sum_{u,i} \mathrm{fav}(u,i,j)}{\sum_{u,i} 1}$

(7.8)

where fav(u,i,j) is the number of favorites of user u in session i and bin j.

If, on average, most of the favorite items appear in bin 1 and few appear in other bins, then the ranking algorithm performs well. An algorithm performs poorly if the number of favorite items is almost the same across bins, and even worse if the first bins contain fewer favorites than the latter bins.

■  Precision at N. Precision at N measures the accuracy of the top N items for a specific ranking algorithm. We group all items of all sessions together and re-rank them with different algorithms. Given a number N, we calculate a specific algorithm’s precision at N as:

$\mathrm{Prec}(N) = \frac{\text{number of favorites in top } N}{N}$

(7.9)

A ranking method works well if the precision at N is high. (A sketch of computing both metrics follows.)
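Both metrics are straightforward to compute from recorded sessions. A minimal sketch, assuming each session is stored as a ranked list of booleans marking favorites:

def average_favorite(sessions, bin_size=20, n_bins=5):
    """AF_j over the first 100 items of re-ranked sessions (Equation 7.8).

    Each session is a list of booleans: is_fav[rank] for the ranked items.
    """
    totals = [0.0] * n_bins
    for is_fav in sessions:
        for j in range(n_bins):
            totals[j] += sum(is_fav[j * bin_size:(j + 1) * bin_size])
    return [t / len(sessions) for t in totals]

def precision_at_n(is_fav, n):
    """Fraction of favorites among the top N items (Equation 7.9)."""
    return sum(is_fav[:n]) / n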

■  Single feature. This experiment studies the effectiveness of individual features for RSS recommendation. Figure 7.2 illustrates the average number of favorite items in different bins (i.e., the values of $AF_j$) for different features. Among the four proposed features, FF and IUF perform better than Sim and PR. Specifically, FF places more than half (3.81) of the favorite items in the first bin, indicating that users do have preferences among RSS feeds. Sim and IUF both have decreasing AF values over the bins. For ranking by age and by PR, the variation over the five bins is not significant, indicating that neither feature is effective for recommendation.

■  Feature combinations. This experiment studies the effectiveness of different feature combinations. Table 7.1 illustrates the results of six feature-combination methods for the top 10 to top 50 items. The combination of similarity and favorite fraction performs best in all cases, with 80% precision at the top 10; this is because both features are extracted from users' preferences. Combinations using PostRank generally perform poorly.


Figure 7.2 AF of ranking using different features.

Table 7.1 Precision at N for Different Feature Combination Methods of All Users (k = 0.7)


Note:  Bold face values signify the best among all other features.

An extreme example is combining PostRank with inverse update frequency, which results in zero precision. This is because neither feature is related to user preference. Combinations using favorite fraction generally perform better than using other features, indicating it is an important feature for recommendations.

To summarize, we have proposed four different features for recommending new items to users in their subscribed RSS feeds. Our experimental results show that favorite fraction and inverse update frequency are effective and perform better than the simple text similarity between a new item and the preference of a user. On the other hand, PostRank and age-based ranking are ineffective for RSS recommendations. We also study the performance of combining two features together for recommendations. Our evaluation indicates that the combination of text similarity and favorite fraction performs the best.

7.2  A Collaborative Filtering-Based Recommendation

Unlike the content-based approach described previously, CF makes predictions from large-scale item–user matrices and has achieved widespread success in recommender systems such as Amazon and Yahoo! Music. However, CF usually suffers from two fundamental problems: data sparsity and limited scalability. Within the two broad classes of CF approaches, memory-based and model-based, the former usually falls short of system scalability demands because it predicts user preferences over the entire item–user matrix, while the latter often achieves unsatisfactory accuracy because it cannot precisely capture the diversity in user rating styles.

We discuss an efficient CF approach using smoothing and fusing (CFSF). CFSF formulates CF as a local prediction problem by mapping it from the entire large-scale item–user matrix to a locally reduced item–user matrix. Given an active item and an active user, CFSF constructs a local item–user matrix as the basis of prediction. To alleviate data sparsity, CFSF presents a fusion strategy for the local item–user matrix that fuses the ratings the same user gives to similar items and the ratings like-minded users give to the same and to similar items. To reduce the diversity in user rating styles, CFSF uses a smoothing strategy that clusters users over the entire item–user matrix and then smoothes ratings within each user cluster.

7.2.1  Background on Collaborative Filtering

Recommender systems aim at predicting the rating of an active item $i_a$ made by an active user $u_b$ from user profiles. These profiles are represented as a $Q \times P$ item–user matrix $X$, where $Q$ and $P$ are the numbers of items and users. The notations used are the following:

■  $I = \{i_1, i_2, \ldots, i_Q\}$ and $U = \{u_1, u_2, \ldots, u_P\}$ represent the sets of items and users in $X$.

■  $\{C_u^1, C_u^2, \ldots, C_u^L\}$ are $L$ user clusters, where users in each cluster share some similar tastes.

■  $I\{u\}$, $I\{C_u\}$, and $U\{i\}$ denote the set of items rated by user $u$, the set of items rated by user cluster $C_u$, and the set of users who have rated item $i$.

■  $r_{u_b,i_a}$ denotes the score that user $u_b$ gives item $i_a$; $\bar{r}_{i_a}$ and $\bar{r}_{u_b}$ represent the average ratings of item $i_a$ and user $u_b$.

■  Let SI, SU, and SUI be the sets of similar items, like-minded users, and similar items and like-minded users, respectively.

■  SIR, SUR, and SUIR denote predicting user preferences over the entire item–user matrix from the ratings the same user makes on similar items, the like-minded users make on the same item, and the like-minded users make on the similar items.

■  SR represents predicting user preferences from all the ratings (i.e., SIR, SUR, and SUIR).

■  Let SIR′, SUR′, SUIR′, and SR′ be the counterparts of SIR, SUR, SUIR, and SR, but they are calculated over the local item–user matrix.

Then, the item vector of the matrix X is:

$X_i = [\mathbf{i}_1, \mathbf{i}_2, \ldots, \mathbf{i}_Q], \quad \mathbf{i}_q = [r_{1,q}, r_{2,q}, \ldots, r_{P,q}]^T$

where $q \in [1, Q]$. Each column vector $\mathbf{i}_q$ corresponds to the ratings of a particular item $q$ by $P$ users. Matrix $X$ can also be represented by user vectors:

$X_u = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_P], \quad \mathbf{u}_p = [r_{p,1}, r_{p,2}, \ldots, r_{p,Q}]^T$

where $p \in [1, P]$. Each row vector $\mathbf{u}_p^T$ is a user profile that represents a particular user's item ratings. Item-based CF approaches, as illustrated in Figure 7.3a, find similar items among the item vectors and then use the ratings made by the same user to predict his or her preferences. For example, given an active item $i_a$ and a user $u_b$, Equation 7.10 denotes the mechanism of item-based CF approaches, where $sim_{i_a,i_c}$ is the similarity of items $i_a$ and $i_c$, usually computed by the Pearson correlation coefficient (PCC) or vector space similarity (VSS):

$SIR:\; \widehat{r_{u_b,i_a}} = \frac{\sum_{i_c \in SI} sim_{i_a,i_c}\, r_{u_b,i_c}}{\sum_{i_c \in SI} sim_{i_a,i_c}}$

(7.10)

Alternatively, user-based CF approaches use a similar idea to predict user preferences, based on the ratings of like-minded users for the active item. Equation 7.11 shows the mechanism of user-based CF approaches, where $sim_{u_b,u_c}$ is the similarity of users $u_b$ and $u_c$.


Figure 7.3 The CFSF approach. (a) the original item-user rating matrix; (b) the item-user rating matrix produced by CFSF for prediction.

$SUR:\; \widehat{r_{u_b,i_a}} = \frac{\sum_{u_c \in SU} sim_{u_b,u_c}\, r_{u_c,i_a}}{\sum_{u_c \in SU} sim_{u_b,u_c}}$

(7.11)

Neither item-based nor user-based CF approaches consider SUIR for accuracy improvement. Let $i$ be an item similar to $i_a$ and $u$ be a user who is like-minded to $u_b$; SUIR is calculated as:

$SUIR:\; \widehat{r_{u_b,i_a}} = \frac{\sum_{(u,i) \in SUI} sim_{(i,i_a),(u,u_b)}\, r_{u,i}}{\sum_{(u,i) \in SUI} sim_{(i,i_a),(u,u_b)}}$

(7.12)

where $sim_{(i,i_a),(u,u_b)}$ is the weight for the rating user $u$ makes on item $i$, denoting how much the rating $r_{u,i}$ is considered in prediction. The weight in CFSF is defined in Equation 7.21.

Finally, UI-based CF approaches [6,20] have proposed combining SIR, SUR, and SUIR with a fusion function. However, because searching for active items and users over the entire item–user matrix is time-consuming, all memory-based CF approaches achieve limited scalability.
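As a concrete illustration of memory-based CF, the sketch below computes PCC item similarity (Equation 7.13) and an SIR-style prediction (Equation 7.10) on a dense ratings matrix where 0 denotes a missing rating. Taking item means over co-rating users and using the absolute value in the denominator are simplifications of ours.

import numpy as np

def pcc_item_sim(R, a, b):
    """PCC similarity of items a and b over co-rating users (Equation 7.13)."""
    both = (R[:, a] > 0) & (R[:, b] > 0)
    if both.sum() < 2:
        return 0.0
    da = R[both, a] - R[both, a].mean()
    db = R[both, b] - R[both, b].mean()
    denom = np.sqrt((da ** 2).sum() * (db ** 2).sum())
    return float((da * db).sum() / denom) if denom > 0 else 0.0

def predict_sir(R, u, a, similar_items):
    """SIR-style prediction of user u's rating of item a (Equation 7.10)."""
    num = den = 0.0
    for c in similar_items:
        if R[u, c] > 0:
            s = pcc_item_sim(R, a, c)
            num += s * R[u, c]
            den += abs(s)  # abs() guards against negative similarities
    return num / den if den > 0 else 0.0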

7.2.2  CFSF Approach

7.2.2.1  Overview

Figure 7.3 explains how CFSF derives from traditional CF approaches. First, CFSF creates a global item similarity (GIS) matrix in a memory-based manner (i.e., by searching over the entire item–user matrix). Then, CFSF classifies users into clusters, within each of which unrated items are smoothed. These two steps significantly reduce the influence of rating diversity and accelerate the selection of like-minded users. Upon a request for an active user from the recommender system, CFSF picks the top M similar items from GIS and the top K like-minded users from the user clusters and extracts the related ratings to create the local item–user matrix. Finally, it predicts user preferences by fusing the ratings SIR′, SUR′, and SUIR′ over the local item–user matrix. Figure 7.3 also shows that CFSF significantly cuts down the number of users and items involved in prediction.

To cope with the increasing number of items and users in recommender systems, CFSF divides the CF process into offline and online phases. The offline phase involves three steps: creating the GIS, clustering users, and smoothing user ratings; this phase is time-consuming. The online phase includes the remaining three steps: selecting items and users, constructing a local item–user matrix, and predicting preferences. Each user in CFSF has to rate a certain number of items before recommendations can be made for him or her.

7.2.2.2  Offline Phase

The three steps of the offline phase are:

1.  Creating GIS. In this step, CFSF creates GIS to store the item similarity matrix and to reduce the diversity in item ratings; popular items tend to get higher ratings than unpopular ones. PCC, rather than pure cosine similarity (PCS), is selected as the item similarity function because PCS does not account for this diversity. Given items $i_a$ and $i_b$ and $U = U\{i_a\} \cap U\{i_b\}$, the similarity between $i_a$ and $i_b$ is defined as:

$sim_{i_a,i_b} = \frac{\sum_{u \in U}(r_{u,i_a} - \bar{r}_{i_a})(r_{u,i_b} - \bar{r}_{i_b})}{\sqrt{\sum_{u \in U}(r_{u,i_a} - \bar{r}_{i_a})^2}\,\sqrt{\sum_{u \in U}(r_{u,i_b} - \bar{r}_{i_b})^2}}$

(7.13)

Given the large number of items, we set thresholds for Equation 7.13 to filter out less important items, which greatly reduces the size of GIS.

2.  Clustering users. To reduce the diversity in user ratings, CFSF uses the K-means algorithm to cluster users into clusters $C_i$ and then smoothes ratings within each cluster. K-means iteratively assigns every user to the cluster whose centroid is closest to him or her; the time complexity of each iteration is linear in the size of the dataset. Compared with other clustering methods, K-means is simple, fast, and accurate. Its objective is to minimize $\sum_{i=1}^{k}\sum_{u_j \in C_i} sim|u_j - \bar{u}|$, where $\bar{u}$ is the centroid of the users that belong to $C_i$. The similarity $sim|u_j - \bar{u}|$ is based on the PCC similarity function of Equation 7.14, where $u_a$ and $u_b$ are users and $I = I\{u_a\} \cap I\{u_b\}$ denotes the set of items that both users have rated:

$sim_{u_a,u_b} = \frac{\sum_{i \in I}(r_{u_a,i} - \bar{r}_{u_a})(r_{u_b,i} - \bar{r}_{u_b})}{\sqrt{\sum_{i \in I}(r_{u_a,i} - \bar{r}_{u_a})^2}\,\sqrt{\sum_{i \in I}(r_{u_b,i} - \bar{r}_{u_b})^2}}$

(7.14)

Thus, user clusters are generated for smoothing user ratings and for selecting the top K like-minded users.

3.  Smoothing user ratings within Ci. There are two reasons for smoothing unrated data. First, users in the same cluster share similar tastes but have dissimilar rating styles; and so, the same item is rated differently by users belonging to the same cluster. This diversity in user ratings negatively affects the prediction accuracy. Second, rating data are quite sparse because users prefer not to rate items and also cannot rate all items due to the overwhelming number of items in recommender systems. Consequently, the prediction accuracy is unsatisfactory without capturing such diversity. CFSF uses a smoothing strategy similar to SCBPCC [7] to fill unrated data. The smoothing function is defined as:

$r_{u,i} = \begin{cases} r_{u,i}, & \text{if } u \text{ rates } i \\ \bar{r}_u + \Delta r_{C_{u'},i}, & \text{otherwise} \end{cases}$

(7.15)

where $\Delta r_{C_{u'},i}$ is the average deviation of the ratings of item $i$ within $C_{u',i}$, the set of users in cluster $C_{u'}$ who have rated item $i$; it is given as Equation 7.16:

$\Delta r_{C_{u'},i} = \frac{1}{|C_{u',i}|}\sum_{u \in C_{u',i}} (r_{u,i} - \bar{r}_u)$

(7.16)

where $|C_{u',i}|$ is the size of $C_{u',i}$.

After smoothing, CFSF creates an iCluster for each user to store the user's similarity to each user cluster. The iClusters are sorted in descending order and are used for selecting the top K like-minded users, which speeds up selection considerably. The feature of a user cluster is its centroid, which represents the average rating over all users in the cluster. Given an item set $I = I\{u_a\} \cap I\{C_{u'}\}$, the similarity between user $u_a$ and cluster $C_{u'}$ is defined as Equation 7.17; thus, we obtain the iCluster for each user:

$sim_{u_a,C_{u'}} = \frac{\sum_{i \in I}(r_{u_a,i} - \bar{r}_{u_a})\,\Delta r_{C_{u'},i}}{\sqrt{\sum_{i \in I}(\Delta r_{C_{u'},i})^2}\,\sqrt{\sum_{i \in I}(r_{u_a,i} - \bar{r}_{u_a})^2}}$

(7.17)

So far, we have described the steps of the offline phase, which are compute-intensive and hence performed in the back end; a sketch of the smoothing step appears below. The online phase of CFSF focuses on responding to requests, including constructing a local item–user matrix and fusing the ratings for prediction.
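The following compact sketch implements the smoothing step (Equations 7.15 and 7.16). It uses scikit-learn's Euclidean K-means as a stand-in for the PCC-based clustering described above, which is a simplification of ours.

import numpy as np
from sklearn.cluster import KMeans

def smooth_ratings(R, n_clusters=30):
    """Fill unrated cells via cluster deviations (Equations 7.15 and 7.16).

    R is a users x items array with 0 for missing ratings.
    """
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(R)
    rated = R > 0
    # Average rating of each user over the items he or she actually rated.
    user_mean = R.sum(axis=1) / np.maximum(rated.sum(axis=1), 1)
    S = R.astype(float).copy()
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        for i in range(R.shape[1]):
            raters = members[rated[members, i]]
            if raters.size == 0:
                continue  # nobody in this cluster rated item i
            delta = np.mean(R[raters, i] - user_mean[raters])  # Equation 7.16
            missing = members[~rated[members, i]]
            S[missing, i] = user_mean[missing] + delta          # Equation 7.15
    return S, labels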

7.2.2.3  Online Phase

In general, user preferences are most likely derived from the most similar items and like-minded users. CFSF creates the local item–user matrix containing the most related users and items, and thus, yields significant savings in CPU time and memory cost. When a request comes, CFSF will pick up the top M similar items from GIS, the top K like-minded users from user clusters C, and will extract related ratings from the original item–user matrix.

■  Selecting top M similar items. Recall that CFSF sorts the result as GIS in descending order when it creates a GIS matrix. Consequently, CFSF can directly pick up the top M similar items from GIS.

■  Selecting top K like-minded users. User preferences are often scattered into several user clusters. Take movies, for example; user u may like action, fantasy, and crime films. To cover user preferences as much as possible, CFSF selects a user candidate set and then selects the top K like-minded users. To create a user candidate set, CFSF selects users from clusters in iCluster one by one. To select the top K like-minded users, CFSF varies the similarity function according to two types of ratings: original and smoothed ratings. It introduces a parameter w to differentiate these ratings. Given user u and active user ua, the similarity function of selecting the top K like-minded users is defined as:

$sim_{u_a,u} = \frac{\sum_{i \in I\{u_a\}} w_{u,i}(r_{u,i} - \bar{r}_u)(r_{u_a,i} - \bar{r}_{u_a})}{\sqrt{\sum_{i \in I\{u_a\}} w_{u,i}^2 (r_{u,i} - \bar{r}_u)^2}\,\sqrt{\sum_{i \in I\{u_a\}} (r_{u_a,i} - \bar{r}_{u_a})^2}}$

(7.18)

where the summations run over $i \in I\{u_a\}$ and $w_{u,i}$ is a weighting coefficient, defined in Equation 7.19, that varies depending on whether the rating is original or smoothed:

$w_{u,i} = \begin{cases} \varepsilon, & \text{if } u \text{ rates } i \\ 1 - \varepsilon, & \text{otherwise} \end{cases}$

(7.19)

Compared with previous methods, CFSF reduces the computation overhead by selecting the like-minded users from iCluster rather than from the entire item–user matrix. Moreover, CFSF is capable of setting thresholds for Equation 7.18, which will further reduce the computation overhead. After the top M similar items and the top K like-minded users are selected, CFSF will extract related ratings from the original item–user matrix to fill the local item–user matrix.

■  Fusing SIR′, SUR′, and SUIR′. CFSF defines SIR′, SUR′, and SUIR′ as predictions of user preferences over the local item–user matrix from the ratings the same user makes on similar items and the ratings like-minded users make on the same and on similar items. Because M and K are much smaller than Q and P, SIR′, SUR′, and SUIR′ are computed much faster than SIR, SUR, and SUIR, which are calculated over the entire large-scale item–user matrix. Given active item $i_a$ and user $u_b$, Equation 7.20 illustrates these definitions:

$SIR' = \frac{\sum_{s=1}^{M} w\, sim_{i_s,i_a}\, r_{u_b,i_s}}{\sum_{s=1}^{M} w\, sim_{i_s,i_a}}$
$SUR' = \frac{\sum_{t=1}^{K} w\, sim_{u_t,u_b}\,(r_{u_t,i_a} - \bar{r}_{u_t})}{\sum_{t=1}^{K} w\, sim_{u_t,u_b}} + \bar{r}_{u_b}$
$SUIR' = \frac{\sum_{t=1}^{K}\sum_{s=1}^{M} w\, sim_{(i_s,i_a),(u_t,u_b)}\, r_{u_t,i_s}}{\sum_{t=1}^{K}\sum_{s=1}^{M} w\, sim_{(i_s,i_a),(u_t,u_b)}}$

(7.20)

where $w$ is defined in Equation 7.19 and $sim_{(i_s,i_a),(u_t,u_b)}$, defined by the Euclidean-style combination in Equation 7.21, denotes the weight for the rating of similar item $i_s$ by like-minded user $u_t$.

$sim_{(i_s,i_a),(u_t,u_b)} = \frac{sim_{i_s,i_a} \cdot sim_{u_t,u_b}}{\sqrt{sim_{i_s,i_a}^2 + sim_{u_t,u_b}^2}}$

(7.21)

When predicting user preferences, CFSF selects SUR′ as the major predictor and SIR′ and SUIR′ as supplements. It introduces two parameters, λ and δ, to balance the impact of SIR′, SUR′, and SUIR′. The fusion function of CFSF is defined as:

$SR:\; \widehat{r_{u_b,i_a}} = (1-\delta)(1-\lambda)\, SIR' + (1-\delta)\lambda\, SUR' + \delta\, SUIR'$

(7.22)

where λ and δ are between 0 and 1.
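Once SIR′, SUR′, and SUIR′ have been computed over the local matrix, the fusion itself is a one-liner; the sketch below also includes the pair weight of Equation 7.21. The default parameter values are those used later in Section 7.2.3.

import math

def pair_weight(sim_item, sim_user):
    """Weight of a (similar item, like-minded user) rating (Equation 7.21)."""
    denom = math.sqrt(sim_item ** 2 + sim_user ** 2)
    return sim_item * sim_user / denom if denom > 0 else 0.0

def fuse(sir, sur, suir, lam=0.8, delta=0.1):
    """SR fusion of SIR', SUR', and SUIR' (Equation 7.22)."""
    return (1 - delta) * (1 - lam) * sir + (1 - delta) * lam * sur + delta * suir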

To summarize, the offline phase of CFSF has a high cost due to the creation of the GIS matrix and clustering. To reduce computation overhead in the offline phase, CFSF sets thresholds to filter items. In the online phase, the time complexity of CFSF is O(MK), where M and K are the number of similar items and like-minded users. Considering that M and K are much less than the original sizes of an item–user matrix, the online phase of CFSF is scalable.

7.2.3  Performance of CFSF

We evaluated the performance of CFSF with the MovieLens dataset [21] from the University of Minnesota. We randomly extracted 500 users from MovieLens, where each user rated at least 40 movies. We changed the size of the training set by selecting the first 100, 200, and 300 users, denoted as ML_100, ML_200, and ML_300, respectively. We selected the last 200 users as the test set. We varied the number of items rated by active users from 5 to 10 to 20, denoted as Given5, Given10, and Given20, respectively. Table 7.2 summarizes the statistical features of the datasets used in our experiments.

7.2.3.1  Metrics

To be consistent with experiments reported in the literature [5–7,22,23], we use the mean absolute error (MAE) as the evaluation metric, defined as:

$MAE = \frac{\sum_{(u,i) \in T} |r_{u,i} - \widehat{r}_{u,i}|}{|T|}$

(7.23)

where $r_{u,i}$ denotes the rating that user $u$ gives item $i$, $\widehat{r}_{u,i}$ is the predicted rating, $T$ represents the test set, and $|T|$ is its size. The smaller the MAE, the better the performance.

7.2.3.2  Overall Performance

We carried out experiments from two aspects to evaluate the performance of CFSF. One aspect is to compare CFSF with traditional memory-based CF approaches: an item-based approach using PCC (SIR) and a user-based approach using PCC (SUR). For MovieLens, the parameters of CFSF are set as follows: C = 30, λ = 0.8, δ = 0.1, K = 25, M = 95, and w = 0.35. Table 7.3 illustrates the results, showing that CFSF considerably outperforms the SUR and SIR with respect to prediction accuracy.

We also compare CFSF with other state-of-the-art CF approaches: AM [24], EMDP [20], PD [25], SCBPCC [7], and SF [6]. We varied the number of items each user was required to rate on all the test sets of the MovieLens dataset. The results are shown in Table 7.4. As the training set grows, the MAEs of all approaches show a downward trend; a similar trend is observed as the number of rated items per user increases from 5 to 20. Among all approaches, CFSF achieves the best accuracy because it can select the most similar items and like-minded users.

Table 7.2 Statistics of the Datasets

                                      MovieLens
No. of users                          500
No. of items                          1000
Average no. of rated items per user   94.4
Density of data                       9.44%
Scale of ratings                      5

Table 7.3 MAE on MovieLens for the SIR, SUR, and CFSF

Training Set   Methods   Given5   Given10   Given20
ML_300         CFSF      0.743    0.721     0.705
               SUR       0.838    0.814     0.802
               SIR       0.870    0.838     0.813
ML_200         CFSF      0.769    0.734     0.713
               SUR       0.843    0.822     0.807
               SIR       0.855    0.834     0.812
ML_100         CFSF      0.781    0.758     0.746
               SUR       0.876    0.847     0.811
               SIR       0.890    0.801     0.824

Note:  Bold face values signify the best MAE among all methods.

Table 7.4 MAE on MovieLens for the State-of-the-Art CF Approaches

Training Set   Methods   Given5   Given10   Given20
ML_300         CFSF      0.743    0.721     0.705
               AM        0.820    0.822     0.796
               EMDP      0.788    0.754     0.746
               SCBPCC    0.822    0.810     0.778
               SF        0.804    0.761     0.769
               PD        0.827    0.815     0.789
ML_200         CFSF      0.769    0.734     0.713
               AM        0.849    0.837     0.815
               EMDP      0.793    0.760     0.751
               SCBPCC    0.831    0.813     0.784
               SF        0.827    0.773     0.783
               PD        0.836    0.815     0.792
ML_100         CFSF      0.781    0.758     0.746
               AM        0.963    0.922     0.887
               EMDP      0.807    0.769     0.765
               SCBPCC    0.848    0.819     0.789
               SF        0.847    0.774     0.792
               PD        0.849    0.817     0.808

Note:  Bold face values signify the best MAE among all methods.

7.3  Preference-Based Top-K Recommendation in Social Networks

Social networks are social structures representing the relationships among people. Examples include online social networks such as Facebook, mobile phone networks [26], and scientific collaboration networks [27]. A social network is an important medium for spreading information and works via the word-of-mouth or viral marketing approach [28,29]: selected users influence their friends in the network, those friends influence their friends, and eventually a large number of users may choose the product. Besides marketing, other important applications on social networks include finding the most important blogs [30] and searching for domain experts [31,32].

All of the aforementioned applications can be generalized as the problem of finding top-K influential nodes in networks (i.e., which K users should be selected so that they eventually influence the most people in the network). More formally, this can be formulated as a discrete optimization problem called an influence maximization problem [33]; for a parameter K and a social network, where influence is propagated in it according to a stochastic cascade model, find a K-node set that eventually has the maximum influence.

The influence maximization problem was first studied in the context of viral marketing strategies [34,35] and revenue maximization [28]. Kempe et al. [33] formulate influence maximization as a discrete optimization problem and prove it to be NP-hard. They present a greedy approximation algorithm and prove that it approximates the optimal solution to within a factor of (1−1/e).

The algorithm of Kempe et al. is inefficient, and more efficient algorithms [26,27,30,36] have since been proposed that scale to larger social networks. CELF [30] optimizes the greedy algorithm by exploiting the submodularity property to reduce the number of influence-spread evaluations, an optimization we also adopt in this work. NewGreedy [36] removes edges that do not contribute to propagation to obtain a smaller graph. MixedGreedy [36] combines CELF and NewGreedy, using NewGreedy in the first round and CELF in the remaining rounds. CGA [26] first detects communities in a social network and then employs a dynamic programming algorithm to select communities when finding influential users. Kimura and Saito [27] propose shortest-path-based influence cascade models and provide efficient algorithms for computing influence spread under these models; however, they do not directly address the efficiency issue of the greedy algorithms studied by Kempe et al. because their influence cascade models are different.

None of these approaches considers user preferences during influence diffusion; they all use a uniform probability. However, some studies [37,38] have found that different topics diffuse differently through a network because users have different preferences, which affect their roles in the spread of a specific topic [39]. Although Tang et al. [39] combine topic and social influence analysis and demonstrate that different topics lead to different influence results, no prior work on the influence maximization problem takes user preferences into account. This drawback greatly affects the accuracy of the greedy approximation algorithms proposed by Kempe et al. and others.

We study the influence maximization problem considering user preferences. We propose a two-stage algorithm, called the greedy algorithm based on user preferences (GAUP), for mining the top-K influential users for a given topic. The first stage calculates each user's preference for a specific topic by singular value decomposition (SVD)-based latent semantic indexing (LSI) [40]. The second stage combines traditional greedy algorithms with the user preferences calculated in the first stage and computes an approximate solution to the influence maximization problem for the topic. Our evaluation on an academic social network demonstrates that GAUP effectively maximizes influence spread on a topic. We have compared GAUP with an SVD-based CF algorithm and with HITS for expert search: GAUP is more likely to find the most influential domain experts than CF, and it is more reliable than HITS because HITS suffers from topic drift.

7.3.1  Problem Formulation

We formulate the problem as follows. A social network is an undirected graph G = (V,E). Vertices in V are the nodes in the network, and edges in E model the relationship between nodes. There is a set of users’ documents with topic labels. Given a number K and a topic label T, the task is to generate a seed set S of K vertices, with the objective that influence spread is as large as possible for topic T from the seed set.

We extract an academic coauthor network from DBLP, where G is a coauthorship graph, vertices are authors of academic papers, and an edge indicates that the corresponding two authors have coauthored a paper. Parallel edges between two vertices denote the number of papers coauthored by the two authors. In DBLP, we can get information on each paper, including authors and conferences. We regard each conference as a distinct topic and each paper as a labeled document. Table 7.5 lists the notations used in this section.

When considering models for influence spread through a social network, we often say a node is either active or inactive. Active nodes are those that are influenced by other active nodes and that are able to influence their inactive neighbors when they become active; inactive nodes are those that have not been influenced by their active neighbors. The influence diffusion from the perspective of an initially inactive node v unfolds as follows: As time progresses, more and more of v’s neighboring nodes become active; v may become active at some point because of this; and v’s activation may also affect its neighboring nodes. Each node can switch from an inactive to active state but not the opposite.

Table 7.5 Notations

Terms        Descriptions
G = (V, E)   A graph G with vertex set V and edge set E
K            Number of seeds to be selected
R            Number of rounds of simulations
T            A selected topic (conference)
p            The uniform propagation probability
p_{v,w}      The propagation probability that node v activates node w
C_{a,T}      Node a's preference on topic T
S            The seed set of K influential nodes
IS(S)        Influence spread of S
ISST(S,T)    Influence spread of S on a specific topic T

Popular influence diffusion models include the linear threshold model and the independent cascade (IC) model. In the linear threshold model, a node v becomes active if the combined influence of its active neighbors exceeds a threshold. All nodes that were active in step t−1 remain active in step t. The threshold for each node is uniformly distributed within the interval [0,1] and represents the fraction of v's neighbors that must be active in order to activate v.

In the IC model, the diffusion process also unfolds in discrete steps: when a node v becomes active in step t−1, it is given a single chance to activate each of its inactive neighbors with a probability p, independent of the history so far. If multiple nodes try to activate the same node, their attempts are sequenced in an arbitrary order. If an attempt succeeds, the activated node becomes active in step t. Whether or not v succeeds, it cannot make any further attempts to activate its neighbors in subsequent rounds. The process ends when no more activations are possible.

The activation probability p in the work of Kempe et al. is uniform across all edges of the graph. If nodes v and w have $c_{v,w}$ parallel edges, v has a total probability of $1-(1-p)^{c_{v,w}}$ to activate w once v becomes active.
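To make the cascade process concrete, here is a minimal Python sketch of one IC run (ours, not from the original study). The dictionary edge_prob[(v, w)] is assumed to already fold in parallel edges via $1-(1-p)^{c_{v,w}}$ and to be present for both directions of an undirected edge.

import random

def ic_cascade(neighbors, edge_prob, seeds):
    """Simulate one independent cascade (IC) run; returns the activated set."""
    active = set(seeds)
    frontier = list(seeds)
    while frontier:
        newly_active = []
        for v in frontier:
            for w in neighbors[v]:
                # Each node gets one activation chance per neighbor, right
                # after it becomes active, independent of history.
                if w not in active and random.random() < edge_prob[(v, w)]:
                    active.add(w)
                    newly_active.append(w)
        frontier = newly_active
    return active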

These models do not consider topics or users' preferences for different topics. Intuitively, the stronger the preferences of v and its neighboring nodes for topic T, the more likely influence on that topic will propagate. Thus, we make the activation probability not uniform but related to users' preferences for a topic T.

7.3.2  Computing User Preferences

To compute user preferences for a specific topic, we adopt the SVD-based LSI model that can project user preferences into a reduced latent space. By assigning a weight for each latent item, our algorithm can compute the preference value of a user for a given topic.

SVD [41,42] is a well-known matrix factorization method that factors an n×c matrix R into three matrices:

$R = U \Sigma V^T$

(7.24)

where U and V are two orthogonal matrices of size $n \times n$ and $c \times c$, respectively, and $\Sigma$ is a diagonal matrix of size $n \times c$ that has all singular values of matrix R as its diagonal entries. The rank of R and $\Sigma$ is r ($r \le c \le n$). All the entries of $\Sigma$ are nonnegative and are stored in decreasing order of their magnitude.

Typical usage of SVD keeps the k largest diagonal values of $\Sigma$, producing a rank-k approximation matrix $R_k$. Over all rank-k matrices, $R_k$ minimizes the Frobenius norm $\|R - R_k\|_F$; in other words, SVD provides the best lower-rank approximations of the original matrix R. The value of k should be large enough to capture the important structures in the original matrix and small enough to avoid overfitting.

We use SVD to compute user preferences, which are represented as low-dimensional latent features learned from a user–topic matrix. Specifically for a given user–topic matrix M, our algorithm computes preference prediction using the following steps:

1.  Factor M using SVD to obtain U, Σ, and V.

2.  Reduce the matrix Σ to dimension k.

3.  Compute the square root of the reduced matrix $\Sigma_k$ to obtain $\Sigma_k^{1/2}$.

4.  Compute two matrices X and Y, where $X = U_k \Sigma_k^{1/2}$ and $Y = \Sigma_k^{1/2} V_k^T$.

5.  Predict user a’s preference on topic T as follows:

$C_{a,T} = C_0 + X(a) \cdot Y(T)$

(7.25)

X is an $n \times k$ matrix describing the authors' preferences in the k-dimensional latent space (i.e., the weights of the k latent features for each author); X(a) denotes the ath row of X, representing user a's weights for the k latent features. Y is a $k \times c$ matrix representing the relationship between the k-dimensional latent space and the c topics; Y(T) denotes the Tth column of Y, representing the k latent features' weights for topic T. $C_0$ is a constant.

To compute the predictions, we simply calculate the dot product of the ath row of X and the Tth column of Y and add $C_0$ back. $C_{a,T}$ is the predicted preference of author a for topic T.
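The five steps map directly onto a few lines of NumPy, as in the sketch below. We treat $C_0$ as a global baseline subtracted from M before factorization and added back in Equation 7.25; this interpretation is our assumption, since the chapter only states that $C_0$ is a constant.

import numpy as np

def latent_factors(M, k, c0=3.0):
    """Steps 1-4: factor the user-topic matrix and build X and Y (Eq. 7.24)."""
    # c0 is assumed to be a baseline removed before factorization.
    U, s, Vt = np.linalg.svd(M - c0, full_matrices=False)
    root = np.diag(np.sqrt(s[:k]))   # Sigma_k^(1/2)
    X = U[:, :k] @ root              # n x k: users in the latent space
    Y = root @ Vt[:k, :]             # k x c: latent features to topics
    return X, Y

def preference(X, Y, a, T, c0=3.0):
    """Step 5: C_{a,T} = C0 + X(a) . Y(T) (Equation 7.25)."""
    return c0 + float(X[a, :] @ Y[:, T])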

Table 7.6 Translation Rules for Converting Publication Counts to 1–5 Ratings

# in this conference/personal total #   Ratings
0                                       0
(0, 0.02]                               1
(0.02, 0.05]                            2
(0.05, 0.1]                             3
(0.1, 0.15]                             4
(0.15, 1)                               5

From the DBLP dataset, we can derive a user–topic matrix M, where rows are authors and columns are conferences. The question is how to define the values in M. Inspired by the rating matrix used in CF, entries in M are rated on a five-star scale, where a higher score means that the user views the conference more favorably. Table 7.6 lists the rules for this translation.

Following these rules, we obtain the user–topic matrix M, where $M_{ij}$ denotes the preference of author i for conference j. A zero value means author i is not interested in conference j or has not published there. (A sketch of the conversion appears below.)
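A direct encoding of Table 7.6 might look as follows.

def publication_rating(n_in_conference, n_total):
    """Map an author's publication fraction to a 0-5 rating (Table 7.6)."""
    if n_total == 0 or n_in_conference == 0:
        return 0
    frac = n_in_conference / n_total
    for bound, rating in [(0.02, 1), (0.05, 2), (0.1, 3), (0.15, 4)]:
        if frac <= bound:
            return rating
    return 5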

The resulting matrix is large and extremely sparse, and performing SVD on such a matrix is computationally intensive. To address this, we remove authors who have published in fewer than 10 conferences, significantly reducing the size of the matrix. The intuition is that we are interested in the most influential people among all authors, and these tend to publish heavily at many different conferences.

7.3.3  Greedy Algorithm for Mining Top-K Nodes

In the traditional IC model, the activation probability is uniform across all edges. However, such a uniform probability cannot capture users' preferences for different topics. For mobile social networks, Wang et al. [26] extend the IC model to accommodate contact frequency. Similarly, GAUP considers that the probability of influence is associated not only with influence frequency but also with user preferences: the stronger the preferences of v and w for topic T, the more likely the influence will happen. To describe this property, we define $p_{v,w}$ as follows:

$p_{v,w} = p \cdot F(C_{v,T}, C_{w,T})$

(7.26)

This new influence model is called the extended independent cascade (EIC) model. In this formula, the original uniform probability p is weighted by a user preference function F(·); $C_{v,T}$ and $C_{w,T}$ are calculated by Equation 7.25. In our experiments, we choose $F(C_{v,T}, C_{w,T})$ to be the square of the product of $C_{v,T}$ and $C_{w,T}$. When nodes v and w have $c_{v,w}$ parallel edges, v has a total probability of $1-(1-p_{v,w})^{c_{v,w}}$ to activate w once v becomes active.

Mining top-K influential nodes in the new EIC model is NP-hard but can be approximated with a greedy hill-climbing algorithm. Nemhauser et al. [43,44] have shown that if the influence function f(⋅) is submodular, nonnegative, and monotone, the NP-hard influence maximization problem can be approximated by a greedy hill-climbing algorithm within a factor of (1−1/e). The EIC model is an edge-weighted version of an IC model, for which Kempe et al. [33] prove that the influence function is submodular. The objective function is obviously nonnegative and monotone. Thus, we can use the greedy strategy to obtain a (1−1/e−ε) approximation [33].

To compute the influence spread when mining the seed set and when evaluating the performance of algorithms, we use Monte Carlo simulations. Let S be the seed set computed by the greedy algorithm and RS(S) denote the set of nodes activated by the influence cascade; the influence spread after one random run is |RS(S)|. For any topic T, we obtain a user-preference vector UP(T) from the previous step. GAUP takes the graph G, the number K, and the vector UP(T) as input and generates a seed set S of cardinality K such that the expected influence spread of S on topic T is as large as possible.
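Estimating the influence spread then amounts to averaging cascade sizes over R random runs, reusing the ic_cascade sketch from Section 7.3.1:

def influence_spread(neighbors, edge_prob, seeds, rounds=10000):
    """Monte Carlo estimate of the expected influence spread |RS(S)|."""
    total = 0
    for _ in range(rounds):
        total += len(ic_cascade(neighbors, edge_prob, seeds))
    return total / rounds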

The idea of the greedy algorithm is to run for K rounds, where each round finds the vertex that maximizes the incremental influence spread. Algorithm 7.1 describes the details of GAUP. First, we calculate the propagation probability of each edge in the social network using the vector UP(T), based on the EIC model (lines 5–8). Then, for each vertex, we compute its influence spread and insert the vertex into a priority queue sorted by influence spread (lines 10–17). The first vertex chosen is the head element of the queue (lines 19–21). For the remaining K−1 vertices, the algorithm first pops a vertex from the priority queue. If the vertex has already been visited in the current round, we have found the vertex for this round (lines 25–30). Otherwise, the increase in influence spread from adding this node to the current set S is computed, and the vertex is inserted back into the priority queue (lines 32–37).

Note that lines 23–39 implement the CELF optimization [30], which takes advantage of the submodularity property of the influence maximization objective. By exploiting submodularity during each round, the incremental influence spread of a large number of nodes does not need to be re-evaluated when their values in the previous round are less than the values of some other nodes in the current round. Previous work [30] has shown that this optimization can improve the performance by a factor of up to 700 times.

Algorithm 7.1 The Greedy Algorithm for Mining Top-K Influential Nodes

1: Input: graph G = (V, E), number K, topic T, user preference vector U P (T)

2: Output: a seed set S

3:

4: Initialize S = ∅, R = 10000, Q = ∅, current_is = 0

5: // Compute activation probability of each edge

6: for each edge (v, w) ∈ E do

7: pv,w = p ⋅ F (Cv,T, Cw,T)

8: end for

9:

10: // Compute influence spread of each vertex

11: for each vertex vV do

12: isv = 0

13: for j = 1 to R do

14: isv += |RS({v})| / R

15: end for

16: insert node(v, isv) to Q

17: end for

18:

19: // 1st vertex is the one that has the largest influence spread

20: node = pop(Q)

21: S = {node.v}, current_is = node.is

22:

23: for i = 2 to K do

24: while true do

25: node = pop(Q)

26: if node.v has been visited in this round i then

27: S = S ∪ {node.v}

28: current_is += node.is

29: break

30: end if

31:

32: v = node.v, is = 0

33: for j = 1 to R do

34: is += |RS(S ∪ {v})|

35: end for

36: is = is / R − current_is

37: insert node(v, is) to Q

38: end while

39: end for

7.3.4  Performance

We conduct experiments on an academic coauthorship dataset from DBLP, which provides bibliographic information on major computer science journals and proceedings. DBLP indexes more than one million articles. In the extracted coauthor network, each node represents an author, and an edge represents that the corresponding two authors have collaborated once. Parallel edges mean that the authors have collaborated more than once. We only select the papers published in conferences and extract the authors who have published in at least 10 conferences. The generated network with papers from 411 conferences contains 8627 nodes and 91,574 edges.

7.3.4.1  Influence Model

We conduct experiments to compare the state-of-the-art greedy algorithm using the IC model with GAUP using the EIC model. In the IC model, the probability p is set to 0.01 [33]. In the EIC model, the probability $p_{v,w}$ is computed by Equation 7.26.

7.3.4.2  Algorithms and Metrics

Table 7.7 lists the algorithms compared in the experiments. GA and GAUP are the greedy algorithms described in Section 7.3.3. CF is the SVD-based collaborative filtering algorithm, which selects the nodes with the largest $C_{v,T}$ values as the seed set. For expert search, we also consider the HITS algorithm [45]. For the coauthors of a paper, we add links pointing to the first author from all other coauthors.

Table 7.7 The Algorithms

Algorithms   Descriptions
GAUP         Greedy algorithm using LSI-based user preferences
GA           The state-of-the-art greedy algorithm
CF           Collaborative filtering algorithm based on SVD
Random       Baseline of choosing random nodes in the graph
HITS         Hyperlink-induced topic search algorithm [45]

For evaluation, the following metrics are used:

■  ISST(S,T): Influence spread of seed set S on a specific topic T. This metric evaluates the influence on a specific topic, taking user preferences into account. ISST is defined in Equation 7.27, where v ranges over all nodes activated by seed set S.

$ISST(S,T) = \sum_{v \in RS(S)} C_{v,T}^2$

(7.27)

■  IS(S): The traditional influence spread of seed set S that does not take user preferences into account.

To obtain IS(S) and ISST(S,T) for each seed set S, we run Monte Carlo simulations (using the corresponding IC or EIC model) 10,000 times and take the average. For all the algorithms, we compare IS(S) and ISST(S,T) with seed set sizes of 5, 10, 15, and 20.
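ISST can be estimated with the same Monte Carlo machinery, summing squared topic preferences over the activated nodes (Equation 7.27). In the sketch below, pref[v] stands for $C_{v,T}$, and ic_cascade is the earlier sketch:

def isst(neighbors, edge_prob, seeds, pref, rounds=10000):
    """Monte Carlo estimate of ISST(S, T) from Equation 7.27."""
    total = 0.0
    for _ in range(rounds):
        total += sum(pref[v] ** 2
                     for v in ic_cascade(neighbors, edge_prob, seeds))
    return total / rounds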

7.3.4.3  Study on Influence Spread

We evaluate the performance of the proposed GAUP algorithm in a real coauthor network and compare GAUP with the other algorithms on the specific topic of SIGCOMM. Figure 7.4 shows the ISST of the various algorithms under the EIC model. We can observe that the random baseline performs very badly. GAUP performs significantly better than the other algorithms when K ≥ 15: it outperforms GA by about 10% when K is 15 and by more than 30% when K is 20, because GAUP can find more important nodes with respect to the specified topic. When K is 5 or 10, GA finds nodes that can activate more nodes than GAUP (shown in Figure 7.5); as a result, the cumulative influence of GA on the topic exceeds that of GAUP. CF performs worse than GA by about 30% because CF does not consider the influence diffusion process.


Figure 7.4 A comparison of ISST on a specific topic.


Figure 7.5 A comparison of influence spread of different algorithms.

Figure 7.5 compares the IS(S) of various algorithms in the IC model. IS(S) does not take user preferences into account. As a result, the performance of GAUP is not as good as GA because GA can find most influential nodes across all topics. GAUP performs significantly better than random and CF because GAUP can find most influential nodes in a specific topic. CF does not perform well because the influence diffusion process is not used.

7.3.4.4  ISST versus IS

We study the effectiveness of ISST and IS when considering users' topic preferences. In this experiment, we first use the GAUP algorithm to generate two seed sets, comm and kdd, for the topics SIGCOMM and SIGKDD, respectively, and then compute their IS and ISST values. Table 7.8 shows that as K increases, IS(comm) and IS(kdd) become very close. In contrast, Table 7.9 lists the ratios of the ISST of comm over the ISST of kdd for the two topics. For topic SIGCOMM, ISST(comm,SIGCOMM) is significantly larger than ISST(kdd,SIGCOMM), more than two times larger when K ≥ 15. On the other hand, ISST(comm,SIGKDD) is only about one-third of ISST(kdd,SIGKDD) when K ≥ 15. These results demonstrate that ISST effectively measures influence spread on specific topics and should be used instead of the traditional IS when user preferences are considered.

7.3.4.5  Expert Search

This experiment studies whether the GAUP algorithm can find experts in a specific domain. Again, we choose SIGCOMM as the topic. Table 7.10 shows the seed sets obtained by GAUP, GA, CF, and HITS when K = 5. None of the authors selected by GA is in the networking field, because GA does not take domain preference into account and tends to find experts in other domains. In contrast, GAUP, CF, and HITS successfully find networking experts; we verified that all these people have published in well-known networking conferences. Except for Jennifer Rexford, the four other experts selected by GAUP and by CF are all different. James F. Kurose appears in both GAUP and HITS. Joseph Naor publishes many papers at networking conferences but none in SIGCOMM. This indicates that our algorithm can effectively find domain experts.

Table 7.8 IS(comm) and IS(kdd)

Seed Set   K = 5   K = 10   K = 15   K = 20
comm       11      30       47       86
kdd        30      41       56       89

Table 7.9 Ratios of ISST of comm over ISST of kdd for Different Topics

Topic     K = 5   K = 10   K = 15   K = 20
SIGCOMM   1.67    2.47     2.91     2.81
SIGKDD    0.13    0.16     0.31     0.27

Table 7.10 The Seed Sets of GAUP, GA, CF, and HITS When K = 5

GAUP                 GA                   CF                  HITS
James F. Kurose      Wei-Ying Ma          Simon S. Lam        Donald F. Towsley
Jennifer Rexford     Philip S. Yu         Jennifer Rexford    James F. Kurose
Joseph Naor          Wen Gao              Vishal Misra        Zhen Liu
Deborah Estrin       Mahmut T. Kandemir   Donald F. Towsley   Weibo Gong
Thomas E. Anderson   Thomas S. Huang      Lixia Zhang         Vishal Misra

To further study the effectiveness of GAUP, CF, and HITS, we designed the following experiments to compare GAUP and CF and to demonstrate the topic drift problem of HITS.

7.3.4.6  GAUP versus CF

The GAUP and CF results in Table 7.10 make it hard to judge which set of users is better. To address this problem, we design the following experiment. For the field of networks, we choose two topics, SIGCOMM and ICCCN, and run GAUP and CF to find the top 5, 10, 15, and 20 experts for each. Finally, we count the experts found for both topics.

Figure 7.6 illustrates the results of the two algorithms. For the top 5, 10, and 15 experts, CF finds no expert that appears under both topics, whereas GAUP discovers 2, 4, and 5 overlapping experts, respectively. For the top 20, CF finds only one overlapping expert, while GAUP finds 7. This experiment shows that GAUP can find experts common to different subtopics of a domain; such experts are more likely to be the most influential ones in the domain as a whole. Because the CF algorithm ignores the influence spread process and only matches experts to specific topics, its results are not as good as those of GAUP.
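Counting these overlaps is simple once each algorithm exposes a ranked author list per topic. A minimal sketch follows; the function and variable names are hypothetical, and the example counts are the GAUP numbers reported above.

    def topk_overlap(rank_a, rank_b, ks=(5, 10, 15, 20)):
        # Count experts appearing in the top-K lists of both topics.
        return {k: len(set(rank_a[:k]) & set(rank_b[:k])) for k in ks}

    # Hypothetical usage with GAUP's per-topic rankings:
    # topk_overlap(gaup_sigcomm, gaup_icccn) -> {5: 2, 10: 4, 15: 5, 20: 7}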

7.3.4.7  Topic Drift of HITS

The effectiveness of HITS depends on the quality of the initial set of nodes. Previous work [46,47] has shown that HITS may suffer from topic drift. To illustrate this problem, we run the HITS algorithm with three different seed sets for topic SIGCOMM:

■  20-set. We select 20 users from the experts returned by GAUP and CF for topic SIGCOMM.

■  30-set. We add 10 new users to the 20-set, some of whom are not in the field of networking.

■  31-set. We add one more user, Wen Gao, to the 30-set.

Figure 7.6 A comparison of GAUP and CF for finding experts in both ICCCN and SIGCOMM.

Table 7.11 Topic Drift of the HITS Algorithm Using Three Different Seed Sets

20-Set               30-Set                31-Set
Donald F. Towsley    Donald F. Towsley     Wen Gao
James F. Kurose      James F. Kurose       Xilin Chen
Zhen Liu             Zhen Liu              Shiguang Shan
Weibo Gong           Krithi Ramamritham    Qingming Huang
Vishal Misra         Weibo Gong            Debin Zhao

Table 7.11 shows the top five experts found by the HITS algorithm for each of the three seed sets. The results for the 20-set and the 30-set differ by only one person, and all of them are well-known networking researchers. However, none of the results for the 31-set is in the networking field. This is because Wen Gao and his neighboring nodes are densely connected in the graph expanded by HITS; as a result, the top-ranked authorities and hubs drift away from the original topic of SIGCOMM.

This experiment demonstrates that the quality of HITS depends heavily on the seed set. A direct comparison of HITS with our GAUP is difficult because GAUP does not require such seeds. We emphasize that GAUP is more reliable than HITS for finding experts because it incorporates user preferences into the influence model.
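For reference, the core of HITS [45] is a mutual-reinforcement iteration between hub and authority scores over a base graph expanded from the seed (root) set, which makes the seed dependence concrete: whatever neighborhood the seeds pull into the base graph determines which authorities can rank highly. Below is a textbook sketch of that iteration on an already-expanded directed graph, not the exact implementation evaluated here.

    import math

    def hits(graph, iters=50):
        # Basic HITS iteration (Kleinberg [45]) on a directed graph given
        # as a dict node -> list of out-neighbors, assumed to be the base
        # subgraph already expanded from a seed set.
        nodes = set(graph) | {v for outs in graph.values() for v in outs}
        auth = {n: 1.0 for n in nodes}
        hub = {n: 1.0 for n in nodes}
        for _ in range(iters):
            # authority score: sum of hub scores of in-neighbors
            new_auth = {n: 0.0 for n in nodes}
            for u, outs in graph.items():
                for v in outs:
                    new_auth[v] += hub[u]
            # hub score: sum of (updated) authority scores of out-neighbors
            new_hub = {n: 0.0 for n in nodes}
            for u, outs in graph.items():
                for v in outs:
                    new_hub[u] += new_auth[v]
            # L2-normalize so the scores stay bounded
            a = math.sqrt(sum(x * x for x in new_auth.values())) or 1.0
            h = math.sqrt(sum(x * x for x in new_hub.values())) or 1.0
            auth = {n: x / a for n, x in new_auth.items()}
            hub = {n: x / h for n, x in new_hub.items()}
        return auth, hub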

Further Readings

Recommender Systems Handbook

http://www.springer.com/computer/ai/book/978-0-387-85819-7

F. Ricci, L. Rokach, B. Shapira, and P. B. Kantor, Eds., Recommender Systems Handbook. New York, NY: Springer, 2011.

This book illustrates the technologies of recommendation systems behind well-known corporations such as Amazon, Yahoo!, Google, Microsoft, and AT&T.

Networks, Crowds, and Markets: Reasoning About a Highly Connected World

http://www.cs.cornell.edu/home/kleinber/networks-book/

D. Easley and J. Kleinberg. Networks, Crowds, and Markets: Reasoning about a Highly Connected World, 1st ed., Cambridge University Press: New York, 2010.

This book covers a remarkable range of topics. By combining graph theory, probability and statistics, microeconomics, and facets of the social sciences, the book offers a broad new vision of how networks work.

WEKA

http://www.cs.waikato.ac.nz/ml/weka/

WEKA is an open source Java library that contains a collection of machine learning algorithms for data mining tasks.

SNAP

http://snap.stanford.edu/data/index.html

SNAP (Stanford Network Analysis Platform) is an open source library for analyzing and processing large networks. The Stanford Large Network Dataset Collection, hosted at this site, provides social network, road network, Web graph, and online community datasets.

References

1.  G. Linden, B. Smith, and J. York, Amazon.com recommendations: Item-to-item collaborative filtering, IEEE Internet Computing, vol. 7, no. 1, pp. 76–80, 2003.

2.  A. Das, M. Datar, and A. Garg, Google news personalization: Scalable online collaborative filtering, in Proceedings of the 16th International Conference on World Wide Web, 2007, pp. 271–280.

3.  T. Chen, W.-L. Han, H.-D. Wang, Y.-X. Zhou, B. Xu, and B.-Y. Zang, Content recommendation system based on private dynamic user profile, in Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, 2007, pp. 2112–2118.

4.  X. Su and T. M. Khoshgoftaar, A survey of collaborative filtering techniques, Advances in Artificial Intelligence, vol. 2009, 2009.

5.   B. Sarwar, G. Karypis, J. Konstan, and J. Riedl, Item-based collaborative filtering recommendation algorithms, in Proceedings of the 10th International Conference on World Wide Web, 2001, pp. 285–295.

6.  J. Wang, A. P. de Vries, and M. J. T. Reinders, Unifying user-based and item-based collaborative filtering approaches by similarity fusion, in SIGIR ’06, 2006, pp. 501–508.

7.  G.-R. Xue, C. Lin, Q. Yang, W. Xi, H.-J. Zeng, Y. Yu, and Z. Chen, Scalable collaborative filtering using cluster-based smoothing, in SIGIR ’05, 2005, pp. 114–121.

8.  D. Zhang, J. Cao, J. Zhou, M. Guo, and V. Raychoudhury, An efficient collaborative filtering approach using smoothing and fusing, in ICPP, 2009, pp. 558–565.

9.  T. Zhang and V. S. Iyengar, Recommender systems using linear classifiers, The Journal of Machine Learning Research, vol. 2, pp. 313–334, 2002.

10.  Y. Zhang and J. Koren, Efficient Bayesian hierarchical user modeling for recommendation system, in SIGIR ’07, 2007, pp. 47–54.

11.  Y. Koren, R. Bell, and C. Volinsky, Matrix factorization techniques for recommender systems, IEEE Computer, vol. 42, no. 8, pp. 42–49, 2009.

12.  R. Salakhutdinov and A. Mnih, Probabilistic matrix factorization, in NIPS, 2008.

13.  Comparison of feed aggregators. [Online]. Available from: http://en.wikipedia.org/wiki/Comparison_of_feed_aggregators, last modified on April 1, 2016.

14.  A. P. O’Riordan and M. O. O’Mahony, InterSynd: A web syndication intermediary that makes recommendations, in iiWAS ’08: Proceedings of the 10th International Conference on Information Integration and Web-based Applications & Services, 2008.

15.  J. J. Samper, P. A. Castillo, L. Araujo, J. J. Merelo, and F. Tricas, NectaRSS, an intelligent RSS feed reader, Journal of Network and Computer Applications, vol. 31, no. 4, pp. 793–806, 2008.

16.  YUI Library. [Online]. Available from: http://developer.yahoo.com/yui/, accessed on March 7, 2016.

17.  Google Reader Help. [Online]. Available from: http://www.google.com/support/reader/?hl=en, accessed on August 18, 2010.

18.  PostRank, Inc., 2010. [Online]. Available from: http://www.postrank.com/, accessed on August 18, 2010.

19.  C. D. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, Cambridge, UK, 2008.

20.  H. Ma, I. King, and M. R. Lyu, Effective missing data prediction for collaborative filtering, in Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2007, pp. 39–46.

21.  GroupLens Lab, Social Computing Research at the University of Minnesota. [Online]. Available from: http://grouplens.org, accessed on April 10, 2016.

22.  J. L. Herlocker, J. A. Konstan, L. G. Terveen, and J. T. Riedl, Evaluating collaborative filtering recommender systems, ACM Transactions on Information Systems (TOIS), vol. 22, no. 1, pp. 5–53, 2004.

23.  R. Jin, J. Y. Chai, and L. Si, An automatic weighting scheme for collaborative filtering, in SIGIR ’04, 2004, pp. 337–344.

24.  T. Hofmann, Latent semantic models for collaborative filtering, ACM Transactions on Information Systems (TOIS), vol. 22, no. 1, pp. 89–115, 2004.

25.  D. M. Pennock, E. Horvitz, S. Lawrence, and C. L. Giles, Collaborative filtering by personality diagnosis: A hybrid memory and model-based approach, in UAI ’00, 2000, pp. 473–480.

26.   Y. Wang, G. Cong, G. Song, and K. Xie, Community-based greedy algorithm for mining top-K influential nodes in mobile social networks, in KDD, 2010, pp. 1039–1048.

27.  M. Kimura and K. Saito, Tractable models for information diffusion in social networks, in PKDD, LNAI 4213, 2006, pp. 259–271.

28.  J. Hartline, V. S. Mirrokni, and M. Sundararajan, Optimal marketing strategies over social networks, in Proceedings of the 17th International Conference on World Wide Web (WWW). Beijing, China 2008, pp. 189–198.

29.  H. Ma, H. Yang, M. R. Lyu, and I. King, Mining social networks using heat diffusion processes for marketing candidates selection, in Proceeding of the 17th ACM Conference on Information and Knowledge Management (CIKM), 2008, pp. 233–242.

30.  J. Leskovec, A. Krause, C. Guestrin, C. Faloutsos, J. VanBriesen, and N. Glance, Cost-effective outbreak detection in networks, in KDD, 2007.

31.  C. S. Campbell, P. P. Maglio, A. Cozzi, and B. Dom, Expertise identification using email communications, in CIKM, 2003, pp. 528–531.

32.  J. Zhang, M. S. Ackerman, and L. Adamic, Expertise networks in online communities: Structure and algorithms, in Proceedings of the 16th International Conference on World Wide Web (WWW). Banff, Alberta, Canada, 2007, pp. 221–230.

33.  D. Kempe, J. Kleinberg, and É. Tardos, Maximizing the spread of influence through a social network, in Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2003, pp. 137–146.

34.  P. Domingos and M. Richardson, Mining the network value of customers, in Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2001, pp. 57–66.

35.  M. Richardson and P. Domingos, Mining knowledge-sharing sites for viral marketing, in Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2002, pp. 61–70.

36.  W. Chen, Y. Wang, and S. Yang, Efficient influence maximization in social networks, in Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2009, pp. 199–208.

37.  K. Saito, M. Kimura, K. Ohara, and H. Motoda, Learning continuous-time information diffusion model for social behavioral data analysis, in Proceedings of the 1st Asian Conference on Machine Learning: Advances in Machine Learning (ACML), 2009, pp. 322–337.

38.  K. Saito, M. Kimura, K. Ohara, and H. Motoda, Behavioral analyses of information diffusion models by observed data of social network, in Proceedings of the 2010 International Conference on Social Computing, Behavioral Modeling, Advances in Social Computing Prediction (SBP10), 2010, pp. 149–158.

39.  J. Tang, J. Sun, C. Wang, and Z. Yang, Social influence analysis in large-scale networks, in Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2009, pp. 807–816.

40.  S. Deerwester, S. Dumais, G. Furnas, T. Landauer, and R. Harshman, Indexing by latent semantic analysis, Journal of the American Society for Information Science, vol. 41, no. 6, pp. 391–407, 1990.

41.  J. Cadzow, SVD representation of unitarily invariant matrices, IEEE Transactions on Acoustics, Speech and Signal Processing, vol. 32, no. 3, pp. 512–516, 1984.

42.  D. Gleich and L. Zhukov, SVD based term suggestion and ranking system, in Proceedings of the Fourth IEEE International Conference on Data Mining, 2004, pp. 391–394.

43.  G. Cornuejols, M. Fisher, and G. Nemhauser, Location of bank accounts to optimize float, Management Science, vol. 23, no. 8, pp. 789–810, 1977.

44.   G. Nemhauser, L. Wolsey, and M. Fisher, An analysis of the approximations for maximizing submodular set functions, Mathematical Programming, vol. 14, pp. 265–294, 1978.

45.  J. M. Kleinberg, Authoritative sources in a hyperlinked environment, Journal of the ACM, vol. 46, no. 5, pp. 604–632, 1999.

46.  K. Bharat and M. R. Henzinger, Improved algorithms for topic distillation in a hyperlinked environment, in SIGIR, 1998, pp. 104–111.

47.  L. Li, Y. Shang, and W. Zhang, Improvement of HITS-based algorithms on web documents, in Proceedings of the 11th International Conference on World Wide Web (WWW). Honolulu, Hawaii, USA, 2002, pp. 527–535.
