To rank the importance of the online reviews, we have implemented the PageRank algorithm (see Chapter 4, Web Mining Techniques, in the Ranking: PageRank algorithm section) in the application. The pgrank.py file, in the pgrank folder within the webmining_server folder, implements the algorithm as follows:
from pages.models import Page, SearchTerm

num_iterations = 100000
eps = 0.0001
D = 0.85

def pgrank(searchid):
    s = SearchTerm.objects.get(id=int(searchid))
    links = s.links.all()
    from_idxs = [i.from_id for i in links]
    # Find the idxs that receive page rank
    links_received = []
    to_idxs = []
    for l in links:
        from_id = l.from_id
        to_id = l.to_id
        if from_id not in from_idxs:
            continue
        if to_id not in from_idxs:
            continue
        links_received.append([from_id, to_id])
        if to_id not in to_idxs:
            to_idxs.append(to_id)
    # Initialize the ranks with the values stored from the previous run
    prev_ranks = dict()
    for node in from_idxs:
        ptmp = Page.objects.get(id=node)
        prev_ranks[node] = ptmp.old_rank
    conv = 1.
    cnt = 0
    # Iterate until convergence (conv <= eps) or num_iterations is reached
    while conv > eps and cnt < num_iterations:
        next_ranks = dict()
        for (node, old_rank) in prev_ranks.items():
            next_ranks[node] = 0.0
        # Find the outbound links and send the page rank down to each of them
        for (node, old_rank) in prev_ranks.items():
            give_idxs = []
            for (from_id, to_id) in links_received:
                if from_id != node:
                    continue
                if to_id not in to_idxs:
                    continue
                give_idxs.append(to_id)
            if len(give_idxs) < 1:
                continue
            amount = D * old_rank / len(give_idxs)
            for id in give_idxs:
                next_ranks[id] += amount
        # Add the teleportation term (1 - D) / N to every node
        const = (1 - D) / len(next_ranks)
        for node in next_ranks:
            next_ranks[node] += const
        # Measure convergence as the mean absolute change in rank
        difftot = 0
        for (node, old_rank) in prev_ranks.items():
            new_rank = next_ranks[node]
            difftot += abs(old_rank - new_rank)
        conv = difftot / len(prev_ranks)
        cnt += 1
        prev_ranks = next_ranks
    # Save the final scores on the Page objects
    for (id, new_rank) in next_ranks.items():
        ptmp = Page.objects.get(id=id)
        ptmp.old_rank = ptmp.new_rank
        ptmp.new_rank = new_rank
        ptmp.save()
This code takes all the links stored in association with the given SearchTerm object and computes the PageRank score of each page i at iteration t, where P_t(i) is given by the recursive equation:

P_t(i) = (1 - D)/N + D * Σ_j A_ji * P_{t-1}(j)

Here, N is the total number of pages, and A_ji = 1/N_j (where N_j is the number of outbound links of page j) if page j points to i; otherwise, A_ji is 0. The parameter D is the so-called damping factor (set to 0.85 in the preceding code), and it represents the probability of following the transitions given by the transition matrix A. The equation is iterated until the convergence parameter eps is satisfied or the maximum number of iterations, num_iterations
, is reached. The algorithm is called by clicking either scrape and calculate page rank (may take a long time) or calculate page rank links at the bottom of the home.html page, after the sentiment of the movie reviews has been displayed. The link invokes the pgrank_view function in views.py (through the URL declared in urls.py: url(r'^pg-rank/(?P<pk>\d+)/', 'webmining_server.views.pgrank_view', name='pgrank_view')):
def pgrank_view(request, pk):
    context = {}
    get_data = request.GET
    scrape = get_data.get('scrape', 'False')
    s = SearchTerm.objects.get(id=pk)
    if scrape == 'True':
        pages = s.pages.all().filter(review=True)
        urls = []
        for u in pages:
            urls.append(u.url)
        # Crawl the pages linked from each review
        cmd = 'cd ../scrapy_spider && scrapy crawl scrapy_spider_recursive -a url_list=%s -a search_id=%s' % ('"' + str(','.join(urls[:]).encode('utf-8')) + '"', '"' + str(pk) + '"')
        os.system(cmd)
    links = s.links.all()
    if len(links) == 0:
        context['no_links'] = True
        return render_to_response(
            'movie_reviews/pg-rank.html', RequestContext(request, context))
    # Calculate the PageRank scores
    pgrank(pk)
    # Load the pages in descending order of PageRank
    pages_ordered = s.pages.all().filter(review=True).order_by('-new_rank')
    context['pages'] = pages_ordered
    return render_to_response(
        'movie_reviews/pg-rank.html', RequestContext(request, context))
This code calls the crawler to collect all the pages linked from the reviews and then calculates the PageRank scores using the code discussed earlier. The scores are then displayed on the pg-rank.html page (in descending order of PageRank score), as we showed in the Application usage overview section of this chapter. Since this function can take a long time to process (it may crawl thousands of pages), the run_scrapelinks.py command has been written to run the Scrapy crawler separately (the reader is invited to read or modify that script as an exercise).
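To see the damped update rule in isolation, the following is a minimal, self-contained sketch of the same iteration on a hypothetical three-page link graph. The page ids and links here are invented for illustration, and no Django models are involved; the constants D, eps, and num_iterations match the application code:

```python
D = 0.85      # damping factor, as in the application code
eps = 0.0001  # convergence threshold
num_iterations = 100000

def toy_pgrank(links, nodes):
    """links: list of (from_id, to_id) pairs; nodes: list of page ids."""
    N = len(nodes)
    ranks = {n: 1.0 / N for n in nodes}  # start from a uniform distribution
    conv, cnt = 1.0, 0
    while conv > eps and cnt < num_iterations:
        next_ranks = {n: 0.0 for n in nodes}
        for node in nodes:
            out = [t for (f, t) in links if f == node]
            if not out:
                continue
            # Each outbound neighbor receives D * P(j) / N_j
            amount = D * ranks[node] / len(out)
            for t in out:
                next_ranks[t] += amount
        # Teleportation term (1 - D) / N, added to every page
        const = (1 - D) / N
        for n in next_ranks:
            next_ranks[n] += const
        # Mean absolute change in rank, as in the application code
        conv = sum(abs(ranks[n] - next_ranks[n]) for n in nodes) / N
        ranks = next_ranks
        cnt += 1
    return ranks

# Toy graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
# Page 3 receives links from both other pages and collects the most rank.
ranks = toy_pgrank([(1, 2), (1, 3), (2, 3), (3, 1)], [1, 2, 3])
```

Since every page in this toy graph has at least one outbound link, the scores keep summing to 1 at each iteration, which makes it easy to check the implementation by hand.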