To rank the importance of the online reviews, we have implemented the PageRank algorithm (see Chapter 4, Web Mining Techniques, in the Ranking: PageRank algorithm section) in the application. The pgrank.py file, in the pgrank folder within the webmining_server folder, implements the algorithm as follows:
from pages.models import Page, SearchTerm

num_iterations = 100000
eps = 0.0001
D = 0.85

def pgrank(searchid):
    s = SearchTerm.objects.get(id=int(searchid))
    links = s.links.all()
    from_idxs = [i.from_id for i in links]
    # Find the idxs that receive page rank
    links_received = []
    to_idxs = []
    for l in links:
        from_id = l.from_id
        to_id = l.to_id
        if from_id not in from_idxs:
            continue
        if to_id not in from_idxs:
            continue
        links_received.append([from_id, to_id])
        if to_id not in to_idxs:
            to_idxs.append(to_id)
    # Initialize the ranks with the values stored from the previous run
    prev_ranks = dict()
    for node in from_idxs:
        ptmp = Page.objects.get(id=node)
        prev_ranks[node] = ptmp.old_rank
    conv = 1.
    cnt = 0
    # Iterate until convergence (conv <= eps) or num_iterations is reached
    while conv > eps and cnt < num_iterations:
        next_ranks = dict()
        for (node, old_rank) in prev_ranks.items():
            next_ranks[node] = 0.0
        # Find the outbound links and send the page rank down to each of them
        for (node, old_rank) in prev_ranks.items():
            give_idxs = []
            for (from_id, to_id) in links_received:
                if from_id != node:
                    continue
                if to_id not in to_idxs:
                    continue
                give_idxs.append(to_id)
            if len(give_idxs) < 1:
                continue
            amount = D * old_rank / len(give_idxs)
            for id in give_idxs:
                next_ranks[id] += amount
        # Add the teleportation term (1 - D) / N to every node
        const = (1 - D) / len(next_ranks)
        for node in next_ranks:
            next_ranks[node] += const
        # Measure convergence as the mean absolute change in rank
        difftot = 0
        for (node, old_rank) in prev_ranks.items():
            new_rank = next_ranks[node]
            difftot += abs(old_rank - new_rank)
        conv = difftot / len(prev_ranks)
        cnt += 1
        prev_ranks = next_ranks
    # Save the final scores on the Page objects
    for (id, new_rank) in next_ranks.items():
        ptmp = Page.objects.get(id=id)
        ptmp.old_rank = ptmp.new_rank
        ptmp.new_rank = new_rank
        ptmp.save()
This code takes all the links stored in association with the given SearchTerm object and computes the PageRank score of each page i at iteration t, where P_t(i) is given by the recursive equation:

P_t(i) = (1 - D)/N + D * Σ_j A_ji * P_{t-1}(j)

Here, N is the total number of pages, and A_ji = 1/N_j (where N_j is the number of outbound links of page j) if page j points to i; otherwise, A_ji is 0. The parameter D is the so-called damping factor (set to 0.85 in the preceding code), and it represents the probability of following the transitions given by the transition matrix A. The equation is iterated until the convergence parameter eps is satisfied or the maximum number of iterations, num_iterations
, is reached. The algorithm is called by clicking either scrape and calculate page rank (may take a long time) or calculate page rank links at the bottom of the home.html page, after the sentiment of the movie reviews has been displayed. The link invokes the pgrank_view function in views.py (through the URL declared in urls.py: url(r'^pg-rank/(?P<pk>\d+)/', 'webmining_server.views.pgrank_view', name='pgrank_view')):
def pgrank_view(request, pk):
    context = {}
    get_data = request.GET
    scrape = get_data.get('scrape', 'False')
    s = SearchTerm.objects.get(id=pk)
    if scrape == 'True':
        pages = s.pages.all().filter(review=True)
        urls = []
        for u in pages:
            urls.append(u.url)
        # Crawl the pages linked from each review
        cmd = 'cd ../scrapy_spider && scrapy crawl scrapy_spider_recursive -a url_list=%s -a search_id=%s' % ('"' + str(','.join(urls[:]).encode('utf-8')) + '"', '"' + str(pk) + '"')
        os.system(cmd)
    links = s.links.all()
    if len(links) == 0:
        context['no_links'] = True
        return render_to_response(
            'movie_reviews/pg-rank.html', RequestContext(request, context))
    # Calculate the PageRank scores
    pgrank(pk)
    # Load the pages in descending order of PageRank
    pages_ordered = s.pages.all().filter(review=True).order_by('-new_rank')
    context['pages'] = pages_ordered
    return render_to_response(
        'movie_reviews/pg-rank.html', RequestContext(request, context))
This code calls the crawler to collect all the pages linked from the reviews and then calculates the PageRank scores using the code discussed earlier. The scores are then displayed on the pg-rank.html page (in descending order of PageRank score), as we showed in the Application usage overview section of this chapter. Since this function can take a long time to process (it may crawl thousands of pages), the run_scrapelinks.py command has been written to run the Scrapy crawler separately (the reader is invited to read or modify that script as an exercise).
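To see the damped update rule in isolation, the following is a minimal, self-contained sketch of the same iteration on a hypothetical three-page link graph. The page ids and links here are invented for illustration, and no Django models are involved; the constants D, eps, and num_iterations match the application code:

```python
D = 0.85      # damping factor, as in the application code
eps = 0.0001  # convergence threshold
num_iterations = 100000

def toy_pgrank(links, nodes):
    """links: list of (from_id, to_id) pairs; nodes: list of page ids."""
    N = len(nodes)
    ranks = {n: 1.0 / N for n in nodes}  # start from a uniform distribution
    conv, cnt = 1.0, 0
    while conv > eps and cnt < num_iterations:
        next_ranks = {n: 0.0 for n in nodes}
        for node in nodes:
            out = [t for (f, t) in links if f == node]
            if not out:
                continue
            # Each outbound neighbor receives D * P(j) / N_j
            amount = D * ranks[node] / len(out)
            for t in out:
                next_ranks[t] += amount
        # Teleportation term (1 - D) / N, added to every page
        const = (1 - D) / N
        for n in next_ranks:
            next_ranks[n] += const
        # Mean absolute change in rank, as in the application code
        conv = sum(abs(ranks[n] - next_ranks[n]) for n in nodes) / N
        ranks = next_ranks
        cnt += 1
    return ranks

# Toy graph: 1 -> 2, 1 -> 3, 2 -> 3, 3 -> 1
# Page 3 receives links from both other pages and collects the most rank.
ranks = toy_pgrank([(1, 2), (1, 3), (2, 3), (3, 1)], [1, 2, 3])
```

Since every page in this toy graph has at least one outbound link, the scores keep summing to 1 at each iteration, which makes it easy to check the implementation by hand.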