Integrating Django with Scrapy

To keep the paths simple to reference, we remove the outer scrapy_spider folder (the extra level that Scrapy creates around the project) so that, inside movie_reviews_analyzer_app, the webmining_server folder sits at the same level as the scrapy_spider folder:

├── db.sqlite3
├── scrapy.cfg
├── scrapy_spider
│   ├── ...
│   ├── spiders
│   │   ...
└── webmining_server

We set the Django project path in the Scrapy settings.py file:

# Setting up django's project full path.
import os
import sys
# BASE_DIR is assumed to be defined earlier in this settings.py and to point
# at the movie_reviews_analyzer_app folder (for example, derived with os.path.dirname).
sys.path.insert(0, BASE_DIR + '/webmining_server')
# Setting up django's settings module name.
os.environ['DJANGO_SETTINGS_MODULE'] = 'webmining_server.settings'
# Import django to load the models (otherwise AppRegistryNotReady: Models aren't loaded yet):
import django
django.setup()
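
With this in place, any module of the Scrapy project can use the Django ORM directly. As a quick, purely illustrative check (not part of the original project), the models of the pages app can be imported and queried from a Scrapy component:

# e.g. at the top of a spider or pipeline module of the scrapy_spider project
from pages.models import SearchTerm

# count the search terms already stored in the Django database
print 'search terms stored so far:', SearchTerm.objects.count()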

Now we can install the library that will allow managing Django models from Scrapy:

sudo pip install scrapy-djangoitem

In the items.py file, we write the links between Django models and Scrapy items as follows:

from scrapy_djangoitem import DjangoItem
from pages.models import Page, Link, SearchTerm

class SearchItem(DjangoItem):
    django_model = SearchTerm

class PageItem(DjangoItem):
    django_model = Page

class LinkItem(DjangoItem):
    django_model = Link

Each class inherits from DjangoItem, so the Django model declared in the django_model attribute is automatically linked to the corresponding Scrapy item. The Scrapy project is now complete, so we can move on to the Django code that handles the data extracted by Scrapy and the Django commands needed to manage the application.
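
A DjangoItem also exposes a save() method that persists the item through the Django ORM. The following pipeline is only a sketch (the class name, module path, and the ITEM_PIPELINES entry are illustrative, not taken from the original project) of how the scraped items could be written to the database:

# scrapy_spider/pipelines.py (illustrative sketch)
from scrapy_spider.items import PageItem, SearchItem, LinkItem


class DjangoWriterPipeline(object):

    def process_item(self, item, spider):
        # DjangoItem.save() builds the corresponding Django model
        # instance and stores it through the ORM
        if isinstance(item, (PageItem, SearchItem, LinkItem)):
            item.save()
        return item

Such a pipeline would then be enabled in the Scrapy settings.py with something like ITEM_PIPELINES = {'scrapy_spider.pipelines.DjangoWriterPipeline': 300}.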

Commands (sentiment analysis model and delete queries)

The application needs to handle some operations that are not exposed to the end user of the service, such as building the sentiment analysis model and deleting a movie query so that it can be performed again instead of retrieving the existing data from memory. The following sections explain the commands that perform these actions.

Sentiment analysis model loader

The final goal of this application is to determine the sentiment (positive or negative) of the movie reviews. To achieve that, a sentiment classifier must be built using some external data, and then it should be stored in memory (cache) to be used by each query request. This is the purpose of the load_sentimentclassifier.py command displayed hereafter:

import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist
import collections
from django.core.management.base import BaseCommand, CommandError
from optparse import make_option
from django.core.cache import cache

stopwords = set(stopwords.words('english'))
method_selfeatures = 'best_words_features'

class Command(BaseCommand):
    option_list = BaseCommand.option_list + (
        make_option('-n', '--num_bestwords',
                    dest='num_bestwords', type='int',
                    action='store',
                    help=('number of words with high information')),)

    def handle(self, *args, **options):
        num_bestwords = options['num_bestwords']
        self.bestwords = self.GetHighInformationWordsChi(num_bestwords)
        clf = self.train_clf(method_selfeatures)
        cache.set('clf', clf)
        cache.set('bestwords', self.bestwords)

At the beginning of the file, the variable method_selfeatures sets the feature-selection method (in this case, the features are the words in the reviews; see Chapter 4, Web Mining Techniques, for further details) used by train_clf to train the classifier. The maximum number of best words (features) is defined by the input parameter num_bestwords. The classifier and the best features (bestwords) are then stored in the cache (using the cache module), ready to be used by the application. The classifier and the methods that select the best words (features) are as follows:

    def train_clf(self, method):
        negidxs = movie_reviews.fileids('neg')
        posidxs = movie_reviews.fileids('pos')
        if method=='stopword_filtered_words_features':
            negfeatures = [(self.stopword_filtered_words_features(movie_reviews.words(fileids=[file])), 'neg') for file in negidxs]
            posfeatures = [(self.stopword_filtered_words_features(movie_reviews.words(fileids=[file])), 'pos') for file in posidxs]
        elif method=='best_words_features':
            negfeatures = [(self.best_words_features(movie_reviews.words(fileids=[file])), 'neg') for file in negidxs]
            posfeatures = [(self.best_words_features(movie_reviews.words(fileids=[file])), 'pos') for file in posidxs]
        elif method=='best_bigrams_words_features':
            negfeatures = [(self.best_bigrams_words_features(movie_reviews.words(fileids=[file])), 'neg') for file in negidxs]
            posfeatures = [(self.best_bigrams_words_features(movie_reviews.words(fileids=[file])), 'pos') for file in posidxs]

        trainfeatures = negfeatures + posfeatures
        clf = NaiveBayesClassifier.train(trainfeatures)
        return clf

    def stopword_filtered_words_features(self,words):
        return dict([(word, True) for word in words if word not in stopwords])

    #eliminate Low Information Features
    def GetHighInformationWordsChi(self,num_bestwords):
        word_fd = FreqDist()
        label_word_fd = ConditionalFreqDist()

        for word in movie_reviews.words(categories=['pos']):
            word_fd[word.lower()] +=1
            label_word_fd['pos'][word.lower()] +=1

        for word in movie_reviews.words(categories=['neg']):
            word_fd[word.lower()] +=1
            label_word_fd['neg'][word.lower()] +=1

        pos_word_count = label_word_fd['pos'].N()
        neg_word_count = label_word_fd['neg'].N()
        total_word_count = pos_word_count + neg_word_count

        word_scores = {}
        for word, freq in word_fd.iteritems():
            pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
                (freq, pos_word_count), total_word_count)
            neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
                (freq, neg_word_count), total_word_count)
            word_scores[word] = pos_score + neg_score

        best = sorted(word_scores.iteritems(), key=lambda (w,s): s, reverse=True)[:num_bestwords]
        bestwords = set([w for w, s in best])
        return bestwords

    def best_words_features(self,words):
        return dict([(word, True) for word in words if word in self.bestwords])
    
    def best_bigrams_words_features(self, words, measure=BigramAssocMeasures.chi_sq, nbigrams=200):
        bigram_finder = BigramCollocationFinder.from_words(words)
        bigrams = bigram_finder.nbest(measure, nbigrams)
        d = dict([(bigram, True) for bigram in bigrams])
        d.update(self.best_words_features(words))
        return d

Three methods to select the words (features) are defined in the preceding code:

  • stopword_filtered_words_features: Eliminates the stopwords using the Natural Language Toolkit (NLTK) stopword list (common words such as conjunctions and articles) and considers the remaining words as relevant
  • best_words_features: Using the chi-squared measure (NLTK library), the most informative words related to positive or negative reviews are selected (see Chapter 4, Web Mining Techniques, for further details)
  • best_bigrams_words_features: Uses the chi-squared measure (NLTK library) to find the 200 most informative bigrams from the set of words and combines them with the best single words (see Chapter 4, Web Mining Techniques, for further details)

The chosen classifier is the Naive Bayes algorithm (see Chapter 3, Supervised Machine Learning) and the labeled texts (positive and negative sentiment) are taken from the movie_reviews corpus in nltk.corpus. To install it, open a Python shell and download the corpus (the stopwords corpus used above can be fetched in the same way):

import nltk
nltk.download('movie_reviews')
nltk.download('stopwords')
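
With the corpus in place, the model can be built and cached by running the management command from the folder that contains manage.py; the value of num_bestwords below is only an example:

python manage.py load_sentimentclassifier --num_bestwords=10000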

Deleting an already performed query

Since we can specify different parameters (such as the feature-selection method, the number of best words, and so on), we may want to compute and store the sentiment of the reviews again with different values. The delete_query command is needed for this purpose and it is as follows:

from pages.models import Link,Page,SearchTerm
from django.core.management.base import BaseCommand, CommandError
from optparse import make_option

class Command(BaseCommand):
    option_list = BaseCommand.option_list + (
                make_option('-s', '--searchid',
                             dest='searchid', type='int',
                             action='store',
                             help=('id of the search term to delete')),)

    def handle(self, *args, **options):
         searchid = options['searchid']
         if searchid == None:
             print "please specify a searchid: python manage.py delete_query --searchid=ID"
             #list the existing search terms and their ids
             for sobj in SearchTerm.objects.all():
                 print 'id:',sobj.id,"  term:",sobj.term
         else:
             print 'delete...'
             search_obj = SearchTerm.objects.get(id=searchid)
             pages = search_obj.pages.all()
             pages.delete()
             links = search_obj.links.all()
             links.delete()
             search_obj.delete()

If we run the command without specifying the searchid (the ID of the query), the list of all the queries and related IDs will be shown. After that we can choose which query we want to delete by typing the following:

python manage.py delete_query --searchid=VALUE

We can use the cached sentiment analysis model to show the user the online sentiment of the chosen movie, as we explain in the following section.

Sentiment reviews analyser – Django views and HTML

Most of the code explained in this chapter (commands, Bing search engine, Scrapy, and Django models) is used by the analyzer function in views.py to power the home webpage shown in the Application usage overview section (after declaring the URL in the urls.py file as url(r'^$','webmining_server.views.analyzer')).
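
For reference, a minimal urls.py sketch for this declaration, assuming a Django version that still accepts string view references (as used throughout this book), could look like this:

from django.conf.urls import patterns, url

urlpatterns = patterns('',
    url(r'^$', 'webmining_server.views.analyzer'),
)

The analyzer function itself is as follows: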

def analyzer(request):
    context = {}

    if request.method == 'POST':
        post_data = request.POST
        query = post_data.get('query', None)
        if query:
            return redirect('%s?%s' % (reverse('webmining_server.views.analyzer'),
                                urllib.urlencode({'q': query})))   
    elif request.method == 'GET':
        get_data = request.GET
        query = get_data.get('q')
        if not query:
            return render_to_response(
                'movie_reviews/home.html', RequestContext(request, context))

        context['query'] = query
        stripped_query = query.strip().lower()
        urls = []
        
        if test_mode:
           urls = parse_bing_results()
        else:
           urls = bing_api(stripped_query)
           
        if len(urls)== 0:
           return render_to_response(
               'movie_reviews/noreviewsfound.html', RequestContext(request, context))
        if not SearchTerm.objects.filter(term=stripped_query).exists():
           s = SearchTerm(term=stripped_query)
           s.save()
           try:
               #scrape
               cmd = 'cd ../scrapy_spider && scrapy crawl scrapy_spider_reviews -a url_list=%s -a search_key=%s' %('"'+str(','.join(urls[:num_reviews]).encode('utf-8'))+'"','"'+str(stripped_query)+'"')
               os.system(cmd)
           except:
               print 'error!'
               s.delete()
        else:
           #collect the pages already scraped 
           s = SearchTerm.objects.get(term=stripped_query)
           
        #calc num pages
        pages = s.pages.all().filter(review=True)
        if len(pages) == 0:
           s.delete()
           return render_to_response(
               'movie_reviews/noreviewsfound.html', RequestContext(request, context))
               
        s.num_reviews = len(pages)
        s.save()
         
        context['searchterm_id'] = int(s.id)

        #train classifier with nltk
        def train_clf(method):
            ...           
        def stopword_filtered_words_features(words):
            ... 
        #Eliminate Low Information Features
        def GetHighInformationWordsChi(num_bestwords):
            ...            
        bestwords = cache.get('bestwords')
        if bestwords == None:
            bestwords = GetHighInformationWordsChi(num_bestwords)
        def best_words_features(words):
            ...       
        def best_bigrams_words_features(words, measure=BigramAssocMeasures.chi_sq, nbigrams=200):
            ...
        clf = cache.get('clf')
        if clf == None:
            clf = train_clf(method_selfeatures)

        cntpos = 0
        cntneg = 0
        for p in pages:
            words = p.content.split(" ")
            feats = best_words_features(words)#bigram_word_features(words)#stopword_filtered_word_feats(words)
            #print feats
            str_sent = clf.classify(feats)
            if str_sent == 'pos':
               p.sentiment = 1
               cntpos +=1
            else:
               p.sentiment = -1
               cntneg +=1
            p.save()

        context['reviews_classified'] = len(pages)
        context['positive_count'] = cntpos
        context['negative_count'] = cntneg
        context['classified_information'] = True
    return render_to_response(
        'movie_reviews/home.html', RequestContext(request, context))

The inserted movie title is stored in the query variable and sent to the bing_api function to collect the reviews' URLs. The URLs are then scraped by calling Scrapy to extract the review texts, which are classified using the clf model and the most informative words (bestwords) retrieved from the cache (or the model is trained again if the cache is empty). The counts of the predicted sentiments of the reviews (positive_count, negative_count, and reviews_classified) are then sent back to the home.html page (in the templates folder), which uses the following Google pie chart code:

        <h2 align = Center>Movie Reviews Sentiment Analysis</h2>
        <div class="row">
        <p align = Center><strong>Reviews Classified : {{ reviews_classified }}</strong></p>
        <p align = Center><strong>Positive Reviews : {{ positive_count }}</strong></p>
        <p align = Center><strong> Negative Reviews : {{ negative_count }}</strong></p>
        </div> 
  <section>
      <script type="text/javascript" src="https://www.google.com/jsapi"></script>
      <script type="text/javascript">
        google.load("visualization", "1", {packages:["corechart"]});
        google.setOnLoadCallback(drawChart);
        function drawChart() {
          var data = google.visualization.arrayToDataTable([
            ['Sentiment', 'Number'],
            ['Positive',     {{ positive_count }}],
            ['Negative',      {{ negative_count }}]
          ]);
          var options = { title: 'Sentiment Pie Chart'};
          var chart = new google.visualization.PieChart(document.getElementById('piechart'));
          chart.draw(data, options);
        }
      </script>
        <p align ="Center" id="piechart" style="width: 900px; height: 500px;display: block; margin: 0 auto;text-align: center;" ></p>
      </section>

The drawChart function calls the Google PieChart visualization, which takes the data (the positive and negative counts) as input to create the pie chart. For more details about how the HTML code interacts with the Django views, refer to the URL and views behind html web pages section in Chapter 6, Getting Started with Django. From the results page with the sentiment counts (see the Application usage overview section), the PageRank relevance of the scraped reviews can be calculated using one of the two links at the bottom of the page. The Django code behind this operation is discussed in the following section.
