To keep the paths simple to reference, we remove the outer scrapy_spider folder so that, inside movie_reviews_analyzer_app, the webmining_server folder sits at the same level as the scrapy_spider folder:
├── db.sqlite3
├── scrapy.cfg
├── scrapy_spider
│   ├── ...
│   ├── spiders
│   │   ...
└── webmining_server
We set the Django project path in the Scrapy settings.py file:
# Set up the Django project's full path.
import os
import sys
sys.path.insert(0, BASE_DIR + '/webmining_server')

# Set up the Django settings module name.
os.environ['DJANGO_SETTINGS_MODULE'] = 'webmining_server.settings'

# Import django and load the models
# (otherwise: AppRegistryNotReady: Models aren't loaded yet):
import django
django.setup()
Now we can install the library that allows managing Django models from Scrapy:
sudo pip install scrapy-djangoitem
In the items.py file, we define the links between the Django models and the Scrapy items as follows:
from scrapy_djangoitem import DjangoItem
from pages.models import Page, Link, SearchTerm

class SearchItem(DjangoItem):
    django_model = SearchTerm

class PageItem(DjangoItem):
    django_model = Page

class LinkItem(DjangoItem):
    django_model = Link
Each class inherits from the DjangoItem class, so the Django model declared in the django_model attribute is automatically linked to the item. The Scrapy project is now complete, so we can move on to the Django code that handles the data extracted by Scrapy and to the Django commands needed to manage the application.
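The field mapping that DjangoItem performs can be illustrated with a small self-contained sketch. The classes below (FakeField, FakePage, DjangoLikeItem) are hypothetical stand-ins for the real Scrapy and Django machinery; they only show the idea of deriving the allowed item keys from a model's field list:

```python
class FakeField(object):
    def __init__(self, name):
        self.name = name

class FakePage(object):
    """Stands in for the Django Page model: only the field list matters here."""
    _meta_fields = [FakeField('url'), FakeField('title'), FakeField('content')]

class DjangoLikeItem(dict):
    """Accepts only keys that match a field of the linked model."""
    django_model = None
    def __init__(self, **kwargs):
        allowed = set(f.name for f in self.django_model._meta_fields)
        for key, value in kwargs.items():
            if key not in allowed:
                raise KeyError('%s is not a field of the model' % key)
            self[key] = value

class PageItem(DjangoLikeItem):
    django_model = FakePage

item = PageItem(url='http://example.com', title='A review')
print(sorted(item.keys()))  # ['title', 'url']
```

The real DjangoItem also provides a save() method that persists the item as a model instance, which is what lets the spider write scraped pages straight into the Django database.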
The application needs to manage some operations that are not available to the final user of the service, such as defining a sentiment analysis model and deleting a movie query so that it can be performed again instead of retrieving the existing data from memory. The following sections explain the commands that perform these actions.
The final goal of this application is to determine the sentiment (positive or negative) of the movie reviews. To achieve that, a sentiment classifier must be trained on some external data and then stored in memory (the cache) so that it can be used by each query request. This is the purpose of the load_sentimentclassifier.py command shown here:
import nltk.classify.util, nltk.metrics
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews
from nltk.corpus import stopwords
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures
from nltk.probability import FreqDist, ConditionalFreqDist
import collections
from django.core.management.base import BaseCommand, CommandError
from optparse import make_option
from django.core.cache import cache

stopwords = set(stopwords.words('english'))
method_selfeatures = 'best_words_features'

class Command(BaseCommand):
    option_list = BaseCommand.option_list + (
        make_option('-n', '--num_bestwords',
                    dest='num_bestwords', type='int', action='store',
                    help=('number of words with high information')),)

    def handle(self, *args, **options):
        num_bestwords = options['num_bestwords']
        self.bestwords = self.GetHighInformationWordsChi(num_bestwords)
        clf = self.train_clf(method_selfeatures)
        cache.set('clf', clf)
        cache.set('bestwords', self.bestwords)
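The caching step at the end of handle can be sketched with a dictionary-backed stand-in for django.core.cache. The FakeCache class and the stored values are hypothetical; the real cache backend is whatever the Django settings configure:

```python
class FakeCache(dict):
    """Toy stand-in for django.core.cache's get/set interface."""
    def set(self, key, value):
        self[key] = value
    def get(self, key, default=None):
        return dict.get(self, key, default)

cache = FakeCache()
cache.set('bestwords', {'great', 'boring'})  # what the command would store
clf = cache.get('clf')  # not cached yet, so the caller would retrain
print(clf is None, 'bestwords' in cache)  # True True
```

On a cache miss (clf is None), the view retrains the classifier; on a hit, it reuses the stored object, which is why the command pre-warms the cache before any query arrives.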
At the beginning of the file, the variable method_selfeatures sets the feature-selection method (here, the features are the words in the reviews; see Chapter 4, Web Mining Techniques, for further details) used by the training function train_clf. The maximum number of best words (features) is set by the input parameter num_bestwords. The classifier and the best features (bestwords) are then stored in the cache (via the cache module), ready to be used by the application. The classifier and the methods that select the best words (features) are as follows:
def train_clf(self, method):
    negidxs = movie_reviews.fileids('neg')
    posidxs = movie_reviews.fileids('pos')
    if method == 'stopword_filtered_words_features':
        negfeatures = [(self.stopword_filtered_words_features(movie_reviews.words(fileids=[f])), 'neg') for f in negidxs]
        posfeatures = [(self.stopword_filtered_words_features(movie_reviews.words(fileids=[f])), 'pos') for f in posidxs]
    elif method == 'best_words_features':
        negfeatures = [(self.best_words_features(movie_reviews.words(fileids=[f])), 'neg') for f in negidxs]
        posfeatures = [(self.best_words_features(movie_reviews.words(fileids=[f])), 'pos') for f in posidxs]
    elif method == 'best_bigrams_words_features':
        negfeatures = [(self.best_bigrams_words_features(movie_reviews.words(fileids=[f])), 'neg') for f in negidxs]
        posfeatures = [(self.best_bigrams_words_features(movie_reviews.words(fileids=[f])), 'pos') for f in posidxs]
    trainfeatures = negfeatures + posfeatures
    clf = NaiveBayesClassifier.train(trainfeatures)
    return clf

def stopword_filtered_words_features(self, words):
    return dict([(word, True) for word in words if word not in stopwords])

# eliminate low-information features
def GetHighInformationWordsChi(self, num_bestwords):
    word_fd = FreqDist()
    label_word_fd = ConditionalFreqDist()
    for word in movie_reviews.words(categories=['pos']):
        word_fd[word.lower()] += 1
        label_word_fd['pos'][word.lower()] += 1
    for word in movie_reviews.words(categories=['neg']):
        word_fd[word.lower()] += 1
        label_word_fd['neg'][word.lower()] += 1
    pos_word_count = label_word_fd['pos'].N()
    neg_word_count = label_word_fd['neg'].N()
    total_word_count = pos_word_count + neg_word_count
    word_scores = {}
    for word, freq in word_fd.iteritems():
        pos_score = BigramAssocMeasures.chi_sq(label_word_fd['pos'][word],
                                               (freq, pos_word_count), total_word_count)
        neg_score = BigramAssocMeasures.chi_sq(label_word_fd['neg'][word],
                                               (freq, neg_word_count), total_word_count)
        word_scores[word] = pos_score + neg_score
    best = sorted(word_scores.iteritems(), key=lambda item: item[1],
                  reverse=True)[:num_bestwords]
    bestwords = set([w for w, s in best])
    return bestwords

def best_words_features(self, words):
    return dict([(word, True) for word in words if word in self.bestwords])

def best_bigrams_words_features(self, words, measure=BigramAssocMeasures.chi_sq,
                                nbigrams=200):
    bigram_finder = BigramCollocationFinder.from_words(words)
    bigrams = bigram_finder.nbest(measure, nbigrams)
    d = dict([(bigram, True) for bigram in bigrams])
    d.update(self.best_words_features(words))
    return d
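A quick toy walk-through of the three strategies on a single tokenized review may help. The stopwords and bestwords sets below are hardcoded hypothetical values, and the bigram step is simplified to plain adjacent word pairs rather than chi-square-ranked collocations:

```python
stopwords = {'the', 'was', 'a', 'of', 'and'}          # hypothetical stopword list
bestwords = {'great', 'boring', 'masterpiece'}        # hypothetical best words

def stopword_filtered_words_features(words):
    # keep every word that is not a stopword
    return dict((w, True) for w in words if w not in stopwords)

def best_words_features(words):
    # keep only the pre-selected informative words
    return dict((w, True) for w in words if w in bestwords)

def best_bigrams_words_features(words, nbigrams=200):
    # simplified: adjacent pairs instead of chi-square-ranked bigrams
    bigrams = list(zip(words, words[1:]))[:nbigrams]
    d = dict((b, True) for b in bigrams)
    d.update(best_words_features(words))
    return d

tokens = 'the movie was a boring mess'.split()
print(stopword_filtered_words_features(tokens))  # {'movie': True, 'boring': True, 'mess': True}
print(best_words_features(tokens))               # {'boring': True}
```

Note how best_words_features produces a much smaller feature dictionary: that is exactly the point of pruning low-information words before training.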
Three methods for selecting words are defined in the preceding code:
- stopword_filtered_words_features: eliminates the stopwords, using the Natural Language Toolkit (NLTK) stopword list, and considers the remaining words as relevant
- best_words_features: selects the words most informative about positive or negative reviews, using the chi-square measure from the NLTK library (see Chapter 4, Web Mining Techniques, for further details)
- best_bigrams_words_features: uses the chi-square measure (NLTK library) to find the 200 most informative bigrams from the set of words (see Chapter 4, Web Mining Techniques, for further details)

The chosen classifier is the Naive Bayes algorithm (see Chapter 3, Supervised Machine Learning), and the labeled texts (positive and negative sentiment) are taken from the movie_reviews corpus in nltk.corpus. To install it, open a Python terminal and download movie_reviews from corpus:
>>> import nltk
>>> nltk.download()  # in the downloader, select corpora/movie_reviews
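The idea behind GetHighInformationWordsChi can be seen on toy counts: score each word by how unevenly it occurs across the two classes and keep the top N. The per-class counts below are made up, and the score is a simple proxy for illustration, not NLTK's exact chi-square statistic:

```python
from collections import Counter

# hypothetical per-class word counts over a toy corpus
pos_counts = Counter({'great': 30, 'movie': 50, 'boring': 2})
neg_counts = Counter({'great': 3, 'movie': 48, 'boring': 25})

def informativeness(word):
    p, n = pos_counts[word], neg_counts[word]
    # close to 1 when a word is frequent in one class and rare in the other
    return abs(p - n) / float(p + n)

vocabulary = set(pos_counts) | set(neg_counts)
best = sorted(vocabulary, key=informativeness, reverse=True)[:2]
print(sorted(best))  # ['boring', 'great']
```

A neutral word like 'movie' appears roughly equally in both classes and scores near zero, so it is discarded; class-skewed words like 'boring' and 'great' survive as features.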
Since we can specify different parameters (such as the feature-selection method, the number of best words, and so on), we may want to recompute and store the sentiment of the reviews with different values. The delete_query command is needed for this purpose, and it is as follows:
from pages.models import Link, Page, SearchTerm
from django.core.management.base import BaseCommand, CommandError
from optparse import make_option

class Command(BaseCommand):
    option_list = BaseCommand.option_list + (
        make_option('-s', '--searchid',
                    dest='searchid', type='int', action='store',
                    help=('id of the search term to delete')),)

    def handle(self, *args, **options):
        searchid = options['searchid']
        if searchid == None:
            print "please specify searchid: python manage.py delete_query --searchid=--"
            # list the existing search terms
            for sobj in SearchTerm.objects.all():
                print 'id:', sobj.id, " term:", sobj.term
        else:
            print 'delete...'
            search_obj = SearchTerm.objects.get(id=searchid)
            pages = search_obj.pages.all()
            pages.delete()
            links = search_obj.links.all()
            links.delete()
            search_obj.delete()
If we run the command without specifying the searchid (the ID of the query), the list of all the queries and their IDs is shown. We can then choose which query to delete by typing the following:
python manage.py delete_query --searchid=VALUE
We can use the cached sentiment analysis model to show the user the online sentiment of the chosen movie, as we explain in the following section.
Most of the code explained in this chapter (the commands, the Bing search engine, Scrapy, and the Django models) is used in the analyzer function of views.py to power the home page shown in the Application usage overview section (after declaring the URL in the urls.py file as url(r'^$','webmining_server.views.analyzer')).
def analyzer(request):
    context = {}
    if request.method == 'POST':
        post_data = request.POST
        query = post_data.get('query', None)
        if query:
            return redirect('%s?%s' % (reverse('webmining_server.views.analyzer'),
                                       urllib.urlencode({'q': query})))
    elif request.method == 'GET':
        get_data = request.GET
        query = get_data.get('q')
        if not query:
            return render_to_response(
                'movie_reviews/home.html', RequestContext(request, context))

    context['query'] = query
    stripped_query = query.strip().lower()
    urls = []
    if test_mode:
        urls = parse_bing_results()
    else:
        urls = bing_api(stripped_query)
    if len(urls) == 0:
        return render_to_response(
            'movie_reviews/noreviewsfound.html', RequestContext(request, context))

    if not SearchTerm.objects.filter(term=stripped_query).exists():
        s = SearchTerm(term=stripped_query)
        s.save()
        try:
            # scrape
            cmd = 'cd ../scrapy_spider && scrapy crawl scrapy_spider_reviews -a url_list=%s -a search_key=%s' \
                  % ('"' + str(','.join(urls[:num_reviews]).encode('utf-8')) + '"',
                     '"' + str(stripped_query) + '"')
            os.system(cmd)
        except:
            print 'error!'
            s.delete()
    else:
        # collect the pages already scraped
        s = SearchTerm.objects.get(term=stripped_query)

    # calculate the number of review pages
    pages = s.pages.all().filter(review=True)
    if len(pages) == 0:
        s.delete()
        return render_to_response(
            'movie_reviews/noreviewsfound.html', RequestContext(request, context))
    s.num_reviews = len(pages)
    s.save()
    context['searchterm_id'] = int(s.id)

    # train the classifier with nltk
    def train_clf(method):
        ...

    def stopword_filtered_words_features(words):
        ...

    # eliminate low-information features
    def GetHighInformationWordsChi(num_bestwords):
        ...

    bestwords = cache.get('bestwords')
    if bestwords == None:
        bestwords = GetHighInformationWordsChi(num_bestwords)

    def best_words_features(words):
        ...

    def best_bigrams_words_features(words, measure=BigramAssocMeasures.chi_sq,
                                    nbigrams=200):
        ...

    clf = cache.get('clf')
    if clf == None:
        clf = train_clf(method_selfeatures)

    cntpos = 0
    cntneg = 0
    for p in pages:
        words = p.content.split(" ")
        feats = best_words_features(words)
        str_sent = clf.classify(feats)
        if str_sent == 'pos':
            p.sentiment = 1
            cntpos += 1
        else:
            p.sentiment = -1
            cntneg += 1
        p.save()

    context['reviews_classified'] = len(pages)
    context['positive_count'] = cntpos
    context['negative_count'] = cntneg
    context['classified_information'] = True
    return render_to_response(
        'movie_reviews/home.html', RequestContext(request, context))
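The final classify-and-count loop can be sketched in isolation with a stub classifier standing in for the cached NaiveBayesClassifier. The StubClassifier rule and the page texts are hypothetical:

```python
class StubClassifier(object):
    def classify(self, feats):
        # toy rule standing in for the trained Naive Bayes model
        return 'pos' if feats.get('great') else 'neg'

pages = ['great plot and great cast', 'boring and slow', 'great visuals']
clf = StubClassifier()
cntpos = cntneg = 0
for content in pages:
    feats = dict((w, True) for w in content.split(' '))
    if clf.classify(feats) == 'pos':
        cntpos += 1
    else:
        cntneg += 1
print(cntpos, cntneg)  # 2 1
```

The real view does the same thing, except that each page is a Django model instance whose sentiment field is saved back to the database after classification.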
The movie title entered by the user is stored in the query variable and passed to the bing_api function, which collects the URLs of the reviews. These URLs are then scraped by calling Scrapy to extract the review texts, which are processed with the clf classifier model and the most informative words (bestwords) retrieved from the cache (or regenerated if the cache is empty). The counts of the predicted review sentiments (positive_count, negative_count, and reviews_classified) are then sent back to the home.html page (in the templates folder), which uses the following Google pie chart code:
<h2 align="center">Movie Reviews Sentiment Analysis</h2>
<div class="row">
  <p align="center"><strong>Reviews Classified : {{ reviews_classified }}</strong></p>
  <p align="center"><strong>Positive Reviews : {{ positive_count }}</strong></p>
  <p align="center"><strong>Negative Reviews : {{ negative_count }}</strong></p>
</div>
<section>
  <script type="text/javascript" src="https://www.google.com/jsapi"></script>
  <script type="text/javascript">
    google.load("visualization", "1", {packages:["corechart"]});
    google.setOnLoadCallback(drawChart);
    function drawChart() {
      var data = google.visualization.arrayToDataTable([
        ['Sentiment', 'Number'],
        ['Positive', {{ positive_count }}],
        ['Negative', {{ negative_count }}]
      ]);
      var options = { title: 'Sentiment Pie Chart' };
      var chart = new google.visualization.PieChart(document.getElementById('piechart'));
      chart.draw(data, options);
    }
  </script>
  <p align="center" id="piechart"
     style="width: 900px; height: 500px; display: block; margin: 0 auto; text-align: center;"></p>
</section>
The drawChart function calls the Google PieChart visualization, which takes the data (the positive and negative counts) as input to create the pie chart. For more details on how the HTML code interacts with the Django views, refer to Chapter 6, Getting Started with Django, in the URL and views behind html web pages section. From the results page with the sentiment counts (see the Application usage overview section), the PageRank relevance of the scraped reviews can be calculated using one of the two links at the bottom of the page. The Django code behind this operation is discussed in the following section.