Chapter 8. Sentiment Analyser Application for Movie Reviews

In this chapter, we describe an application that determines the sentiment of movie reviews using the algorithms and methods described throughout the book. In addition, the Scrapy library is used to collect reviews from different websites through a search engine API (the Bing search engine). The text and the title of each movie review are extracted using the newspaper library or by following some predefined extraction rules for the HTML page. The sentiment of each review is determined using a naive Bayes classifier on the most informative words (selected using the χ² measure) in the same way as in Chapter 4, Web Mining Techniques. Also, for completeness, the rank of each page related to each movie query is calculated using the PageRank algorithm discussed in Chapter 4, Web Mining Techniques.

This chapter discusses the code used to build the application, including the Django models and views and the Scrapy scraper used to collect data from the web pages of the movie reviews. We start by giving an example of what the web application looks like, and then explain the search engine API used and how we include it in the application. We then describe how we collect the movie reviews, integrating the Scrapy library into Django, the models that store the data, and the main commands to manage the application. All the code discussed in this chapter is available in the author's GitHub repository, inside the chapter_8 folder, at https://github.com/ai2010/machine_learning_for_the_web/tree/master/chapter_8.
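As a reminder of the sentiment step summarized above, the following is a minimal sketch of χ² word selection followed by a naive Bayes classifier. It is not the application's code: the function names, the toy reviews, and the use of a hand-written multinomial model with add-one smoothing are all assumptions made for illustration, and the sketch assumes both labels occur in the training data.

```python
import math
from collections import Counter


def chi_squared(n_word_pos, n_word_neg, n_pos, n_neg):
    """Chi-squared score from a 2x2 table: (word / other words) x (pos / neg)."""
    a, b = n_word_pos, n_word_neg
    c, d = n_pos - a, n_neg - b
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0
    return (a + b + c + d) * (a * d - b * c) ** 2 / float(denom)


def most_informative_words(labeled_docs, k):
    """Return the set of k words with the highest chi-squared score."""
    counts = {"pos": Counter(), "neg": Counter()}
    for tokens, label in labeled_docs:
        counts[label].update(tokens)
    n_pos = sum(counts["pos"].values())
    n_neg = sum(counts["neg"].values())
    vocab = set(counts["pos"]) | set(counts["neg"])
    scores = {w: chi_squared(counts["pos"][w], counts["neg"][w], n_pos, n_neg)
              for w in vocab}
    # Ties are broken alphabetically so the result is deterministic.
    return set(sorted(vocab, key=lambda w: (-scores[w], w))[:k])


def train(labeled_docs, vocab):
    """Count documents per label and selected-word occurrences per label."""
    doc_counts = Counter()
    word_counts = {"pos": Counter(), "neg": Counter()}
    for tokens, label in labeled_docs:
        doc_counts[label] += 1
        word_counts[label].update(w for w in tokens if w in vocab)
    return doc_counts, word_counts


def classify(tokens, doc_counts, word_counts, vocab):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    total_docs = float(sum(doc_counts.values()))
    best_label, best_score = None, float("-inf")
    for label in ("neg", "pos"):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for w in tokens:
            if w in vocab:  # words that were not selected are ignored
                score += math.log((word_counts[label][w] + 1.0)
                                  / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy, made-up training reviews (already tokenized).
train_docs = [
    (["great", "fun", "movie"], "pos"),
    (["great", "acting", "movie"], "pos"),
    (["boring", "plot", "movie"], "neg"),
    (["bad", "boring", "movie"], "neg"),
]
selected = most_informative_words(train_docs, 6)
doc_counts, word_counts = train(train_docs, selected)

new_reviews = [["great", "acting", "plot", "movie"], ["boring", "movie"]]
labels = [classify(r, doc_counts, word_counts, selected) for r in new_reviews]
tally = Counter(labels)  # number of positive and negative reviews
```

The word movie appears equally often in both classes, so its χ² score is zero and it is not selected. The final tally corresponds to the kind of positive and negative counts the application reports for a query.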

Application usage overview

The home web page is as follows:

[Screenshot: home page of the application]

The user types in a movie name to find out the sentiment and relevance of its reviews. For example, we look for Batman vs Superman Dawn of Justice in the following screenshot:

[Screenshot: search results for Batman vs Superman Dawn of Justice]

The application collects 18 reviews through the Bing search engine, scrapes them using the Scrapy library, and analyzes their sentiment (15 positive and 3 negative). All data is stored in Django models, ready to be used to calculate the relevance of each page using the PageRank algorithm (the links at the bottom of the page, as seen in the preceding screenshot). In this case, using the PageRank algorithm, we have the following:

[Screenshot: pages ranked by the PageRank algorithm]

This is a list of the pages most relevant to our movie review search, obtained by setting the depth parameter of the scraping crawler to 2 (refer to the following section for further details). Note that to obtain good page relevance results, you have to crawl thousands of pages (the preceding screenshot shows results for around 50 crawled pages).
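To make the relevance calculation concrete, here is a minimal sketch of PageRank computed by power iteration on a tiny link graph. It is an illustration only, not the application's implementation: the function name, the damping factor of 0.85, the handling of pages without outgoing links, and the page names are all assumptions.

```python
def pagerank(links, damping=0.85, tol=1e-8, max_iter=100):
    """Power-iteration PageRank.

    links maps each page to the list of pages it links to.
    Returns a dict mapping each page to its rank (ranks sum to 1).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(max_iter):
        # Rank held by pages with no outgoing links is spread uniformly.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        for src, targets in links.items():
            if targets:
                share = damping * rank[src] / len(targets)
                for t in targets:
                    new_rank[t] += share
        delta = sum(abs(new_rank[p] - rank[p]) for p in pages)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical crawl result: four pages and the links found between them.
links = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(links)
ordered = sorted(ranks, key=ranks.get, reverse=True)  # most relevant first
```

In this toy graph, page c collects links from every other page and ends up ranked first, while page d, which nobody links to, is ranked last. With only a handful of pages the ranks say little, which is why a much larger crawl is needed for meaningful results.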

To write the application, we set up the server and the main Django app as usual (see Chapter 6, Getting Started with Django, and Chapter 7, Movie Recommendation System Web Application). First, we create a folder to store all our code, movie_reviews_analyzer_app, and then we initialize Django using the following commands:

mkdir movie_reviews_analyzer_app
cd movie_reviews_analyzer_app
django-admin startproject webmining_server
cd webmining_server
python manage.py startapp pages

We configure the settings.py file as we did in the Settings section of Chapter 6, Getting Started with Django, and the Application Setup section of Chapter 7, Movie Recommendation System Web Application (of course, in this case the name is webmining_server instead of server_movierecsys).

The sentiment analyzer application has its main views in the views.py file in the main webmining_server folder, instead of in the app (pages) folder as we did previously (see Chapter 6, Getting Started with Django, and Chapter 7, Movie Recommendation System Web Application), because the functions now relate to the general functioning of the server rather than to the specific app (pages).

The last operation to make the web service operational is to create a superuser account and go live with the server:

python manage.py createsuperuser  # for example, username admin and password admin
python manage.py runserver

Now that the structure of the application has been explained, we can discuss the different parts in more detail starting from the search engine API used to collect URLs.
