Scrapy setup and the application code

Scrapy is a Python library used to extract content from web pages or to crawl the pages linked to a given web page (see the Web crawlers (or spiders) section of Chapter 4, Web Mining Techniques, for more details). To install the library, type the following in the terminal:

sudo pip install Scrapy 

To install the scrapy executable in the bin folder, also run:

sudo easy_install scrapy

From the movie_reviews_analyzer_app folder, we initialize our Scrapy project as follows:

scrapy startproject scrapy_spider

This command will create the following tree inside the scrapy_spider folder:

├── __init__.py
├── items.py
├── pipelines.py
├── settings.py
├── spiders
│   ├── __init__.py

The pipelines.py and items.py files manage how the scraped data is stored and manipulated, and they will be discussed later in the Spiders and Integrate Django with Scrapy sections. The settings.py file sets the parameters each spider (or crawler) defined in the spiders folder uses to operate. In the following two sections, we describe the main parameters and spiders used in this application.

Scrapy settings

The settings.py file collects all the parameters used by each spider in the Scrapy project to scrape web pages. The main parameters are as follows (a sample settings.py snippet follows the list):

  • DEPTH_LIMIT: The maximum depth (number of link hops) crawled from an initial URL. The default is 0, which means that no limit is set.
  • LOG_ENABLED: Whether Scrapy is allowed to log to the terminal while executing; the default is True.
  • ITEM_PIPELINES = {'scrapy_spider.pipelines.ReviewPipeline': 1000,}: The path of the pipeline class used to manipulate the data extracted from each web page (together with its priority).
  • CONCURRENT_ITEMS = 200: The number of concurrent items processed in the pipeline.
  • CONCURRENT_REQUESTS = 5000: The maximum number of simultaneous requests handled by Scrapy.
  • CONCURRENT_REQUESTS_PER_DOMAIN = 3000: The maximum number of simultaneous requests handled by Scrapy for each specified domain.
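
For reference, a settings.py reflecting these parameters might look like the following sketch (the DEPTH_LIMIT value is only an example; the remaining values are the ones listed above):

BOT_NAME = 'scrapy_spider'                 # defaults generated by scrapy startproject
SPIDER_MODULES = ['scrapy_spider.spiders']
NEWSPIDER_MODULE = 'scrapy_spider.spiders'

DEPTH_LIMIT = 2                            # example: follow links up to two hops from each start URL
LOG_ENABLED = True                         # print Scrapy logs to the terminal

ITEM_PIPELINES = {'scrapy_spider.pipelines.ReviewPipeline': 1000,}

CONCURRENT_ITEMS = 200                     # items processed concurrently in the pipeline
CONCURRENT_REQUESTS = 5000                 # maximum simultaneous requests
CONCURRENT_REQUESTS_PER_DOMAIN = 3000      # maximum simultaneous requests per domain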

The larger the depth, the more pages are scraped and, consequently, the longer the scraping takes. To speed up the process, you can set high values on the last three parameters. In this application (in the spiders folder), we define two spiders: a scraper to extract data from each movie review URL (movie_link_results.py) and a crawler to generate a graph of the web pages linked to the initial movie review URL (recursive_link_results.py).

Scraper

The scraper in movie_link_results.py looks as follows:

from newspaper import Article
from urlparse import urlparse
from scrapy.selector import Selector
from scrapy import Spider
from scrapy.spiders import BaseSpider,CrawlSpider, Rule
from scrapy.http import Request
from scrapy_spider import settings
from scrapy_spider.items import PageItem,SearchItem

unwanted_domains = ['youtube.com','www.youtube.com']
from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))

def CheckQueryinReview(keywords,title,content):
    content_list = map(lambda x:x.lower(),content.split(' '))
    title_list = map(lambda x:x.lower(),title.split(' '))
    words = content_list+title_list
    for k in keywords:
        if k in words:
            return True
    return False

class Search(Spider):
    name = 'scrapy_spider_reviews'
    
    def __init__(self,url_list,search_key):#specified by -a
        self.search_key = search_key
        self.keywords = [w.lower() for w in search_key.split(" ") if w not in stopwords]
        self.start_urls =url_list.split(',')
        super(Search, self).__init__(url_list)
    
    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_site,dont_filter=True)
                        
    def parse_site(self, response):
        ## Get the selector for xpath parsing or from newspaper
        
        def crop_emptyel(arr):
            return [u for u in arr if u!=' ']
        
        domain = urlparse(response.url).hostname
        a = Article(response.url)
        a.download()
        a.parse()
        title = a.title.encode('ascii','ignore').replace('\n','')
        sel = Selector(response)
        if title==None:
            title = sel.xpath('//title/text()').extract()
            if len(title)>0:
                title = title[0].encode('utf-8').strip().lower()
                
        content = a.text.encode('ascii','ignore').replace('\n','')
        if content == None:
            content = 'none'
            if len(crop_emptyel(sel.xpath('//div//article//p/text()').extract()))>1:
                contents = crop_emptyel(sel.xpath('//div//article//p/text()').extract())
                print 'divarticle'
            #.... other xpath extraction rules for popular domains (omitted here; see the GitHub file)
            elif len(crop_emptyel(sel.xpath('/html/head/meta[@name="description"]/@content').extract()))>0:
                contents = crop_emptyel(sel.xpath('/html/head/meta[@name="description"]/@content').extract())
            content = ' '.join([c.encode('utf-8') for c in contents]).strip().lower()
                
        #get search item 
        search_item = SearchItem.django_model.objects.get(term=self.search_key)
        #save item
        if not PageItem.django_model.objects.filter(url=response.url).exists():
            if len(content) > 0:
                if CheckQueryinReview(self.keywords,title,content):
                    if domain not in unwanted_domains:
                        newpage = PageItem()
                        newpage['searchterm'] = search_item
                        newpage['title'] = title
                        newpage['content'] = content
                        newpage['url'] = response.url
                        newpage['depth'] = 0
                        newpage['review'] = True
                        #newpage.save()
                        return newpage  
        else:
            return None

We can see that the Search class inherits from Scrapy's Spider class, and the following methods have to be defined to override the standard ones:

  • __init__: The constructor of the spider needs to define the start_urls list, which contains the URLs to extract content from. In addition, we have custom variables such as search_key and keywords that store the information related to the query of the movie's title used on the search engine API.
  • start_requests: This function is triggered when the spider is called and it declares what to do for each URL in the start_urls list; for each URL, the custom parse_site function will be called (instead of the default parse function).
  • parse_site: This is a custom function to parse the data from each URL. To extract the title of the review and its text content, we use the newspaper library (sudo pip install newspaper) or, if it fails, we parse the HTML file directly using some defined rules to avoid the noise due to undesired tags (each rule structure is defined with the sel.xpath command). To achieve this result, we select some popular domains (rottentomatoes, cnn, and so on) and ensure the parsing is able to extract the content from these websites (not all the extraction rules are displayed in the preceding code, but they can be found as usual in the GitHub file). The data is then stored in a page Django model using the related Scrapy item and the ReviewPipeline function (see the following section).
  • CheckQueryinReview: This is a custom function to check whether the movie title (from the query) is contained in the content or title of each web page; a short standalone illustration follows this list.
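
For instance, the keyword extraction performed in __init__ and the check carried out by CheckQueryinReview behave as follows (a standalone illustration with a hypothetical query; it assumes the NLTK stopwords corpus has already been downloaded with nltk.download('stopwords')):

from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))

search_key = 'the martian'    # hypothetical query passed to the spider with -a search_key
keywords = [w.lower() for w in search_key.split(' ') if w not in stopwords]
print keywords                # ['martian'] -- 'the' is removed as a stopword

title = 'The Martian Review'
content = 'Ridley Scott returns to space with Matt Damon'
words = map(lambda x: x.lower(), content.split(' ')) + map(lambda x: x.lower(), title.split(' '))
print any(k in words for k in keywords)    # True: 'martian' appears, so the page is kept as a review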

To run the spider, we need to type in the following command from the scrapy_spider (internal) folder:

scrapy crawl scrapy_spider_reviews -a url_list=listname -a search_key=keyname
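
For example, with a hypothetical comma-separated list of review URLs and the query the martian, the call becomes:

scrapy crawl scrapy_spider_reviews -a url_list=http://www.example.com/review1,http://www.example.com/review2 -a search_key="the martian"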

Pipelines

The pipelines define what to do when a new page is scraped by the spider. In the preceding case, the parse_site function returns a PageItem object, which triggers the following pipeline (pipelines.py):

class ReviewPipeline(object):
    def process_item(self, item, spider):
        #if spider.name == 'scrapy_spider_reviews':#not working
        item.save()
        return item

This class simply saves each item (a new page in the spider notation).
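
The PageItem, SearchItem, and LinkItem classes used by the spiders and by this pipeline are Scrapy items bound to the application's Django models, which is what makes item.save() and the django_model.objects queries work. As a minimal sketch, assuming Page, Search, and Link models defined in a hypothetical Django app called pages and the DjangoItem class (from the scrapy-djangoitem package or, in older Scrapy versions, scrapy.contrib.djangoitem), items.py could look like this:

from scrapy_djangoitem import DjangoItem
from pages.models import Page, Search, Link   # hypothetical app and model names

class SearchItem(DjangoItem):
    django_model = Search   # exposes SearchItem.django_model.objects, as used in the spiders

class PageItem(DjangoItem):
    django_model = Page     # item fields (searchterm, title, content, url, depth, review) come from the model

class LinkItem(DjangoItem):
    django_model = Link     # item fields (searchterm, from_id, to_id) come from the model

The actual item and model definitions used by the application are covered in the Integrate Django with Scrapy section.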

Crawler

As we showed in the overview (the preceding section), the relevance of the review is calculated using the PageRank algorithm after we have stored all the linked pages starting from the review's URL. The crawler recursive_link_results.py performs this operation:

#from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request

from scrapy_spider.items import PageItem,LinkItem,SearchItem

class Search(CrawlSpider):
    name = 'scrapy_spider_recursive'
    
    def __init__(self,url_list,search_id):#specified by -a
    
        #REMARK: if allowed_domains is not set then ALL domains are allowed
        self.start_urls = url_list.split(',')
        self.search_id = int(search_id)
        
        #allow any link but the ones with different font size(repetitions)
        self.rules = (
            Rule(LinkExtractor(allow=(),deny=('fontSize=*','infoid=*','SortBy=*', ),unique=True), callback='parse_item', follow=True), 
            )
        super(Search, self).__init__(url_list)

    def parse_item(self, response):
        sel = Selector(response)
        
        ## Get meta info from website
        title = sel.xpath('//title/text()').extract()
        if len(title)>0:
            title = title[0].encode('utf-8')
            
        contents = sel.xpath('/html/head/meta[@name="description"]/@content').extract()
        content = ' '.join([c.encode('utf-8') for c in contents]).strip()

        fromurl = response.request.headers['Referer']
        tourl = response.url
        depth = response.request.meta['depth']
        
        #get search item 
        search_item = SearchItem.django_model.objects.get(id=self.search_id)
        #newpage
        if not PageItem.django_model.objects.filter(url=tourl).exists():
            newpage = PageItem()
            newpage['searchterm'] = search_item
            newpage['title'] = title
            newpage['content'] = content
            newpage['url'] = tourl
            newpage['depth'] = depth
            newpage.save()#can't use the pipeline because the execution can finish here
        
        #get from_id,to_id
        from_page = PageItem.django_model.objects.get(url=fromurl)
        from_id = from_page.id
        to_page = PageItem.django_model.objects.get(url=tourl)
        to_id = to_page.id
        
        #newlink
        if not LinkItem.django_model.objects.filter(from_id=from_id).filter(to_id=to_id).exists():
            newlink = LinkItem()
            newlink['searchterm'] = search_item
            newlink['from_id'] = from_id
            newlink['to_id'] = to_id
            newlink.save()

The Search class inherits from Scrapy's CrawlSpider class and, as in the scraper case, the following methods have to be defined to override the standard ones:

  • __init__: This is the constructor of the class. The start_urls parameter defines the starting URLs from which the spider will crawl until the DEPTH_LIMIT value is reached. The rules parameter sets the types of URL allowed/denied for scraping (in this case, copies of the same page that differ only in font size are disregarded), and it defines the function to call to manipulate each retrieved page (parse_item). Also, a custom variable search_id is defined, which is needed to store the ID of the query together with the other data.
  • parse_item: This is a custom function called to store the important data from each retrieved page. A new Django item of the Page model (see the following section) is created from each page, containing the title and content of the page (extracted with the xpath HTML parser). To perform the PageRank algorithm, the connection between the page that links to each page and the page itself is saved as an object of the Link model using the related Scrapy item (see the following sections); an illustrative sketch of how these stored links can feed the PageRank computation follows this list.

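As an illustration of why these from_id/to_id pairs are stored, the following sketch (not the application's implementation) builds a directed graph from the saved Link objects and scores the pages with PageRank, using the networkx library purely as an example; the Link model, its fields, and the Django setup are assumed as in the earlier items.py sketch:

import networkx as nx
from pages.models import Link        # hypothetical app and model names, as before

search_id = 5                        # hypothetical ID of the current Search object

# one directed edge per stored link, from the referring page to the linked page
graph = nx.DiGraph()
for link in Link.objects.filter(searchterm_id=search_id):
    graph.add_edge(link.from_id, link.to_id)

scores = nx.pagerank(graph)          # dict: page ID -> PageRank score
top_pages = sorted(scores.items(), key=lambda x: x[1], reverse=True)[:5]
print top_pages                      # the five most relevant page IDs for this search
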
To run the crawler, we need to type the following from the (internal) scrapy_spider folder:

scrapy crawl scrapy_spider_recursive -a url_list=listname -a search_id=keyname
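
Here, listname is again a comma-separated list of starting URLs, while search_id is the integer ID of the Search object created by the Django application for the current query (the spider converts it with int(search_id)). For example, with hypothetical values:

scrapy crawl scrapy_spider_recursive -a url_list=http://www.example.com/review1 -a search_id=5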