Scrapy is a Python library used to extract content from web pages or to crawl pages linked to a given web page (see the Web crawlers (or spiders) section of Chapter 4, Web Mining Techniques, for more details). To install the library, type the following in the terminal:
sudo pip install Scrapy
Install the executable in the bin folder:
sudo easy_install scrapy
From the movie_reviews_analyzer_app folder, we initialize our Scrapy project as follows:
scrapy startproject scrapy_spider
This command will create the following tree inside the scrapy_spider folder:
├── __init__.py
├── items.py
├── pipelines.py
├── settings.py
└── spiders
    ├── __init__.py
The pipelines.py and items.py files manage how the scraped data is stored and manipulated, and they will be discussed later in the Spiders and Integrate Django with Scrapy sections. The settings.py file sets the parameters each spider (or crawler) defined in the spiders folder uses to operate. In the following two sections, we describe the main parameters and spiders used in this application.
The settings.py file collects all the parameters used by each spider in the Scrapy project to scrape web pages. The main parameters are as follows:

DEPTH_LIMIT: The number of subsequent pages crawled following an initial URL. The default is 0, which means that no limit is set.

LOG_ENABLED: Whether Scrapy is allowed to log on the terminal while executing. The default is True.

ITEM_PIPELINES = {'scrapy_spider.pipelines.ReviewPipeline': 1000,}: The path of the pipeline function that manipulates the data extracted from each web page.

CONCURRENT_ITEMS = 200: The number of concurrent items processed in the pipeline.

CONCURRENT_REQUESTS = 5000: The maximum number of simultaneous requests handled by Scrapy.

CONCURRENT_REQUESTS_PER_DOMAIN = 3000: The maximum number of simultaneous requests handled by Scrapy for each specified domain.

The larger the depth limit, the more pages are scraped and, consequently, the longer scraping takes. To speed up the process, you can set high values for the last three parameters. In this application (the spiders folder), we set up two spiders: a scraper to extract data from each movie review URL (movie_link_results.py) and a crawler to generate a graph of web pages linked to the initial movie review URL (recursive_link_results.py).
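Putting these together, a minimal settings.py might look as follows. This is an illustrative sketch rather than the project's exact file; in particular, the DEPTH_LIMIT value of 2 is an assumption you should tune to your own needs and hardware.

```python
# scrapy_spider/settings.py -- illustrative sketch; the DEPTH_LIMIT value
# is an assumption, and the concurrency numbers follow the text above.
BOT_NAME = 'scrapy_spider'

DEPTH_LIMIT = 2        # follow links at most two hops from the start URL
LOG_ENABLED = True     # print Scrapy's log to the terminal

ITEM_PIPELINES = {
    'scrapy_spider.pipelines.ReviewPipeline': 1000,
}

CONCURRENT_ITEMS = 200
CONCURRENT_REQUESTS = 5000
CONCURRENT_REQUESTS_PER_DOMAIN = 3000
```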
The scraper in movie_link_results.py looks as follows:
from newspaper import Article
from urlparse import urlparse
from scrapy.selector import Selector
from scrapy import Spider
from scrapy.spiders import BaseSpider, CrawlSpider, Rule
from scrapy.http import Request
from scrapy_spider import settings
from scrapy_spider.items import PageItem, SearchItem

unwanted_domains = ['youtube.com', 'www.youtube.com']

from nltk.corpus import stopwords
stopwords = set(stopwords.words('english'))

def CheckQueryinReview(keywords, title, content):
    content_list = map(lambda x: x.lower(), content.split(' '))
    title_list = map(lambda x: x.lower(), title.split(' '))
    words = content_list + title_list
    for k in keywords:
        if k in words:
            return True
    return False

class Search(Spider):
    name = 'scrapy_spider_reviews'

    def __init__(self, url_list, search_key):  # specified by -a
        self.search_key = search_key
        self.keywords = [w.lower() for w in search_key.split(" ") if w not in stopwords]
        self.start_urls = url_list.split(',')
        super(Search, self).__init__(url_list)

    def start_requests(self):
        for url in self.start_urls:
            yield Request(url=url, callback=self.parse_site, dont_filter=True)

    def parse_site(self, response):
        ## Get the selector for xpath parsing or from newspaper
        def crop_emptyel(arr):
            return [u for u in arr if u != ' ']
        domain = urlparse(response.url).hostname
        a = Article(response.url)
        a.download()
        a.parse()
        title = a.title.encode('ascii', 'ignore').replace(' ', '')
        sel = Selector(response)
        if title == None:
            title = sel.xpath('//title/text()').extract()
            if len(title) > 0:
                title = title[0].encode('utf-8').strip().lower()
        content = a.text.encode('ascii', 'ignore').replace(' ', '')
        if content == None:
            content = 'none'
        if len(crop_emptyel(sel.xpath('//div//article//p/text()').extract())) > 1:
            contents = crop_emptyel(sel.xpath('//div//article//p/text()').extract())
            print 'divarticle'
        # ... (further extraction rules omitted; see the GitHub file)
        elif len(crop_emptyel(sel.xpath('/html/head/meta[@name="description"]/@content').extract())) > 0:
            contents = crop_emptyel(sel.xpath('/html/head/meta[@name="description"]/@content').extract())
        content = ' '.join([c.encode('utf-8') for c in contents]).strip().lower()
        # get search item
        search_item = SearchItem.django_model.objects.get(term=self.search_key)
        # save item
        if not PageItem.django_model.objects.filter(url=response.url).exists():
            if len(content) > 0:
                if CheckQueryinReview(self.keywords, title, content):
                    if domain not in unwanted_domains:
                        newpage = PageItem()
                        newpage['searchterm'] = search_item
                        newpage['title'] = title
                        newpage['content'] = content
                        newpage['url'] = response.url
                        newpage['depth'] = 0
                        newpage['review'] = True
                        #newpage.save()
                        return newpage
        else:
            return None
We can see that the Search class inherits from Scrapy's Spider class, and the following methods have to be defined to override the defaults:
__init__: The constructor of the spider needs to define the start_urls list that contains the URLs to extract content from. In addition, we have custom variables, such as search_key and keywords, that store the information related to the query of the movie's title used on the search engine API.

start_requests: This function is triggered when the spider is called, and it declares what to do for each URL in the start_urls list; for each URL, the custom parse_site function will be called (instead of the default parse function).

parse_site: This is a custom function to parse data from each URL. To extract the title of the review and its text content, we use the newspaper library (sudo pip install newspaper) or, if that fails, we parse the HTML file directly using some defined rules to avoid the noise due to undesired tags (each rule structure is defined with the sel.xpath command). To achieve this result, we select some popular domains (rottentomatoes, cnn, and so on) and ensure the parsing is able to extract the content from these websites (not all the extraction rules are displayed in the preceding code, but they can be found as usual in the GitHub file). The data is then stored in a page Django model using the related Scrapy item and the ReviewPipeline function (see the following section).

CheckQueryinReview: This is a custom function to check whether the movie title (from the query) is contained in the content or title of each web page.

To run the spider, we need to type the following command from the (internal) scrapy_spider folder:
scrapy crawl scrapy_spider_reviews -a url_list=listname -a search_key=keyname
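Each -a key=value pair on the command line is passed by Scrapy as a keyword argument to the spider's constructor. The stripped-down class below (a hypothetical sketch without the Scrapy machinery; the URLs and query are made up) shows the effect on start_urls and search_key:

```python
# Stripped-down sketch (no Scrapy machinery): each -a key=value from the
# command line arrives as a keyword argument of the spider's constructor.
class SearchSketch(object):
    name = 'scrapy_spider_reviews'

    def __init__(self, url_list, search_key):
        # same parsing as the real spider: comma-separated URL list
        self.start_urls = url_list.split(',')
        self.search_key = search_key

spider = SearchSketch(url_list='http://site1.com/review,http://site2.com/review',
                      search_key='the great movie')
print(spider.start_urls)  # two start URLs parsed from the single argument
```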
The pipelines define what to do when a new page is scraped by the spider. In the preceding case, the parse_site function returns a PageItem object, which triggers the following pipeline (pipelines.py):
class ReviewPipeline(object):
    def process_item(self, item, spider):
        #if spider.name == 'scrapy_spider_reviews':#not working
        item.save()
        return item
This class simply saves each item (a new page in the spider notation).
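The contract here is simply that process_item receives every item the spider returns and must return it (or discard it). A self-contained sketch of this flow, with a hypothetical FakeItem standing in for the Django-backed Scrapy item, behaves as follows:

```python
# Self-contained sketch of the pipeline contract. FakeItem is a made-up
# stand-in for the Django-backed Scrapy item; only save() matters here.
class FakeItem(dict):
    def __init__(self, *args, **kwargs):
        super(FakeItem, self).__init__(*args, **kwargs)
        self.saved = False

    def save(self):
        self.saved = True  # the real item would persist a Django model row

class ReviewPipelineSketch(object):
    def process_item(self, item, spider):
        item.save()
        return item

item = ReviewPipelineSketch().process_item(FakeItem(title='a review'), spider=None)
print(item.saved)  # the item was persisted and passed through the pipeline
```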
As we showed in the overview (the preceding section), the relevance of a review is calculated using the PageRank algorithm after we have stored all the linked pages starting from the review's URL. The crawler recursive_link_results.py performs this operation:
#from scrapy.spider import Spider
from scrapy.selector import Selector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.http import Request
from scrapy_spider.items import PageItem, LinkItem, SearchItem

class Search(CrawlSpider):
    name = 'scrapy_spider_recursive'

    def __init__(self, url_list, search_id):  # specified by -a
        # REMARK: if allowed_domains is not set then ALL are allowed!!!
        self.start_urls = url_list.split(',')
        self.search_id = int(search_id)
        # allow any link but the ones with different font size (repetitions)
        self.rules = (
            Rule(LinkExtractor(allow=(), deny=('fontSize=*', 'infoid=*', 'SortBy=*', ), unique=True),
                 callback='parse_item', follow=True),
        )
        super(Search, self).__init__(url_list)

    def parse_item(self, response):
        sel = Selector(response)
        ## Get meta info from website
        title = sel.xpath('//title/text()').extract()
        if len(title) > 0:
            title = title[0].encode('utf-8')
        contents = sel.xpath('/html/head/meta[@name="description"]/@content').extract()
        content = ' '.join([c.encode('utf-8') for c in contents]).strip()
        fromurl = response.request.headers['Referer']
        tourl = response.url
        depth = response.request.meta['depth']
        # get search item
        search_item = SearchItem.django_model.objects.get(id=self.search_id)
        # newpage
        if not PageItem.django_model.objects.filter(url=tourl).exists():
            newpage = PageItem()
            newpage['searchterm'] = search_item
            newpage['title'] = title
            newpage['content'] = content
            newpage['url'] = tourl
            newpage['depth'] = depth
            newpage.save()  # can't use the pipeline because execution can finish here
        # get from_id, to_id
        from_page = PageItem.django_model.objects.get(url=fromurl)
        from_id = from_page.id
        to_page = PageItem.django_model.objects.get(url=tourl)
        to_id = to_page.id
        # newlink
        if not LinkItem.django_model.objects.filter(from_id=from_id).filter(to_id=to_id).exists():
            newlink = LinkItem()
            newlink['searchterm'] = search_item
            newlink['from_id'] = from_id
            newlink['to_id'] = to_id
            newlink.save()
The Search class inherits from Scrapy's CrawlSpider class, and the following methods have to be defined to override the defaults (as in the spider case):
__init__: This is the constructor of the class. The start_urls parameter defines the starting URLs from which the spider crawls until the DEPTH_LIMIT value is reached. The rules parameter sets the types of URLs allowed/denied for scraping (in this case, the same page with different font sizes is disregarded), and it defines the function to call to manipulate each retrieved page (parse_item). Also, a custom variable, search_id, is defined, which is needed to store the ID of the query along with the other data.

parse_item: This is a custom function called to store the important data from each retrieved page. A new Django item of the Page model (see the following section) is created for each page, containing the title and content of the page (extracted using the xpath HTML parser). To perform the PageRank algorithm, the connection between the page that links to each page and the page itself is saved as an object of the Link model using the related Scrapy item (see the following sections).

To run the crawler, we need to type the following from the (internal) scrapy_spider folder:
scrapy crawl scrapy_spider_recursive -a url_list=listname -a search_id=keyname
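Once the Link objects (from_id, to_id pairs) are stored, the PageRank step mentioned in the overview can be computed over them. The following is a minimal power-iteration sketch on a toy edge list; the edges are illustrative rather than data from the application, and the application itself may compute PageRank with a different implementation:

```python
# Minimal power-iteration PageRank over (from_id, to_id) pairs, the same
# shape as the crawler's stored Link items. Toy edge list for illustration.
def pagerank(edges, d=0.85, iters=50):
    nodes = sorted({n for edge in edges for n in edge})
    out_degree = {n: 0 for n in nodes}
    for src, dst in edges:
        out_degree[src] += 1
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # base rank from random jumps, plus rank flowing along the links
        new = {n: (1 - d) / len(nodes) for n in nodes}
        for src, dst in edges:
            new[dst] += d * rank[src] / out_degree[src]
        # spread the rank of dangling pages (no outgoing links) evenly
        dangling = sum(rank[n] for n in nodes if out_degree[n] == 0)
        for n in nodes:
            new[n] += d * dangling / len(nodes)
        rank = new
    return rank

ranks = pagerank([(1, 2), (2, 3), (3, 1), (1, 3)])
```

Page 3 receives links from both page 1 and page 2, so it ends up with a higher rank than page 2, which is linked only once.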