Since scraping results directly from the most relevant search engines, such as Google, Bing, and Yahoo, is against their terms of service, we need to take the initial review pages from their REST APIs (using a scraping service such as Crawlera, http://crawlera.com/, is also possible). We decided to use the Bing service, which allows 5,000 queries per month for free.
In order to do that, we registered with the Microsoft service to obtain the key needed to authorize the search (briefly, we followed the registration steps starting at https://www.bing.com). After that, we can write a function that retrieves as many URLs relevant to our query as we want:
import json
import urllib2

num_reviews = 30

def bing_api(query):
    keyBing = API_KEY  # get Bing key from: https://datamarket.azure.com/account/keys
    # the "[:-1]" removes the trailing newline which encode adds
    credentialBing = 'Basic ' + (':%s' % keyBing).encode('base64')[:-1]
    searchString = '%27' + query.replace(" ", '+') + '+movie+review%27'
    top = 50  # maximum number of results per request allowed by Bing
    reviews_urls = []
    if num_reviews < top:
        # a single request is enough
        offset = 0
        url = 'https://api.datamarket.azure.com/Bing/Search/Web?' + \
              'Query=%s&$top=%d&$skip=%d&$format=json' % (searchString,
                                                          num_reviews, offset)
        request = urllib2.Request(url)
        request.add_header('Authorization', credentialBing)
        requestOpener = urllib2.build_opener()
        response = requestOpener.open(request)
        results = json.load(response)
        reviews_urls = [d['Url'] for d in results['d']['results']]
    else:
        # page through the results, 50 at a time
        nqueries = int(float(num_reviews) / top) + 1
        for i in xrange(nqueries):
            offset = top * i
            if i == nqueries - 1:
                top = num_reviews - offset  # remainder on the last page
            url = 'https://api.datamarket.azure.com/Bing/Search/Web?' + \
                  'Query=%s&$top=%d&$skip=%d&$format=json' % (searchString,
                                                              top, offset)
            request = urllib2.Request(url)
            request.add_header('Authorization', credentialBing)
            requestOpener = urllib2.build_opener()
            response = requestOpener.open(request)
            results = json.load(response)
            reviews_urls += [d['Url'] for d in results['d']['results']]
    return reviews_urls
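The paging arithmetic inside bing_api can be checked in isolation. The helper below, paging_plan, is a hypothetical function (not part of the book's code) that reproduces the ($top, $skip) pairs the function sends to the Bing endpoint for a given total number of requested results:

```python
def paging_plan(num_reviews, page_size=50):
    """Hypothetical helper: list the ($top, $skip) pairs that
    bing_api's paging logic would request from the Bing endpoint."""
    if num_reviews < page_size:
        return [(num_reviews, 0)]
    nqueries = int(float(num_reviews) / page_size) + 1
    plan = []
    top = page_size
    for i in range(nqueries):
        offset = page_size * i
        if i == nqueries - 1:
            top = num_reviews - offset  # remainder on the last page
        plan.append((top, offset))
    return plan

print(paging_plan(30))   # [(30, 0)]
print(paging_plan(120))  # [(50, 0), (50, 50), (20, 100)]
```

Note one quirk this makes visible: when num_reviews is an exact multiple of the page size (say 100), the last planned request asks for $top=0, which returns nothing; the totals still come out right, but one request is wasted.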
The API_KEY parameter is taken from the Microsoft account, query is a string that specifies the movie name, and num_reviews (here set to 30) is the total number of URLs returned by the Bing API. With the list of URLs that contain the reviews, we can now set up a scraper to extract the title and the review text from each web page using Scrapy.
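Before wiring up Scrapy, the extraction step itself can be sketched with the standard library alone. The ReviewExtractor class and extract_review function below are simplified stand-ins of our own (not the book's Scrapy spider): they pull the <title> text and all <p> text out of a downloaded page.

```python
# Simplified stand-in for the Scrapy extraction step, using only the
# standard library. The try/except import keeps it runnable on both
# Python 2 (as in the rest of this chapter) and Python 3.
try:
    from HTMLParser import HTMLParser        # Python 2
except ImportError:
    from html.parser import HTMLParser       # Python 3

class ReviewExtractor(HTMLParser):
    """Collects the <title> text and the text of every <p> element."""
    def __init__(self):
        HTMLParser.__init__(self)
        self.title = ''
        self.paragraphs = []
        self._tag = None  # tag we are currently collecting text for

    def handle_starttag(self, tag, attrs):
        if tag in ('title', 'p'):
            self._tag = tag
            if tag == 'p':
                self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == 'title':
            self.title += data
        elif self._tag == 'p':
            self.paragraphs[-1] += data

def extract_review(html):
    """Return (title, review_text) for one downloaded review page."""
    parser = ReviewExtractor()
    parser.feed(html)
    return parser.title.strip(), ' '.join(p.strip() for p in parser.paragraphs)

page = ('<html><head><title>Movie review</title></head>'
        '<body><p>Great film.</p></body></html>')
print(extract_review(page))  # ('Movie review', 'Great film.')
```

A real Scrapy spider would do the same with CSS or XPath selectors and per-site rules, since review sites wrap their text in very different markup; this sketch only shows the shape of the output we need (a title plus the concatenated review text).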