Search engine choice and the application code

Since scraping results directly from the most relevant search engines, such as Google, Bing, and Yahoo, is against their terms of service, we need to retrieve the initial review pages through their REST APIs (using a scraping service such as Crawlera, http://crawlera.com/, is also possible). We decided to use the Bing Search API, which allows 5,000 queries per month for free.

In order to do that, we register with the Microsoft Azure Marketplace to obtain the key needed to authorize the search. Briefly, we followed these steps:

  1. Register online on https://datamarket.azure.com.
  2. In My Account, take note of the Primary Account Key; it is used to build the HTTP Basic credential, as sketched after these steps.
  3. Register a new application (under DEVELOPERS | REGISTER; set the Redirect URI to https://www.bing.com).
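The Primary Account Key is sent as the password of an HTTP Basic credential with an empty username. A minimal sketch of how that header is built (API_KEY is a placeholder for your own key; the bing_api function below reconstructs the same header):

import base64

API_KEY = 'xxxxxxxxxxxxxxxx'  # placeholder: paste your Primary Account Key here
# The Datamarket API expects the key as the password of an HTTP Basic credential
# with an empty username; bing_api() below builds this same header.
credentialBing = 'Basic ' + base64.b64encode(':' + API_KEY)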

After that, we can write a function that retrieves as many URLs relevant to our query as we want:

import json
import urllib2

num_reviews = 30
def bing_api(query):
    keyBing = API_KEY        # get the Bing key from: https://datamarket.azure.com/account/keys
    credentialBing = 'Basic ' + (':%s' % keyBing).encode('base64')[:-1]  # [:-1] removes the trailing '\n' that encode adds
    searchString = '%27' + query.replace(" ", '+') + '+movie+review%27'  # quoted query: "<movie name> movie review"
    top = 50  # maximum number of results per request allowed by Bing

    reviews_urls = []
    if num_reviews < top:
        offset = 0
        url = 'https://api.datamarket.azure.com/Bing/Search/Web?' + \
              'Query=%s&$top=%d&$skip=%d&$format=json' % (searchString, num_reviews, offset)

        request = urllib2.Request(url)
        request.add_header('Authorization', credentialBing)
        requestOpener = urllib2.build_opener()
        response = requestOpener.open(request)
        results = json.load(response)
        reviews_urls = [d['Url'] for d in results['d']['results']]
    else:
        # paginate: one request per block of 50 results, the last one only for the remainder
        nqueries = int(float(num_reviews)/top) + 1
        for i in xrange(nqueries):
            offset = top*i
            if i == nqueries-1:
                top = num_reviews - offset
            url = 'https://api.datamarket.azure.com/Bing/Search/Web?' + \
                  'Query=%s&$top=%d&$skip=%d&$format=json' % (searchString, top, offset)

            request = urllib2.Request(url)
            request.add_header('Authorization', credentialBing)
            requestOpener = urllib2.build_opener()
            response = requestOpener.open(request)
            results = json.load(response)
            reviews_urls += [d['Url'] for d in results['d']['results']]
    return reviews_urls

The API_KEY parameter is taken from the Microsoft account (the Primary Account Key from step 2), query is a string that specifies the movie name, and num_reviews = 30 is the total number of URLs returned by the Bing API. With the list of URLs that contain the reviews, we can now set up a scraper with Scrapy to extract the title and the review text from each web page.
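As a quick usage sketch, assuming API_KEY has been set as described above, the function can be called with a movie title (the title below is only an illustrative value) to collect the list of URLs that will be fed to the scraper:

query = 'The Martian'            # illustrative movie title, not from the original example
reviews_urls = bing_api(query)   # list of up to num_reviews result URLs
print len(reviews_urls), reviews_urls[:3]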
