Crawling links present in a web page

At times you would like to find a specific keyword in a web page. In a web browser, you can use the in-page search facility to locate it, and most browsers will highlight the matches. In a more involved situation, you may want to dig deeper and follow every URL present in the page, searching each of those pages for the term as well. This recipe automates that task for you.

How to do it...

Let us write a search_links() function that takes three arguments: the search URL, the depth of the recursive search, and the search key/term. Since every URL may contain links, and the content behind those links may contain more URLs to crawl, we limit the recursion with a depth argument. Once that depth is reached, no further recursive search is done.

Listing 6.7 gives the code for crawling links present in a web page, as follows:

#!/usr/bin/env python
# Python Network Programming Cookbook -- Chapter - 6
# This program is optimized for Python 2.7.
# It may run on any other version with/without modifications.

import argparse
import sys
import httplib
import re

processed = []

def search_links(url, depth, search):
  # Process http links that are not processed yet
  url_is_processed = (url in processed)
  if (url.startswith("http://") and (not url_is_processed)):
    processed.append(url)
    url = host = url.replace("http://", "", 1)
    path = "/"

    urlparts = url.split("/")
    if (len(urlparts) > 1):
      host = urlparts[0]
      path = url.replace(host, "", 1)

    # Start crawling
    print "Crawling URL path:%s%s " %(host, path)
    conn = httplib.HTTPConnection(host)
    conn.request("GET", path)
    result = conn.getresponse()

    # Find all the links in the response body
    contents = result.read()
    all_links = re.findall('href="(.*?)"', contents)

    # Check the page contents for the search term
    if (search in contents):
      print "Found " + search + " at " + url

    print " ==> %s: processing %s links" %(str(depth), str(len(all_links)))
    for href in all_links:
      # Convert relative URLs into absolute ones
      if (href.startswith("/")):
        href = "http://" + host + href

      # Recurse into each link until the depth limit is reached
      if (depth > 0):
        search_links(href, depth-1, search)
  else:
    print "Skipping link: %s ..." %url


if __name__ == '__main__':
  parser = argparse.ArgumentParser(description='Webpage link crawler')
  parser.add_argument('--url', action="store", dest="url", required=True)
  parser.add_argument('--query', action="store", dest="query", required=True)
  # Parse --depth as an integer so that depth-1 works when the option is given on the command line
  parser.add_argument('--depth', action="store", dest="depth", type=int, default=2)

  given_args = parser.parse_args()

  try:
    search_links(given_args.url, given_args.depth, given_args.query)
  except KeyboardInterrupt:
    print "Aborting search by user request."

If you run this script to search www.python.org for python, you will see an output similar to the following:

$ python 6_7_python_link_crawler.py --url='http://python.org' --query='python' 
Crawling URL path:python.org/ 
Found python at python.org 
 ==> 2: processing 123 links 
Crawling URL path:www.python.org/channews.rdf 
Found python at www.python.org/channews.rdf 
 ==> 1: processing 30 links 
Crawling URL path:www.python.org/download/releases/3.4.0/ 
Found python at www.python.org/download/releases/3.4.0/ 
 ==> 0: processing 111 links 
Skipping link: https://ep2013.europython.eu/blog/2013/05/15/epc20145-call-proposals ... 
Crawling URL path:www.python.org/download/releases/3.2.5/ 
Found python at www.python.org/download/releases/3.2.5/ 
 ==> 0: processing 113 links 
...
Skipping link: http://www.python.org/download/releases/3.2.4/ ... 
Crawling URL path:wiki.python.org/moin/WikiAttack2013 
^CAborting search by user request. 

How it works...

This recipe takes three command-line inputs: the search URL (--url), the query string (--query), and the depth of recursion (--depth). These inputs are processed by the argparse module.

When the search_links() function is called with these arguments, it recursively iterates over all the links found on the given web page. Because the search may take a long time to finish, you may want to exit prematurely. For this reason, the call to search_links() is placed inside a try-except block that catches the user's keyboard interrupt, such as Ctrl + C.

The search_links() function keeps track of visited links via a list called processed. This list is made global so that all recursive invocations of the function can access it.
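
Any URL is therefore crawled at most once; a repeated URL falls through to the else branch and is reported as skipped. The following minimal sketch (a hypothetical standalone example, not part of the recipe) isolates that duplicate check:

# Hypothetical standalone illustration of the duplicate check in search_links()
processed = []

def already_processed(url):
  # Same membership test as url_is_processed in the recipe
  if url in processed:
    return True
  processed.append(url)
  return False

print already_processed("http://python.org")   # False: first visit, the URL is recorded
print already_processed("http://python.org")   # True: a second visit would be skipped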

Within each call, only HTTP URLs are processed in order to avoid potential SSL certificate errors. The URL is split into a host and a path, and the crawling is initiated with the HTTPConnection() class of httplib. A GET request is made, and the response body is then processed with the regular expression module re, which collects all the links from the page. The contents are also examined for the search term; if the term is found, the match is printed.
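
To make the link-extraction step concrete, here is a minimal sketch of the same href pattern applied to a small piece of page content (the HTML snippet is made up for illustration):

import re

# A made-up fragment of page content for illustration
contents = '<a href="/about/">About</a> <a href="http://docs.python.org/">Docs</a>'

# The same pattern used in the recipe: capture everything between href=" and the closing quote
all_links = re.findall('href="(.*?)"', contents)
print all_links   # ['/about/', 'http://docs.python.org/']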

The collected links are then visited recursively in the same way. Any relative URL is converted into an absolute one by prefixing it with http:// and the host. If the search depth is greater than 0, the recursion continues: search_links() is called again with the depth reduced by 1. When the depth reaches 0, the recursion stops.
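
As a rough illustration of one loop iteration (hypothetical values, mirroring the logic in the listing), the relative-to-absolute conversion and the depth check work like this:

# Hypothetical walk-through of one loop iteration in search_links()
host = "python.org"
depth = 2

for href in ["/about/", "http://docs.python.org/"]:
  # A relative URL such as /about/ becomes http://python.org/about/
  if href.startswith("/"):
    href = "http://" + host + href

  # With depth 2, the next call runs with depth 1, then 0, where recursion stops
  if depth > 0:
    print "would recurse into %s with depth %d" % (href, depth - 1)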
