Case study

To tie everything together, we'll be writing a simple link collector, which will visit a website and collect every link on every page it finds in that site. Before we start, though, we'll need some test data to work with. Simply write some HTML files that contain links to each other and to other sites on the internet, something like this:

	<html>
		<body>
			<a href="contact.html">Contact us</a>
			<a href="blog.html">Blog</a>
			<a href="http://esme.html">My Dog</a>
			<a href="http:///hobbies.html">Some hobbies</a>
			<a href="http:///contact.html">Contact AGAIN</a>
			<a href="http://www.archlinux.org/">Favorite OS</a>
		</body>
	</html>

Name one of the files index.html so it shows up first when pages are served. Make sure the other files exist, and keep things complicated so there is lots of linking between them. The examples for this chapter include a directory called case_study_serve (one of the lamest personal websites in existence!) if you'd rather not set them up yourself.

Now, start a simple web server by entering the directory containing all these files and running the following command:


python3 -m http.server

This will start a server running on port 8000; you can see the pages you made by visiting http://localhost:8000/ in your web browser.
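
If port 8000 is already in use on your machine, you can pass a different port number to the same command, for example:


python3 -m http.server 8080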

Note

I doubt anyone can get a website up and running with less work! Never let it be said, "you can't do that easily with Python".

The goal will be to pass our collector the base URL for the site (in this case: http://localhost:8000/), and have it create a list containing every unique link on the site. We'll need to take into account three types of URLs (links to external sites, which start with http://; absolute internal links, which start with a /; and relative links, for everything else). We also need to be aware that pages may link to each other in a loop; we need to be sure we don't process the same page multiple times, or it may never end. With all this uniqueness going on, it sounds like we're going to need some sets.
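
If sets are still new to you, here's a tiny standalone sketch, using made-up links, of the two properties we'll be relying on: duplicates collapse automatically, and membership tests are cheap:

	visited_links = set()
	visited_links.add("http://localhost:8000/contact.html")
	# adding the same value again has no effect; sets hold unique items
	visited_links.add("http://localhost:8000/contact.html")
	print(len(visited_links))  # 1
	# a fast membership test tells us whether we've already seen a page
	print("http://localhost:8000/blog.html" in visited_links)  # False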

Before we get into that, let's start with the basics. What code do we need to connect to a page and parse all the links from that page?

	from urllib.request import urlopen
	from urllib.parse import urlparse
	import re
	import sys
	LINK_REGEX = re.compile(
			"<a [^>]*href=['\"]([^'\"]+)['\"][^>]*>")

	class LinkCollector:
		def __init__(self, url):
			self.url = "http://" + urlparse(url).netloc

		def collect_links(self, path="/"):
			full_url = self.url + path
			page = str(urlopen(full_url).read())
			links = LINK_REGEX.findall(page)
			print(links)
			
	if __name__ == "__main__":
		LinkCollector(sys.argv[1]).collect_links()

This is a short piece of code, considering what it's doing. It connects to the server named in the argument passed on the command line, downloads the page, and extracts all the links on that page. The __init__ method uses the urlparse function to extract just the hostname and port from the URL; so even if we pass in http://localhost:8000/some/page.html, it will still operate on the top level of the host, http://localhost:8000/. This makes sense, because we want to collect all the links on the site.
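
If you're curious exactly what urlparse gives us here, a quick experiment with a hypothetical URL shows the pieces we're relying on:

	from urllib.parse import urlparse

	parts = urlparse("http://localhost:8000/some/page.html")
	print(parts.netloc)  # localhost:8000 (hostname and port)
	print(parts.path)    # /some/page.html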

The collect_links method connects to and downloads the specified page from the server, and uses a regular expression to find all the links in the page. Regular expressions are an extremely powerful string processing tool. Unfortunately, they have a steep learning curve; if you haven't used them before, I strongly recommend studying one of the many books or websites devoted entirely to the topic. If you don't think they're worth knowing, try writing the above code without them.
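
If you'd like to see what this regular expression actually matches before running the whole program, here's a quick check against a made-up anchor tag; findall returns only the text captured by the parentheses:

	import re

	LINK_REGEX = re.compile(
			"<a [^>]*href=['\"]([^'\"]+)['\"][^>]*>")
	# only the href value is captured, not the rest of the tag
	print(LINK_REGEX.findall(
			'<a class="nav" href="/contact.html">Contact us</a>'))
	# ['/contact.html']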

The example above stops in the middle of the collect_links method to print the value of links. This is a common way to test a program as we're writing it: stop and output the value to ensure it is the value we expect. Here's what it outputs for our example:

	['contact.html', 'blog.html', 'esme.html', '/hobbies.html',
	'/contact.html', 'http://www.archlinux.org/']

So now we have a collection of all the links in the first page. What should we do with it? We can't just pop the links into a set to remove duplicates, because links may be relative or absolute. For example, contact.html and /contact.html point to the same page. Instead, the first thing we should do is normalize all the links to their full URLs, including hostname and path. We can do this by adding a normalize_url method to our object:

		def normalize_url(self, path, link):
			if link.startswith("http://"):
				return link
			elif link.startswith("/"):
				return self.url + link
			else:
				return self.url + path.rpartition('/'
						)[0] + '/' + link

This method converts all URLs to complete URLs with protocol and hostname. Now the two contact pages have the same value and we can store them in a set. We'll have to modify __init__ to create the set, and collect_links to put all the links into it.
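
A quick sanity check, assuming a collector created with our local test site's base URL, shows how each of the three cases is handled (the path argument is the page the link was found on):

	collector = LinkCollector("http://localhost:8000/")
	# relative link, found on /blog.html
	print(collector.normalize_url("/blog.html", "contact.html"))
	# http://localhost:8000/contact.html
	# absolute internal link
	print(collector.normalize_url("/blog.html", "/hobbies.html"))
	# http://localhost:8000/hobbies.html
	# external link is returned unchanged
	print(collector.normalize_url("/blog.html", "http://www.archlinux.org/"))
	# http://www.archlinux.org/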

Then we'll have to visit all the non-external links and collect them too. But wait a minute, if we do that, how do we keep from revisiting a link when we encounter the same page twice? It looks like we're actually going to need two sets, a set of collected links, and a set of visited links. This suggests that we were wise to choose a set to represent our data; we know that sets are most useful when we're manipulating more than one of them. Let's set these up:

	class LinkCollector:
		def __init__(self, url):
			self.url = "http://+" + urlparse(url).netloc
			self.collected_links = set()
			self.visited_links = set()
	
		def collect_links(self, path="/"):
			full_url = self.url + path
			self.visited_links.add(full_url)
			page = str(urlopen(full_url).read())
			links = LINK_REGEX.findall(page)
			links = {self.normalize_url(path, link
				) for link in links}
			self.collected_links = links.union(
					self.collected_links)
			unvisited_links = links.difference(
					self.visited_links)
			print(links, self.visited_links,
					self.collected_links, unvisited_links)
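
If union and difference are unfamiliar, here is a minimal standalone sketch, using a couple of made-up links, of the set operations this method relies on:

	links = {"http://localhost:8000/blog.html",
			"http://localhost:8000/contact.html"}
	visited_links = {"http://localhost:8000/contact.html"}
	collected_links = set()
	# union: everything we have seen so far, with duplicates collapsed
	collected_links = links.union(collected_links)
	# difference: only the pages we haven't crawled yet
	unvisited_links = links.difference(visited_links)
	print(unvisited_links)  # {'http://localhost:8000/blog.html'}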

The line that creates the normalized set of links uses a set comprehension, no different from a list comprehension, except that it creates a set out of the values. We'll be covering these babies in detail in the next chapter. Once again, the method stops to print out the current values, so we can verify we don't have our sets confused, and that difference really was the method we wanted to call to collect unvisited_links. We can then add a few lines of code that loop over all the unvisited links and add them to the collection as well:

			for link in unvisited_links:
				if link.startswith(self.url):
					self.collect_links(urlparse(link).path)

The if statement ensures that we are only collecting links from the one website; we don't want to go off and collect all the links from all the pages on the internet (unless we're Google or the Internet Archive!). If we modify the main code at the bottom of the program to output the collected links, we can see it seems to have collected them all:

	if __name__ == "__main__":
		collector = LinkCollector(sys.argv[1])
		collector.collect_links()
		for link in collector.collected_links:
			print(link)

It displays all the links we've collected, and only once, even though many of the pages in my example linked to each other multiple times:


$ python3 link_collector.py http://localhost:8000
http://localhost:8000/
http://en.wikipedia.org/wiki/Cavalier_King_Charles_Spaniel
http://masterhelenwu.com
http://archlinux.me/dusty/
http://localhost:8000/blog.html
http://ccphillips.net/
http://localhost:8000/contact.html
http://localhost:8000/taichi.html
http://www.archlinux.org/
http://localhost:8000/esme.html
http://localhost:8000/hobbies.html

Even though it collected links to external pages, it didn't go off collecting links from any of the external pages we linked to. This is a great little program if we want to collect all the links in a site. But it doesn't give me all the information I might need to build a site map; it tells me which pages I have, but it doesn't tell me which pages link to other pages. If we want to do that instead, we're going to have to make some modifications.

The first thing we should do is look at our data structures. The set of collected links doesn't work anymore; we want to know which links were linked to from which pages. The first thing we could do, then, is turn that set into a dictionary of sets for each page we visit. The dictionary keys will represent the exact same data that is currently in the set. The values will be sets of all the links on that page. Here are the changes:

	from urllib.request import urlopen
	from urllib.parse import urlparse
	import re
	import sys
	LINK_REGEX = re.compile(
			"<a [^>]*href=['\"]([^'\"]+)['\"][^>]*>")

	class LinkCollector:
		def __init__(self, url):
			self.url = "http://%s" % urlparse(url).netloc
			self.collected_links = {}
			self.visited_links = set()
	
		def collect_links(self, path="/"):
			full_url = self.url + path
			self.visited_links.add(full_url)
			page = str(urlopen(full_url).read())
			links = LINK_REGEX.findall(page)
			links = {self.normalize_url(path, link
				) for link in links}
			self.collected_links[full_url] = links
			for link in links:
				self.collected_links.setdefault(link, set())
			unvisited_links = links.difference(
					self.visited_links)
			for link in unvisited_links:
				if link.startswith(self.url):
					self.collect_links(urlparse(link).path)
					
		def normalize_url(self, path, link):
			if link.startswith("http://"):
				return link
			elif link.startswith("/"):
				return self.url + link
			else:
				return self.url + path.rpartition('/'
						)[0] + '/' + link

	if __name__ == "__main__":
		collector = LinkCollector(sys.argv[1])
		collector.collect_links()
		for link, item in collector.collected_links.items():
			print("{}: {}".format(link, item))

It is a surprisingly small change; the line that originally created a union of two sets has been replaced with three lines that update the dictionary. The first simply tells the dictionary what the collected links for that page are. The other two use setdefault to create an empty set for any link that has not already been added to the dictionary. The result is a dictionary that contains all the links as its keys; each internal page maps to the set of links found on it, while external links map to empty sets.
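
If setdefault is new to you, the following standalone sketch, again with made-up links, shows how it fills in the empty sets without disturbing keys that already have values:

	collected_links = {}
	# record the links found on the index page
	collected_links["http://localhost:8000/"] = {
			"http://localhost:8000/blog.html"}
	# setdefault only inserts the default when the key is missing...
	collected_links.setdefault("http://localhost:8000/blog.html", set())
	# ...so an existing entry is left untouched
	collected_links.setdefault("http://localhost:8000/", set())
	print(collected_links)
	# {'http://localhost:8000/': {'http://localhost:8000/blog.html'},
	#  'http://localhost:8000/blog.html': set()}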
