This script shows how to extract links from a web page using urllib2 and HTMLParser. HTMLParser is a Python 2 standard library module that parses text formatted as HTML and calls handler methods, such as handle_starttag, for each element it encounters.
You can get more information at https://docs.python.org/2/library/htmlparser.html.
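To make the callback model concrete, here is a minimal sketch of what the parser passes to handle_starttag. The TagRecorder class and the sample HTML string are hypothetical and used only for illustration; the import is wrapped so that the snippet should run on both Python 2 and Python 3.

```python
try:
    from html.parser import HTMLParser  # Python 3 location
except ImportError:
    from HTMLParser import HTMLParser   # Python 2 location, as used in this chapter


class TagRecorder(HTMLParser):
    """Hypothetical helper: records every start tag and its attributes."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) tuples
        self.events.append((tag, attrs))


recorder = TagRecorder()
recorder.feed('<p class="intro">See <a href="http://example.com/docs">the docs</a></p>')
for tag, attrs in recorder.events:
    print(tag, attrs)
```

Each recorded event pairs a lowercase tag name with its attribute list, which is why a link extractor looks for the a tag and then scans its attributes for href.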
You can find the following code in the get_links_from_url.py file:
#!/usr/bin/python
import urllib2
from HTMLParser import HTMLParser

class myParser(HTMLParser):
    # Called by the parser for every opening tag; attrs is a list of (name, value) tuples
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                # value is None when the attribute is written without a value
                if name == 'href' and value and value.find('http') >= 0:
                    print(value)

web = raw_input("Enter url: ")
url = "http://" + web
request = urllib2.Request(url)
handle = urllib2.urlopen(request)
parser = myParser()
# Assumes the page is encoded as UTF-8
parser.feed(handle.read().decode('utf-8'))
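The urllib2 and HTMLParser modules exist only in Python 2. As a rough sketch of how the same idea might look in Python 3, the following uses html.parser and urllib.request instead. The names LinkCollector, extract_links, and fetch_links are hypothetical, and the sketch differs from the listing in two ways: it collects links in a list rather than printing them, and it keeps only values that start with http rather than those that contain it anywhere.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Hypothetical Python 3 counterpart of myParser that stores links."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value and value.startswith("http"):
                self.links.append(value)


def extract_links(html):
    """Return the absolute http(s) links found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.links


def fetch_links(url):
    """Download a page and extract its links (needs network access)."""
    with urlopen(url) as response:
        # Fall back to UTF-8 when the server does not declare a charset
        charset = response.headers.get_content_charset() or "utf-8"
        return extract_links(response.read().decode(charset, errors="replace"))


# Offline demonstration on a literal string; the relative link is skipped
sample = (
    '<a href="http://example.com/">home</a> '
    '<a href="/about">about</a> '
    '<a href="https://example.org/x">x</a>'
)
print(extract_links(sample))  # ['http://example.com/', 'https://example.org/x']
```

Calling fetch_links("http://python.org") would perform the same download-and-parse steps as the listing, although the exact links returned depend on the live page.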
The following screenshot shows the get_links_from_url.py script running against the python.org domain: