Finding e-mail addresses from web pages

Instead of generating your own e-mail list, you may find that a target organization publishes some addresses on its web pages. These are likely to be of higher value than addresses you have generated yourself, because an address taken from the organization's own website is far more likely to be valid than one you have tried to guess.

Getting ready

For this recipe, you will need a list of pages you want to parse for e-mail addresses. You may want to visit the target organization's website and search for a sitemap. A sitemap can then be parsed for links to pages that exist within the website.
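
As a minimal sketch of that step, assuming the site publishes a sitemap in the standard sitemaps.org XML format, the page links can be read from its loc elements with the standard library. The function name extract_sitemap_urls and the sample sitemap below are illustrative only and are not part of the recipe:

```python
import xml.etree.ElementTree as ET

# Namespace used by standard sitemaps.org sitemap files
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(sitemap_xml):
    """Return the text of every <loc> element in a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    urls = []
    for loc in root.iter(SITEMAP_NS + "loc"):
        if loc.text:
            urls.append(loc.text.strip())
    return urls


# Hypothetical sitemap content, used only for illustration
sample_sitemap = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://www.example.com/</loc></url>
  <url><loc>http://www.example.com/contact</loc></url>
</urlset>"""

for url in extract_sitemap_urls(sample_sitemap):
    print(url)
```

Writing each returned URL on its own line of urls.txt would give the recipe the input file it expects.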

How to do it…

The following code will parse through responses from a list of URLs for instances of text that match an e-mail address format and save them to a file:

import urllib2
import re
import time
from random import randint
regex = re.compile(("([a-z0-9!#$%&'*+/=?^_'{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_'"
                    "{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    "\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

tarurl = open("urls.txt", "r")
for line in tarurl:
  output = open("emails.txt", "a")
  time.sleep(randint(10, 100))
  try: 
    url = urllib2.urlopen(line).read()
    output.write(line)
    emails = re.findall(regex, url)
    for email in emails:
      output.write(email[0]+"\n")
      print email[0]
  except:
    pass
    print "error"
  output.close()

How it works…

After importing the necessary modules, you will see the assignment of the regex variable:

regex = re.compile(("([a-z0-9!#$%&'*+/=?^_'{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_'"
                    "{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    "\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

This attempts to match an e-mail address format, for example victim@target.com, or victim at target dot com. The code then opens up a file containing the URLs:

tarurl = open("urls.txt", "r")

You might notice the use of the r parameter. This opens the file in read-only mode. The code then loops through the list of URLs. Within the loop, a file is opened to save e-mail addresses to:

output = open("emails.txt", "a")

This time, the a parameter is used. This indicates that any input to this file will be appended instead of overwriting the entire file. The script utilizes a sleep timer in order to avoid triggering any protective measures the target may have in place to prevent attacks:

time.sleep(randint(10, 100))

This timer will pause the script for a random amount of time between 10 and 100 seconds.

The use of exception handling around the urlopen() call is essential. Without it, a 404 response (HTTP Not Found) from urlopen() would raise an exception and the script would exit. With the try/except block in place, the script prints error and moves on to the next URL.
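
As a sketch of more targeted exception handling, the request can be wrapped in a small helper that catches the specific errors urlopen() raises rather than every exception. This is not the recipe's code: fetch_page and its opener parameter are hypothetical names, and the import fallback is an assumption added so that the sketch runs on both Python 2 and Python 3:

```python
try:
    # Python 2, as used in this recipe
    from urllib2 import urlopen, HTTPError, URLError
except ImportError:
    # Python 3 equivalents
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError


def fetch_page(url, opener=urlopen):
    """Return the response body, or None if the request fails."""
    try:
        return opener(url.strip()).read()
    except HTTPError as err:
        # Raised for HTTP error statuses such as 404; must be caught
        # before URLError because it is a subclass of it
        print("HTTP error %d for %s" % (err.code, url.strip()))
    except URLError as err:
        # Raised when the server cannot be reached at all
        print("Could not reach %s: %s" % (url.strip(), err.reason))
    return None
```

Catching named exceptions in this way keeps the loop running on bad URLs while still letting unrelated bugs surface, which a bare except would hide.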

If there is a valid response, the script will then store all instances of e-mail addresses in the emails variable:

emails = re.findall(regex, url)

It will then loop through the emails variable and write each item in the list to the emails.txt file and also output it to the console for confirmation:

    for email in emails:
      output.write(email[0]+"\n")
      print email[0]
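
The index email[0] is needed because the pattern contains three capture groups, so re.findall() returns one tuple per match rather than a plain string, and the first element of each tuple is the whole address. The short sketch below shows this on a made-up string; the sample text is an assumption, and raw strings are used only so that the snippet also runs cleanly on Python 3:

```python
import re

regex = re.compile((r"([a-z0-9!#$%&'*+/=?^_'{|}~-]+(?:\.[a-z0-9!#$%&'*+/=?^_'"
                    r"{|}~-]+)*(@|\sat\s)(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(\.|"
                    r"\sdot\s))+[a-z0-9](?:[a-z0-9-]*[a-z0-9])?)"))

# Hypothetical page content, used only for illustration
sample = "contact: victim@target.com or admin at target dot com"

matches = re.findall(regex, sample)
for match in matches:
    # match[0] is the full address; match[1] and match[2] are the
    # separators captured by the two inner groups
    print(match)
```

Here the first match is the tuple ('victim@target.com', '@', '.'), and only its first element is the value the recipe writes to emails.txt.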

There's more…

The regular expression used in this recipe matches two common formats used to represent e-mail addresses on the Internet. During the course of your learning and investigations, you may come across other formats that you might like to include in your matching. For more information on regular expressions in Python, you may want to read the documentation on the Python website at https://docs.python.org/2/library/re.html.
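
As one hedged illustration of extending the matching, the sketch below also accepts bracketed separators such as [at] and (dot), then rewrites each match as a standard address. The obfuscation styles chosen, the simplified pattern, and the helper name normalize are all assumptions made for illustration; they are not the recipe's regular expression:

```python
import re

# Simplified pattern (not the recipe's): a local part, an "at" separator,
# then two or more domain labels joined by a "dot" separator
AT = r"\s*(?:@|\[at\]|\(at\)|\sat\s)\s*"
DOT = r"\s*(?:\.|\[dot\]|\(dot\)|\sdot\s)\s*"
LABEL = r"[a-z0-9](?:[a-z0-9-]*[a-z0-9])?"
pattern = re.compile(r"[a-z0-9._+-]+" + AT + LABEL + "(?:" + DOT + LABEL + ")+",
                     re.IGNORECASE)


def normalize(address):
    """Rewrite an obfuscated address in the usual user@domain form."""
    address = re.sub(AT, "@", address, flags=re.IGNORECASE)
    return re.sub(DOT, ".", address, flags=re.IGNORECASE)


# Hypothetical page content, used only for illustration
sample = "Email john.smith [at] target [dot] com or sales(at)target(dot)com"
for found in pattern.findall(sample):
    print(normalize(found))
```

The looser a pattern becomes, the more ordinary phrases it will match by accident, so any additions are worth checking against real pages before relying on the results.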

See also

Refer to the recipe Generating e-mail addresses from names for more information.
