Information gathering of a website from SmartWhois by the parser BeautifulSoup

Consider a situation where you want to glean all the hyperlinks from a webpage. In this section, we will do this by programming. This could also be done manually, by viewing the page's source, but that would take a lot of time.

So let's get acquainted with a very useful parser called BeautifulSoup. It comes from a third-party source and is very easy to work with. In our code, we will use version 4 of BeautifulSoup; if it is not already installed, you can install it with pip install beautifulsoup4.

Our requirement is to obtain the title of the HTML page as well as its hyperlinks.

The code is as follows:

import urllib
from bs4 import BeautifulSoup
url = raw_input("Enter the URL ")
ht = urllib.urlopen(url)
html_page = ht.read()
b_object = BeautifulSoup(html_page, "html.parser")
print b_object.title          # the title with its tags
print b_object.title.text     # the title text only
for link in b_object.find_all('a'):
  print link.get('href')      # the value of each href attribute

The from bs4 import BeautifulSoup statement is used to import the BeautifulSoup library. The url variable stores the URL of the website, urllib.urlopen(url) opens the webpage, and ht.read() reads its contents. The html_page = ht.read() statement assigns the webpage to the html_page variable; we have used this variable for better understanding. The b_object = BeautifulSoup(html_page, "html.parser") statement creates a BeautifulSoup object named b_object. The next two statements print the title, first with its tags and then without them. The b_object.find_all('a') statement collects all the hyperlink (<a>) tags, and the line inside the loop prints only the hyperlink part of each one. The output of the program will clear all doubts, and is shown in the following screenshot:

[Screenshot: All the hyperlinks and a title]

Now, you have seen how you can obtain the hyperlinks and a title by using the BeautifulSoup parser.
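
Note that link.get('href') returns None for <a> tags that have no href attribute, and many of the links you collect will be relative. The following is a minimal sketch, not part of the original recipe, that skips empty links and resolves relative ones against the page URL by using the urlparse module from the Python 2 standard library:

import urllib
import urlparse
from bs4 import BeautifulSoup

url = raw_input("Enter the URL ")
html_page = urllib.urlopen(url).read()
b_object = BeautifulSoup(html_page, "html.parser")
for link in b_object.find_all('a'):
  href = link.get('href')
  if href:                              # skip <a> tags without an href
    print urlparse.urljoin(url, href)   # resolve relative links to absolute URLs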

In the next code, we will obtain a particular field with the help of BeautifulSoup:

import urllib
from bs4 import BeautifulSoup
url = "https://www.hackthissite.org"
ht = urllib.urlopen(url)
html_page = ht.read()
b_object = BeautifulSoup(html_page, "html.parser")
data = b_object.find('div', id='notice')   # find the <div id="notice"> element
print data

The preceding code takes https://www.hackthissite.org as the url, and we are interested in finding the <div id="notice"> element, as shown in the following screenshot:

[Screenshot: The div ID information]

Now let's see the output of the preceding code in the following screenshot:

[Screenshot: The output of the <div id="notice"> code]
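
If you want only the text inside the element, and not the surrounding HTML, bs4 offers the .text attribute. The following is a minimal sketch along those lines; note that find() returns None when no matching tag exists, so the result is checked before use:

import urllib
from bs4 import BeautifulSoup

url = "https://www.hackthissite.org"
html_page = urllib.urlopen(url).read()
b_object = BeautifulSoup(html_page, "html.parser")
data = b_object.find('div', id='notice')
if data is not None:    # find() returns None if the tag is absent
  print data.text       # only the text, without the HTML tags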

Consider another example in which you want to gather information about a website. In the process of information gathering for a particular website, you have probably used http://smartwhois.com/. By using SmartWhois, you can obtain useful information about any website, such as the Registrant Name, Registrant Organization, Name Server, and so on.

In the following code, you will see how you can obtain the information from SmartWhois. In the quest of information gathering, I have studied SmartWhois and found that its <div class="whois"> tag holds the relevant information. The following program gathers the information from this tag and saves it in a file in a readable form:

import urllib
import re
from bs4 import BeautifulSoup

domain = raw_input("Enter the domain name ")
url = "http://smartwhois.com/whois/" + str(domain)
ht = urllib.urlopen(url)
html_page = ht.read()
b_object = BeautifulSoup(html_page, "html.parser")
file_text = open("who.txt", 'a')    # open the results file in append mode
who_is = b_object.body.find('div', attrs={'class': 'whois'})
who_is1 = str(who_is)

# find where the "Domain Name:" string starts
for match in re.finditer("Domain Name:", who_is1):
  s = match.start()

lines_raw = who_is1[s:]                  # keep only the text after "Domain Name:"
lines = lines_raw.split("<br/>", 150)    # split on the <br/> tags
i = 0
for line in lines:
  file_text.writelines(line)
  file_text.writelines("\n")
  print line
  i = i + 1
  if i == 17:                            # stop after the first 17 lines
    break
file_text.writelines("-" * 50)           # separator line between entries
file_text.writelines("\n")
file_text.close()

Let's analyze the file_text = open("who.txt", 'a') statement, since I hope you followed the previous code. The file_text file object opens the who.txt file in append mode to store the results. The who_is = b_object.body.find('div', attrs={'class' : 'whois'}) statement produces the desired result. However, who_is does not contain the data in string form. If we used b_object.body.find('div', attrs={'class' : 'whois'}).text, it would output only the text, without the tags, but that text runs together and becomes very difficult to read. The who_is1 = str(who_is) statement therefore converts the information into string form:

for match in re.finditer("Domain Name:", who_is1):
  s = match.start()

The preceding code finds the starting point of the "Domain Name:" string, because our valuable information comes after this string. The lines_raw variable contains the information that follows the "Domain Name:" string. The lines = lines_raw.split("<br/>", 150) statement splits the text using <br/> as the delimiter, and the lines variable becomes a list. This means that wherever a break (<br/>) exists in the HTML page, the statement starts a new line, and all the lines are stored in the list named lines. The i variable is initialized to 0 and is used to count the lines, so that only the first 17 lines are printed and saved. The piece of code that follows saves the results to a file on the hard disk as well as displaying them on the screen.
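
To see the mechanics of re.finditer() and split() in isolation, consider this small, self-contained sketch with a made-up HTML fragment of the same shape as the whois div:

import re

# a made-up fragment; the real page content will differ
who_is1 = 'Header<br/>Domain Name: example.com<br/>Name Server: ns1.example.com<br/>'

for match in re.finditer("Domain Name:", who_is1):
  s = match.start()      # the offset at which "Domain Name:" begins

lines = who_is1[s:].split("<br/>", 150)
print lines
# prints: ['Domain Name: example.com', 'Name Server: ns1.example.com', '']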

The screen output is as follows:

[Screenshot: Information provided by SmartWhois]

Now, let's check out the output of the code in the files:

[Screenshot: The code's output in the files]

Note

You have seen how to obtain hyperlinks from a webpage and, by using the previous code, you can get the information about the hyperlinks. Don't stop here; instead, try to read more about BeautifulSoup at http://www.crummy.com/software/BeautifulSoup/bs4/doc/.

Now, let's go through an exercise that takes domain names in a list as an input and writes the results of the findings in a single file.
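
A minimal starting point for this exercise might look like the following sketch. It assumes a plain-text file named domains.txt (a name chosen here for illustration), with one domain name per line, and reuses the logic of the preceding program:

import urllib
import re
from bs4 import BeautifulSoup

# domains.txt is an assumed input file, one domain name per line
domains = [d.strip() for d in open("domains.txt") if d.strip()]
file_text = open("who.txt", 'a')    # a single results file for all domains

for domain in domains:
  url = "http://smartwhois.com/whois/" + domain
  html_page = urllib.urlopen(url).read()
  b_object = BeautifulSoup(html_page, "html.parser")
  who_is1 = str(b_object.body.find('div', attrs={'class': 'whois'}))
  s = 0
  for match in re.finditer("Domain Name:", who_is1):
    s = match.start()    # falls back to 0 if the string is not found
  lines = who_is1[s:].split("<br/>", 150)
  for line in lines[:17]:            # the first 17 lines, as before
    file_text.writelines(line)
    file_text.writelines("\n")
  file_text.writelines("-" * 50)     # separator between domains
  file_text.writelines("\n")
file_text.close()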
