Crawler code – crawler.py

The following code snippet represents the constructor of the crawler class. It initializes all the relevant instance variables. logger is one of the custom classes written to log debug messages, so that if any error occurs while the crawler is executing (it is spawned as a subprocess and runs in the background), it can be debugged:
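
To make the description concrete, here is a minimal sketch of what such a constructor could look like. The attribute names are assumptions based on the parameters discussed later in this section, and the standard logging module stands in for the custom logger class; the actual listing may differ:

import logging

class Crawler:
    def __init__(self, project, start_url, login_url=None,
                 username_field=None, password_field=None,
                 username=None, password=None, auth=False):
        # crawling scope and authentication parameters supplied by the user
        self.project = project
        self.start_url = start_url
        self.login_url = login_url
        self.username_field = username_field
        self.password_field = password_field
        self.username = username
        self.password = password
        # False for the first (unauthenticated) iteration, True for the second
        self.auth = auth
        # logger used to debug the crawler while it runs as a background subprocess
        self.logger = logging.getLogger("crawler")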

Let's now take a look at the start() method of the crawler, from where the crawling actually begins:

It can be seen in section (1), which will be true for the second iteration (auth=True), that we make an HTTP GET request to whichever URL is supplied as the login URL by the user. We use the get() method of the Python requests library. When we make the GET request to the URL, the response content (the web page) is placed in the xx variable.

Now, as highlighted in section (2), we extract the content of the web page using xx.content and pass the extracted content to an instance of the BeautifulSoup module. BeautifulSoup is an excellent Python utility that makes parsing web pages very simple. From here on, we will refer to BeautifulSoup by the alias BS.

Section (3) uses the s.findAll('form') method from the BS parsing library. The findAll() method takes the name of the HTML element to be searched for as a string argument, and it returns a list containing the search matches. If a web page contains ten forms, s.findAll('form') will return a list containing the data for those ten forms. It will look as follows: [<Form1 data>, <Form2 data>, <Form3 data>, ..., <Form10 data>].
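
Taken together, sections (1) to (3) boil down to something like the following sketch. The variable names xx and s mirror those used in the description, the login URL is a placeholder, and the exact calls in the actual code may differ:

import requests
from bs4 import BeautifulSoup

login_url = "http://127.0.0.1/my_app/login"   # placeholder login URL

# section (1): fetch the login page supplied by the user
xx = requests.get(login_url)

# section (2): parse the response content with BeautifulSoup (BS)
s = BeautifulSoup(xx.content, "html.parser")

# section (3): collect every <form> element present on the page
forms = s.findAll("form")   # e.g. [<Form1 data>, <Form2 data>, ...]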

In section (4) of the code, we are iterating over the list of forms that was returned previously. The objective here is to identify the login form among the multiple input forms that might be present on the web page. We also need to figure out the action URL of the login form, as that will be the place where we POST the valid credentials and set up a valid session, as shown in the following screenshots:

Let's try to break down the preceding incomplete code to understand what has happened so far. Before we move on, however, let's take a look at the user interface from which the crawling parameters are taken. This will give us a good idea of the prerequisites and will help us to understand the code better. The following screen shows a representation of the user input parameters:

As mentioned earlier, the crawler works in two iterations. In the first iteration, it tries to crawl the web application without authentication, and in the second iteration, it crawls the application with authentication. The authentication information is held in the self.auth variable, which by default is initialized to False. Therefore, the first iteration will always be without authentication.

It should be noted that the purpose of the code mentioned before, which falls under the if self.auth == True section, is to identify the login form from the login web page/URL. Once the login form is identified, the code tries to identify all the input fields of that form. It then formulates a data payload with legitimate user credentials to submit the login form. Once submitted, a valid user session will be returned and saved. That session will be used for the second iteration of crawling, which is authentication-based.

In section (5) of the code, we are invoking the self.process_form_action() method. Before that, we extract the action URL of the form, so that we know where the data is to be posted. This method also combines the relative action URL with the base URL of the application, so that we end up sending our request to a valid endpoint URL. For example, if the form action is pointing to a location called /login, and the current URL is http://127.0.0.1/my_app, this method will carry out the following tasks (a sketch of such a method follows the list):

  1. Check whether the URL is already added to a list of URLs that the crawler is supposed to visit
  2. Combine the action URL with the base context URL and return http://127.0.0.1/my_app/login
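
A minimal sketch of such a method is given below. It assumes a simple string concatenation of the base context URL and the relative action, which matches the example above, and it relies on the check_and_add_to_visit() method described next; the actual implementation may differ:

def process_form_action(self, action, base_url):
    # combine the relative action (/login) with the base context URL
    # (http://127.0.0.1/my_app) to get http://127.0.0.1/my_app/login
    full_url = base_url.rstrip("/") + "/" + action.lstrip("/")
    # queue the URL for crawling if it has not been seen already
    self.check_and_add_to_visit(full_url)
    return full_url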

The definition of this method is shown here:

As can be seen, the first thing that is invoked within this method is another method, self.check_and_add_to_visit. This method checks whether the URL in question has already been added to the list of URLs that the crawler is supposed to crawl. If it has already been added, no action is taken. If not, the crawler adds the URL so that it can visit it later. This method also performs a number of other checks, such as whether the URL is in scope, whether the protocol is one of those permitted, and so on. The definition of this method is shown here:
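
The following sketch captures the checks described above. It assumes a Django Page model with url, project, visited, and auth_visited fields and a self.scope attribute holding the target host; these names, and the import path, are assumptions rather than the actual listing:

from urllib.parse import urlparse

from crawler.models import Page   # hypothetical import path for the Page model

def check_and_add_to_visit(self, url):
    parsed = urlparse(url)
    # only crawl permitted protocols
    if parsed.scheme not in ("http", "https"):
        return
    # only crawl URLs that are in scope for the target application
    if self.scope not in parsed.netloc:
        return
    # queue the URL only if it has not been seen for this project and auth mode
    if not self.already_seen(url):
        Page.objects.create(url=url,
                            project=self.project,
                            visited=False,
                            auth_visited=self.auth)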

As can be seen, if self.already_seen() under line 158 returns False, then a row is created in the backend database Page table under the current project. The row is again created via the Django ORM (model abstraction). The self.already_seen() method simply checks the Page table to see whether or not the URL in question has already been visited by the crawler under the current project name and the current authentication mode. This is verified with the visited flag:

Page.objects.filter() is equivalent to select * from page where auth_visited=True/False and project='current_project' and URL='current_url'.
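
In code, that check might look like the following sketch, with exists() used so that only a boolean comes back from the database (the filter fields are assumptions based on the equivalence above):

def already_seen(self, url):
    # equivalent to:
    #   select * from page where auth_visited = True/False
    #   and project = 'current_project' and url = 'current_url'
    return Page.objects.filter(url=url,
                               project=self.project,
                               auth_visited=self.auth).exists()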

In section (6) of the code, we are passing the content of the current form to a newly created instance of the BS parsing module. The reason for this is that we will parse and extract all the input fields from the form that we are currently processing. Once the input fields are extracted, we will compare the name of each input field with the names supplied by the user under username_field and password_field. The reason why we do this is that there might be multiple forms on the login page, such as a search form, a signup form, a feedback form, and a login form, and we need to be able to identify which of these is the login form. As we are asking the user to provide the field name for the login username/email and the field name for the login password, our approach will be to extract the input fields from all forms and compare them with what the user has supplied. If we get a match for both fields within a form, we set flag1 and flag2 to True, and it is very likely that this is our login form. This is the form in which we will place the user-supplied login credentials under the appropriate fields and then submit the form at the action URL, as specified under the action parameter. This logic is handled by sections (7), (8), (9), (10), (11), (12), (13), and (14).
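
The following sketch illustrates the idea behind sections (6) to (14), continuing from the earlier sketch in which forms holds all the forms found on the login page; the variable names are assumptions and the actual code is more elaborate:

probable_forms = []
for form in forms:
    # section (6): parse the current form with a fresh BS instance
    f = BeautifulSoup(str(form), "html.parser")
    flag1 = flag2 = False
    for inp in f.findAll("input"):
        name = inp.get("name", "")
        if name == self.username_field:
            flag1 = True    # login username/email field matched
        if name == self.password_field:
            flag2 = True    # login password field matched
    if flag1 and flag2:
        # both user-supplied field names matched: probable login form
        probable_forms.append(form)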

There is another important consideration. There might be many occasions in which the login web page also has a signup form on it. Let's suppose that the user has specified username and user_pass as the field names for the username and password parameters, so that our code can submit proper credentials under these field names and obtain a valid session. However, the signup form may also contain two fields called username and user_pass, along with a few additional fields such as Address, Phone, Email, and so on. As discussed earlier, our code identifies the login form by these supplied field names alone, and so it may end up treating the signup form as the login form. In order to address this, we store all the matching forms in lists. When all the forms have been parsed and stored, we may have two probable candidates for the login form. We compare the content length of both, and the one with the shorter length is taken as the login form, because a signup form will usually have more fields than a login form. This condition is handled by section (15) of the code, which enumerates over all the probable forms and finally places the smallest one at index 0 of the payloadforms[] list and the actionform[] list.
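
A simplified sketch of the section (15) logic follows; in the actual code the sorted candidates are kept in the payloadforms[] and actionform[] lists, whereas here we only sort the candidates and compute their resolved action URLs:

# sort the candidate forms by content length; a signup form usually has more
# fields than a login form, so the shortest candidate ends up at index 0
probable_forms.sort(key=lambda form: len(str(form)))
action_forms = [self.process_form_action(form.get("action", ""), xx.url)
                for form in probable_forms]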

Finally, in line 448, we post the supplied user credentials to the parsed login form. If the credentials are correct, a valid session will be returned and held in the session variable, ss. The request is made by invoking the POST method as follows: ss.post(action_forms[0], data=payload, cookies=cookie).
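
Assuming ss is a requests session object and that the payload is built from the user-supplied credentials, the login submission might look like the following sketch (cookie is assumed to be the cookie jar collected when the login page was fetched):

import requests

ss = requests.Session()
# place the user-supplied credentials under the user-supplied field names
payload = {self.username_field: self.username,
           self.password_field: self.password}
cookie = xx.cookies   # cookies collected when the login page was fetched
ss.post(action_forms[0], data=payload, cookies=cookie)
# if the credentials were valid, ss now holds an authenticated session that is
# reused for the second (auth=True) crawling iteration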

The user provides the start URL of the web application that is to be crawled. Section (16) takes that start URL and begins the crawling process. If there are multiple start URLs, they should be comma separated. The start URLs are added to the Page() database table as URLs that the crawler is supposed to visit:
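
Section (16) might therefore boil down to something like the following sketch, with the comma-separated start URLs split and queued individually:

# queue every user-supplied start URL for crawling
for start_url in self.start_url.split(","):
    self.check_and_add_to_visit(start_url.strip())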

In section (17), there is a crawling loop that invokes a there_are_pages_to_crawl() method, which checks the backend Page() database table to see whether there are any pages for the current project with the visited flag set to False. If there are pages in the table that have not been visited by the crawler, this method will return True. As we just added the start page to the Page table in section (16), this method will return True for the start page. The idea is to make a GET request on that page and extract all further links, forms, or URLs, and keep on adding them to the Page table. The loop will continue to execute as long as there are unvisited pages. Once a page is completely parsed and all links are extracted, the visited flag is set to True for that page or URL so that it will not be extracted to be crawled again. The definition of this method is shown here:
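
A minimal sketch of such a check, reusing the Page model fields assumed earlier:

def there_are_pages_to_crawl(self):
    # True while at least one page queued for this project and auth mode
    # still has its visited flag set to False
    return Page.objects.filter(project=self.project,
                               auth_visited=self.auth,
                               visited=False).exists()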

In section (18), we get the unvisited page from the backend Page table by invoking the get_a_page_to_visit() method, the definition of which is given here:
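
Again as a sketch under the same assumptions, this could simply return the first unvisited row:

def get_a_page_to_visit(self):
    # return one queued page (or None) that has not been visited yet
    return Page.objects.filter(project=self.project,
                               auth_visited=self.auth,
                               visited=False).first()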

In section (19), we make an HTTP GET request to this page along with the session cookies held in ss, as section (19) belongs to the iteration that deals with auth=True. Once the request is made, the response of the page is processed further to extract more links. Before processing the response, we check the response codes produced by the application.

There are occasions where certain pages will return a redirection (3XX response codes), and we need to save the URLs and form content appropriately. Let's say that we make a GET request to page X and the response contains three forms. Ideally, we will save those forms with the URL marked as X. However, suppose that upon making a GET request on page X, we get a 302 redirection to page Y, and the response HTML actually belongs to page Y, where the redirection pointed. In that case, we would end up saving the response content of the three forms mapped to URL X, which is not correct. Therefore, in sections (20) and (21), we handle these redirections and map the response content to the appropriate URL:
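
One way to do this, as a sketch: the requests library records any intermediate 3XX hops in response.history, and response.url holds the final URL after redirection, so the parsed content can be mapped against the URL it really came from (page is assumed to be the row returned by get_a_page_to_visit()):

response = ss.get(page.url)   # section (19): GET with the authenticated session
if response.history:          # sections (20)/(21): a 3XX redirection happened
    final_url = response.url  # map the parsed content against the redirected URL
else:
    final_url = page.url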

Sections (22) and (23) do exactly what sections (19), (20), and (21) do, but for the iteration where authentication = False.

If any exceptions are encountered while processing the current page, section (24) handles those exceptions, marks the visited flag of the current page as True, and puts an appropriate exception message in the database.

If everything works smoothly, then control passes on to section (26), from where the processing of the HTML response content obtained from the GET request on the current page being visited begins. The objective of this processing is to do the following:

  • Extract all further links from the HTML response (a href, base, frame, and iframe tags)
  • Extract all forms from the HTML response
  • Extract all form fields from the HTML response

Section (26) of the code extracts all the links and URLs that are present under the base tag (if any) of the returned HTML response content.

Sections (27) and (28) parse the content with the BS parsing module to extract all anchor tags and their href locations. Once extracted, they are passed on to be added to the Pages database table for the crawler to visit later. It must be noted that the links are added only after checking that they don't already exist under the current project and current authentication mode.

Section (29) parses the content with the BS parsing module to extract all iframe tags and their src locations. Once extracted, they are passed to be added to the Pages database table for the crawler to visit later. Section (30) does the same for frame tags:

Section (31) parses the content with the BS parsing module to extract all option tags and checks whether they have a link under the value attribute. Once extracted, they are passed to be added to the Pages database table for the crawler to visit later.
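
A condensed sketch of the extraction performed in sections (26) to (31), assuming soup holds the parsed response content and final_url is the URL resolved after any redirection in the previous sketch:

from urllib.parse import urljoin
from bs4 import BeautifulSoup

soup = BeautifulSoup(response.content, "html.parser")

# base, anchor, iframe, frame and option tags can all point to further pages
for tag, attr in (("base", "href"), ("a", "href"),
                  ("iframe", "src"), ("frame", "src"), ("option", "value")):
    for element in soup.findAll(tag):
        link = element.get(attr)
        if link:
            # resolve relative links against the page that was just fetched
            self.check_and_add_to_visit(urljoin(final_url, link))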

Section (32) of the code tries to explore all other options to extract any missed links from a web page. The following is the code snippet that checks for other possibilities:

Sections (33) and (34) extract all the forms from the current HTML response. If any forms are identified, various attributes of the form tag, such as action or method, are extracted and saved under local variables:

If any HTML form is identified, the next task is to extract all the input fields, text areas, select tags, option fields, hidden fields, and submit buttons. This is carried out by sections (35), (36), (37), (38), and (39). Finally, all the extracted fields are placed under an input_field_list variable in a comma-separated manner. For example, let's say a form, Form1, is identified with the following fields:

  • <input type="text" name="search">
  • <input type="hidden" name="secret">
  • <input type="submit" name="submit_button">

All of these are extracted as "Form1" : input_field_list = "search,text,secret,hidden,submit_button,submit".
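
The following sketch shows how sections (33) to (39) might build the comma-separated input_field_list for each discovered form; the exact attributes extracted and the formatting are assumptions:

for form in soup.findAll("form"):
    action = form.get("action", "")
    method = form.get("method", "get")
    fields = []
    # walk the input fields, text areas and select/option tags of this form
    for field in form.findAll(["input", "textarea", "select"]):
        name = field.get("name", "")
        field_type = field.get("type", field.name)   # e.g. text, hidden, submit
        fields.extend([name, field_type])
    input_field_list = ",".join(fields)
    # e.g. "search,text,secret,hidden,submit_button,submit"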

Section (40) of the code checks whether any form with exactly the same content has already been saved in the database table for the current project and current auth_mode. If no such form exists, the form is saved in the Form table, again with the help of the Django ORM (models) wrapper:
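
A sketch of this check, continuing inside the per-form loop from the previous sketch and assuming a Form model whose rows record the project, the auth mode, the form action, and the input_field_list built above (the field names are assumptions):

# save the form only if an identical one has not been stored already
if not Form.objects.filter(project=self.project,
                           auth_mode=self.auth,
                           form_action=action,
                           input_field_list=input_field_list).exists():
    Form.objects.create(project=self.project,
                        auth_mode=self.auth,
                        form_action=action,
                        input_field_list=input_field_list)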

Section (41) of the code then saves these unique forms in a JSON file named after the current project. This file can then be parsed with a simple Python program to list the various forms and input fields present in the web application that we crawled. Additionally, at the end of the code, we have a small snippet that places all discovered/crawled pages in a text file that we can refer to later. The snippet is shown here:

f = open("results/Pages_" + str(self.project.project_name), "w")
for pg in page_list:
    f.write(pg + " ")
f.close()

Section (42) of the code updates the visited flag of the web page whose content we just parsed and marks that as visited for the current auth mode. If any exceptions occur during saving, these are handled by section (43), which again marks the visited flag as true, but additionally adds an exception message.
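
In terms of the Page model assumed earlier, section (42) amounts to roughly the following sketch; section (43) would do the same inside an exception handler and additionally record the error message in an assumed field:

page.visited = True   # mark the page as visited for the current auth mode
page.save()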

After sections (42) and (43), the control goes back again to section (17) of the code. The next page that is yet to be visited by the crawler is taken from the database and all the operations are repeated. This continues until all web pages have been visited by the crawler.

Finally, we check whether the current iteration is with or without authentication in section (44). If it was without authentication, then the start() method of the crawler is invoked with the auth flag set to True.

After both iterations have finished successfully, the web application is assumed to have been crawled completely, and the project status is marked as Finished by section (45) of the code.
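
Sections (44) and (45) roughly correspond to the following control flow sketch; the status value and attribute names are assumptions:

if not self.auth:
    # the unauthenticated iteration has finished: run again with authentication
    self.auth = True
    self.start()
else:
    # both iterations have finished: mark the project as crawled completely
    self.project.status = "Finished"
    self.project.save()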
