Downloading the page for offline analysis with HTTrack

As stated on HTTrack's official website (http://www.httrack.com):

"It allows you to download a World Wide Web site from the Internet to a local directory, building recursively all directories, getting HTML, images, and other files from the server to your computer."

We will be using HTTrack in this recipe to download the whole content of an application's site.

Getting ready

HTTrack is not installed by default in Kali Linux, so we will need to install it, as shown:

apt-get update
apt-get install httrack
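
To verify that the installation succeeded, we can print HTTrack's help banner; its first line shows the installed version (exact output varies by release):

httrack --help | head -n 1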

How to do it...

  1. Our first step will be to create a directory to store the downloaded site and then enter it:
    mkdir bodgeit_httrack
    cd bodgeit_httrack
    
  2. The simplest way to use HTTrack is to pass it the URL of the site we want to download:
    httrack http://192.168.56.102/bodgeit/
    

    It is important to include the trailing "/"; if it is omitted, HTTrack will return a 404 error because there is no "bodgeit" file in the root of the server, as the comparison below shows.
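    To illustrate (the IP address is this recipe's test server; substitute your own target):

    # omitting the slash asks the server for a file named "bodgeit" and fails with 404
    httrack http://192.168.56.102/bodgeit
    # the trailing slash requests the directory index instead, which succeeds
    httrack http://192.168.56.102/bodgeit/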

  3. Now, if we go to file:///root/MyCookbook/test/bodgeit_httrack/index.html (or the path you selected in your test environment), we will see that we can browse the whole site offline.

How it works...

HTTrack creates a full static copy of the site, which means that all dynamic content, such as responses to user inputs, won't be available. Inside the folder where we downloaded the site, we can see the following files and directories (a sample listing follows):

  • A directory named after the server's name or address, which contains all the files that were downloaded.
  • A cookies.txt file, which contains the cookies information used to download the site.
  • An hts-cache directory, which contains the list of files detected and processed by the crawler.
  • An hts-log.txt file, which contains the errors, warnings, and other information reported while crawling and downloading the site.
  • An index.html file that redirects to the copy of the original index file located in the server-name directory.
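
As a quick check, listing the directory we created earlier should show roughly the following entries (a sketch; the server directory will be named after your target's address, and HTTrack may add a few small helper files of its own):

ls bodgeit_httrack
192.168.56.102/  cookies.txt  hts-cache/  hts-log.txt  index.html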

There's more...

HTTrack also has an extensive collection of options that allow us to customize its behavior to better fit our needs. The following are some useful modifiers to consider; a combined example follows the list:

  • -rN: Sets the depth to N levels of links to follow
  • -%eN: Sets the maximum depth for external links
  • +[pattern]: Tells HTTrack to whitelist all URLs matching [pattern], for example, +*google.com/*
  • -[pattern]: Tells HTTrack to blacklist (omit from downloading) all links matching the pattern
  • -F [user-agent]: This option allows us to define the user agent (browser identifier) we want to use to download the site
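
For example, the following command combines several of these modifiers (the URL matches this recipe's test server, and the user-agent string is just an illustrative value):

httrack http://192.168.56.102/bodgeit/ -r3 -%e0 "+*192.168.56.102/bodgeit/*" -F "Mozilla/5.0 (X11; Linux x86_64)"

This mirrors the site down to three link levels, follows no external links, downloads only URLs under /bodgeit/, and identifies the crawler as a generic Linux browser.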