Chapter 3

Google Hacking Basics

Abstract

This chapter will cover Google hacking basics. Subjects will include using caches for anonymity, directory listings, and traversal techniques.

Keywords

Google cache
anonymity
directories
traversal
intitle:index.of

Introduction

A fairly large portion of this book is dedicated to the techniques the “bad guys” will use to locate sensitive information. We present this information to help you become better informed about their motives so that you can protect yourself and perhaps your customers. We’ve already looked at some of the benign basic searching techniques that are foundational for any Google user who wants to break through the barrier of the basics and charge ahead to the next level: the ways of the Google hacker. Now we’ll start looking at more nefarious uses of Google that hackers are likely to employ.
First, we’ll talk about Google’s cache. If you haven’t already experimented with the cache, you’re missing out. I suggest you at least click a few various cached links from the Google search results page before reading further. As any decent Google hacker will tell you, there’s a certain anonymity that comes with browsing the cached version of a page. That anonymity goes only so far, and there are some limitations to the coverage it provides. Google can, however, very nicely veil your crawling activities to the point that the target Web site might not even get a single packet of data from you as you cruise the Web site. We’ll show you how it’s done.
Next, we’ll talk about directory listings. These “ugly” Web pages are chock full of information, and their mere existence serves as the basis for some of the more advanced attack searches that we’ll discuss in later chapters.
To round things out, we’ll take a look at a technique that has come to be known as traversing: the expansion of a search to try and gather more information. We’ll look at directory traversal, number range expansion, and extension trolling, all of which are techniques that should be second nature to any decent hacker – and the good guys that defend against them.

Anonymity with caches

Google’s cache feature is truly an amazing thing. The simple fact is that if Google crawls a page or document, you can almost always count on getting a copy of it, even if the original source has since dried up and blown away. Of course the down side of this is that hackers can get a copy of your sensitive data even if you’ve pulled the plug on that pesky Web server. Another down side of the cache is that the bad guys can crawl your entire Web site (including the areas you “forgot” about) without even sending a single packet to your server. If your Web server doesn’t get so much as a packet, it can’t write anything to the log files. (You are logging your Web connections, aren’t you?) If there’s nothing in the log files, you might not have any idea that your sensitive data has been carried away. It’s sad that we even have to think in these terms, but untold megabytes, gigabytes, and even terabytes of sensitive data leak from Web servers every day. Understanding how hackers can mount an anonymous attack on your sensitive data via Google’s cache is of utmost importance.
Google grabs a copy of most Web data that it crawls. There are exceptions, and this behavior is preventable, as we’ll discuss later, but the vast majority of the data Google crawls is copied and filed away, accessible via the cached link on the search page. We need to examine some subtleties to Google’s cached document banner. The banner shown in Figure 3.1 was gathered from www.phrack.org.
Figure 3.1 
If you’ve gotten so familiar with the cache banner that you just blow right past it, slow down a bit and actually read it. The cache banner in Figure 3.2 notes, “This cached page may reference images which are no longer available.” This message is easy to miss, but it provides an important clue about what Google’s doing behind the scenes.
Figure 3.2 
To get a better idea of what’s happening, let’s take a look at a snippet of tcpdump output gathered while browsing this cached page. To capture this data, tcpdump is simply run as tcpdump -n. Your installation or implementation of tcpdump might require you to also set a listening interface with the -i switch.
Let’s take apart this output a bit, starting at the bottom. This is a port 80 (Web) conversation between our browser (10.9.5) and a Google server (66.249.83.83). This is the type of traffic we should expect from any transaction with Google, but the beginning of the capture reveals another port 80 (Web) connection to 200.199.20.162. This is not a Google server, and an nslookup of that Internet Protocol (IP) shows that it is the www.phrack.org Web server. The connection to this server can be explained by rerunning tcpdump with more options specifically designed to show a few hundred bytes of the data inside the packets as well as the headers, and shift-reloading the cached page. Shift-reloading forces most browsers to contact the Web host again, not relying on any caches the browser might be using.
Lines 0x30 and 0x40 show that we are downloading (via a GET request) an image file – specifically, a JPG image from the server. Farther along in the network trace, a Host field reveals that we are talking to the www.phrack.org Web server. Because of this Host header and the fact that this packet was sent to IP address 200.199.20.162, we can safely assume that the Phrack Web server is virtually hosted on the physical server located at that address. This means that when viewing the cached copy of the Phrack Web page, we are pulling images directly from the Phrack server itself. If we were striving for anonymity by viewing the Google cached page, we just blew our cover! Furthermore, line 0x90 shows that the REFERER field was passed to the Phrack server, and that field contained a Uniform Resource Locator (URL) reference to Google’s cached copy of Phrack’s page. This means that not only were we not anonymous, but our browser informed the Phrack Web server that we were trying to view a cached version of the page! So much for anonymity.
It’s worth noting that most real hackers use proxy servers when browsing a target’s Web pages, and even their Google activities are first bounced off a proxy server. If we had used an anonymous proxy server for our testing, the Phrack Web server would have gotten our proxy server’s IP address only, not our actual IP address.
The cache banner does, however, provide an option to view only the data that Google has captured, without any external references. Despite the fact that we loaded the same page as before, this time we communicated only with a Google server (at 216.239.51.104), not any external servers. If we were to look at the URL generated by clicking the “cached text only” link in the cached page’s header, we would discover that Google appended an interesting parameter, &strip=1. This parameter forces a Google cache URL to display only cached text, avoiding any external references. This URL parameter applies only to URLs that refer to a Google cached page.
Pulling it all together, we can browse a cached page with a fair amount of anonymity without a proxy server, using a quick cut and paste and a URL modification. As an example, consider a query for site:phrack.org. Instead of clicking the cached link, we will right-click the cached link and copy the URL to the Clipboard. Browsers handle this action differently, so use whichever technique works for you to capture the URL of this link.
Once the URL is copied to the Clipboard, paste it into the address bar of your browser, and append the &strip=1 parameter to the end of the URL. The URL should now look something like http://216.239.51.104/search?q=cache:LBQZIrSkMgUJ:www.phrack.org/+site:phrack.org&hl=en&ct=clnk&cd=1&gl=us&client=safari&strip=1. Press Enter after modifying the URL to load the page, and you will be taken to the stripped version of the cached page, which has a slightly different banner.
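If you find yourself doing this cut-and-paste dance often, the append step is easy to script. The following is a minimal Python sketch; the function name is our own invention, and the cache URL shown is illustrative rather than a live link.

```python
def strip_cache_url(url):
    """Append strip=1 so Google serves only its stored text,
    avoiding any fetches from the original Web server. Leaves the
    URL alone if the parameter is already present."""
    if "strip=1" in url:
        return url
    separator = "&" if "?" in url else "?"
    return url + separator + "strip=1"

# Illustrative cache URL, not a live link:
print(strip_cache_url("http://216.239.51.104/search?q=cache:example&hl=en"))
# http://216.239.51.104/search?q=cache:example&hl=en&strip=1
```

The same logic could live in a bookmarklet or browser extension; the point is simply that the transformation is mechanical.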
Notice that the stripped cache header reads differently than the standard cache header. Instead of the “This cached page may reference images which are no longer available” line, there is a new line that reads, “Click here for the full cached version with images included.” This is an indicator that the current cached page has been stripped of external references. Unfortunately, the stripped page does not include graphics, so the page could look quite different from the original, and in some cases a stripped page might not be legible at all. If this is the case, it never hurts to load up a proxy server and hit the page, but real Google hackers “don’t need no steenkin’ proxy servers!”

Directory listings

A directory listing is a type of Web page that lists files and directories that exist on a Web server. Designed to be navigated by clicking directory links, directory listings typically have a title that describes the current directory, a list of files and directories that can be clicked, and often a footer that marks the bottom of the directory listing. Each of these elements is shown in the sample directory listing in Figure 3.3.
Figure 3.3 
Much like an FTP server, directory listings offer a no-frills, easy-install solution for granting access to files that can be stored in categorized folders. Unfortunately, directory listings have many faults, specifically:
They are not secure in and of themselves. They do not prevent users from downloading certain files or accessing certain directories. This task is often left to the protection measures built into the Web server software or third-party scripts, modules, or programs designed specifically for that purpose.
They can display information that helps an attacker learn specific technical details about the Web server.
They do not discriminate between files that are meant to be public and those that are meant to remain behind the scenes.
They are often displayed accidentally, since many Web servers display a directory listing if a top-level index file (index.htm, index.html, default.asp, and so on) is missing or invalid.
All this adds up to a deadly combination. In the following section, we’ll take a look at some of the ways Google hackers can take advantage of directory listings.

Locating directory listings

The most obvious way an attacker can abuse a directory listing is by simply finding one! Since directory listings offer “parent directory” links and allow browsing through files and folders, even the most basic attacker might soon discover that sensitive data can be found by simply locating the listings and browsing through them.
Locating directory listings with Google is fairly straightforward. Figure 3.3 shows that most directory listings begin with the phrase “Index of,” which also shows in the title. An obvious query to find this type of page might be intitle:index.of, which could find pages with the term “index of” in the title of the document. Remember that the period (.) serves as a single-character wildcard in Google. Unfortunately, this query will return a large number of false positives, such as pages with the following titles:
Index of Native American Resources on the Internet
LibDex – Worldwide index of library catalogues
Iowa State Entomology Index of Internet Resources
Judging from the titles of these documents, it is obvious that not only are these Web pages intentional, they are also not the type of directory listings we are looking for. As Ben Kenobi might say, “This is not the directory listing you’re looking for.” Several alternate queries provide more accurate results – for example, intitle:index.of “parent directory” (shown in Figure 3.4) or intitle:index.of name size. These queries indeed reveal directory listings by not only focusing on index.of in the title, but on keywords often found inside directory listings, such as parent directory, name, and size. Even judging from the summary on the search results page, you can see that these results are indeed the types of directory listings we’re looking for.
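When experimenting with refinements like these, it can help to compose them programmatically. The helper below is a hypothetical convenience of our own, not a Google feature; it simply glues refining terms onto the base intitle:index.of query.

```python
def index_of_query(*refinements):
    """Compose a directory-listing query from intitle:index.of
    plus any number of refining terms."""
    return " ".join(("intitle:index.of",) + refinements)

print(index_of_query('"parent directory"'))  # intitle:index.of "parent directory"
print(index_of_query("name", "size"))        # intitle:index.of name size
```

Feeding the result into a search URL is then a matter of URL-encoding it as the q parameter.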
Figure 3.4 

Finding specific directories

In some cases, it might be beneficial not only to look for directory listings, but also to look for directory listings that allow access to a specific directory. This is easily accomplished by adding the name of the directory to the search query. To locate “admin” directories that are accessible from directory listings, queries such as intitle:index.of.admin or intitle:index.of inurl:admin will work well.

Finding specific files

Because these types of pages list names of files and directories, it is possible to find very specific files within a directory listing. For example, to find WS_FTP log files, try a search such as intitle:index.of ws_ftp.log. This technique can be extended to just about any kind of file by keying in on the index.of in the title and the filename in the text of the Web page.
You can also use filetype and inurl to search for specific files. To search again for ws_ftp.log files, try a query like filetype:log inurl:ws_ftp.log. This technique will generally find more results than the somewhat restrictive index.of search. We’ll be working more with specific file searches throughout the book.

Server versioning

One piece of information an attacker can use to determine the best method for attacking a Web server is the exact software version. An attacker could retrieve that information by connecting directly to the Web port of that server and issuing a request for the Hypertext Transfer Protocol (HTTP) (Web) headers. It is possible, however, to retrieve similar information from Google without ever connecting to the target server. One method involves using the information provided in a directory listing.
Figure 3.5 shows the bottom portion of a typical directory listing. Notice that some directory listings provide the name of the server software as well as the version number. An adept Web administrator could fake these server tags, but most often this information is legitimate and exactly the type of information an attacker will use to refine his attack against the server.
Figure 3.5 
The Google query used to locate servers this way is simply an extension of the intitle:index.of query. The listing shown was located with a query of intitle:index.of “server at”. This query will locate all directory listings on the Web with index of in the title and server at anywhere in the text of the page. This might not seem like a very specific search, but the results are very clean and do not require further refinement.
To search for a specific server version, the intitle:index.of query can be extended even further to something like intitle:index.of “Apache/1.3.27 Server at”. This query would find pages like the one listed in Figure 3.5.
In addition to identifying the Web server version, it is also possible to determine the operating system of the server as well as modules and other software that is installed. We’ll look at more specific techniques to accomplish this later, but the server versioning technique we’ve just looked at can be extended by including more details in our query.
One convention used by these sprawling tags is the use of parentheses to offset the operating system of the server. For example, Apache/1.3.26 (Unix) indicates a UNIX-based operating system. Other more specific tags are used as well, some of which are listed below.
CentOS
Debian
Debian GNU/Linux
Fedora
FreeBSD
Linux/SUSE
Linux/SuSE
NETWARE
Red Hat
Ubuntu
UNIX
Win32
An attacker can use the information in these operating system tags in conjunction with the Web server version tag to formulate a specific attack. If this information does not hint at a specific vulnerability, an attacker can still use this information in a data-mining or information-gathering campaign, as we will see in a later chapter.
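Pulling the software, version, and operating system out of such a tag is a simple pattern-matching exercise. Here is a rough Python sketch; the function name and regular expression are our own, and the banner strings are illustrative examples modeled on the Apache footer format discussed above.

```python
import re

def parse_server_tag(tag):
    """Pull server software, version, and (optionally) the
    parenthesized OS out of a directory-listing footer tag."""
    m = re.match(
        r"(?P<software>[\w-]+)/(?P<version>[\d.]+)\s*(?:\((?P<os>[^)]+)\))?",
        tag,
    )
    return m.groupdict() if m else None

# Illustrative banner strings in the Apache footer format:
print(parse_server_tag("Apache/1.3.26 (Unix) Server at example.org Port 80"))
# {'software': 'Apache', 'version': '1.3.26', 'os': 'Unix'}
print(parse_server_tag("Apache/1.3.27 Server at www.example.com Port 80"))
# {'software': 'Apache', 'version': '1.3.27', 'os': None}
```

A scraper that harvests these footers from search results could feed the parsed tuples straight into a vulnerability lookup.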

Going out on a limb: traversal techniques

The next technique we’ll examine is known as traversal. Traversal in this context simply means to travel across. Attackers use traversal techniques to expand a small “foothold” into a larger compromise.

Directory Traversal

To illustrate how traversal might be helpful, consider a directory listing that was found with intitle:index.of inurl:“admin”.
In this example, our query brings us to a relative URL of /admin/php/tour. If you look closely at the URL, you’ll notice an “admin” directory two directory levels above our current location. If we were to click the “parent directory” link, we would be taken up one directory, to the “php” directory. Clicking the “parent directory” link from the “php” directory would take us to the “admin” directory, a potentially juicy directory. This is very basic directory traversal. We could explore each and every parent directory and each of the subdirectories, looking for juicy stuff. Alternatively, we could use a creative site search combined with an inurl search to locate a specific file or term inside a specific subdirectory, such as site:anu.edu inurl:admin ws_ftp.log, for example. We could also explore this directory structure by modifying the URL in the address bar.
Regardless of how we were to “walk” the directory tree, we would be traversing outside the Google search, wandering around on the target Web server. This is basic traversal, specifically directory traversal. Another simple example would be replacing the word admin with the word student or public. Another more serious traversal technique could allow an attacker to take advantage of software flaws to traverse to directories outside the Web server directory tree. For example, if a Web server is installed in the /var/www directory, and public Web documents are placed in /var/www/htdocs, by default any user attaching to the Web server’s top-level directory is really viewing files located in /var/www/htdocs. Under normal circumstances, the Web server will not allow Web users to view files above the /var/www/htdocs directory. Now, let’s say a poorly coded third-party software product is installed on the server that accepts directory names as arguments. A normal URL used by this product might be www.somesadsite.org/badcode.pl?page=/index.html. This URL would instruct the badcode.pl program to “fetch” the file located at /var/www/htdocs/index.html and display it to the user, perhaps with a nifty header and footer attached. An attacker might attempt to take advantage of this type of program by sending a URL such as www.somesadsite.org/badcode.pl?page=../../../etc/passwd. If the badcode.pl program is vulnerable to a directory traversal attack, it would break out of the /var/www/htdocs directory, crawl up to the real root directory of the server, dive down into the /etc directory, and “fetch” the system password file, displaying it to the user with a nifty header and footer attached!
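The path arithmetic behind this kind of traversal can be sketched in a few lines of Python. The document root and request strings below are the hypothetical ones from the badcode.pl example; the point is to show where a naive “fetch this page” parameter actually lands on disk when the server fails to reject “..” components.

```python
import posixpath

def resolve(docroot, requested):
    """Resolve a user-supplied page parameter against the document
    root the way a naive CGI script would, collapsing any ".."
    components. A safe script would reject paths that escape docroot."""
    return posixpath.normpath(posixpath.join(docroot, requested.lstrip("/")))

# The expected, benign case stays inside the Web root:
print(resolve("/var/www/htdocs", "/index.html"))          # /var/www/htdocs/index.html
# The traversal payload climbs out of the Web root entirely:
print(resolve("/var/www/htdocs", "../../../etc/passwd"))  # /etc/passwd
```

A defensive version would resolve the path and then verify that the result still starts with the document root before opening anything.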
Automated tools can do a much better job of locating these types of files and vulnerabilities, if you don’t mind all the noise they create. If you’re a programmer, you will be very interested in the Libwhisker Perl library, written and maintained by Rain Forest Puppy (RFP) and available from www.wiretrip.net/rfp. Security Focus wrote a great article on using Libwhisker. That article is available from www.securityfocus.com/infocus/1798. If you aren’t a programmer, RFP’s Whisker tool, also available from the Wiretrip site, is excellent, as are other tools based on Libwhisker, such as nikto, written by [email protected], which is said to be updated even more than the Whisker program itself. Another tool that performs (amongst other things) file and directory mining is Wikto from SensePost, which can be downloaded from www.sensepost.com/research/wikto. The advantage of Wikto is that it does not suffer from false positives on Web sites that respond with friendly 404 messages.

Incremental Substitutions

Another technique similar to traversal is incremental substitution. This technique involves replacing numbers in a URL in an attempt to find directories or files that are hidden, or unlinked from other pages. Remember that Google generally only locates files that are linked from other pages, so if it’s not linked, Google won’t find it. (Okay, there’s an exception to every rule. See the FAQ at the end of this chapter.) As a simple example, consider a document called exhc-1.xls, found with Google. You could easily modify the URL for that document, changing the 1 to a 2, making the filename exhc-2.xls. If the document is found, you have successfully used the incremental substitution technique! In some cases it might be simpler to use a Google query to find other similar files on the site, but remember, not all files on the Web are in Google’s databases. Use this technique only when you’re sure a simple query modification won’t find the files first.
This technique does not apply only to filenames, but just about anything that contains a number in a URL, even parameters to scripts. Using this technique to toy with parameters to scripts is beyond the scope of this book, but if you’re interested in trying your hand at some simple file or directory substitutions, look up some test sites with queries such as filetype:xls inurl:1.xls or intitle:index.of inurl:0001 or even an images search for 1.jpg. Now use substitution to try to modify the numbers in the URL to locate other files or directories that exist on the site. Here are some examples:
/docs/bulletin/1.xls could be modified to /docs/bulletin/2.xls
/DigLib_thumbnail/spmg/hel/0001/H/ could be changed to /DigLib_thumbnail/spmg/hel/0002/H/
/gallery/wel008-1.jpg could be modified to /gallery/wel008-2.jpg
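The substitutions above can be automated with a short helper. This Python sketch is our own illustration: it bumps the last run of digits in a URL while preserving zero padding, which handles all three examples.

```python
import re

def increment_url(url, step=1):
    """Bump the last run of digits in a URL by `step`,
    preserving any zero padding (watch those zeroes!)."""
    def bump(match):
        digits = match.group(0)
        return str(int(digits) + step).zfill(len(digits))
    # \d+(?=\D*$) targets only the final number, so digits earlier
    # in the name (like the 008 in wel008-1.jpg) are left alone.
    return re.sub(r"\d+(?=\D*$)", bump, url, count=1)

print(increment_url("/docs/bulletin/1.xls"))                # /docs/bulletin/2.xls
print(increment_url("/DigLib_thumbnail/spmg/hel/0001/H/"))  # /DigLib_thumbnail/spmg/hel/0002/H/
print(increment_url("/gallery/wel008-1.jpg"))               # /gallery/wel008-2.jpg
```

Looping this over a range of steps and issuing HEAD requests for each candidate URL is the whole technique in a nutshell.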

Extension Walking

We’ve already discussed file extensions and how the filetype operator can be used to locate files with specific file extensions. For example, we could easily search for HTM files with a query such as filetype:HTM HTM. Once you’ve located HTM files, you could apply the substitution technique to find files with the same file name and different extension. For example, if you found /docs/index.htm, you could modify the URL to /docs/index.asp to try to locate an index.asp file in the docs directory. If this seems somewhat pointless, rest assured, this is, in fact, rather pointless. We can, however, make more intelligent substitutions. Consider the directory listing. This listing shows evidence of a very common practice, the creation of backup copies of Web pages.
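A minimal sketch of extension walking in Python follows. The candidate list is an assumption of ours, seeded with common backup suffixes; in practice you would tune it to whatever extensions the site’s directory listings reveal.

```python
import posixpath

# Hypothetical candidate extensions; .bak and .old are common backup suffixes.
CANDIDATES = ["asp", "php", "bak", "old"]

def walk_extensions(url, candidates=CANDIDATES):
    """Generate sibling URLs that keep the filename but swap
    the extension, skipping the extension the URL already has."""
    root, ext = posixpath.splitext(url)
    return [root + "." + c for c in candidates if "." + c != ext]

print(walk_extensions("/docs/index.htm"))
# ['/docs/index.asp', '/docs/index.php', '/docs/index.bak', '/docs/index.old']
```

Each generated URL would then be requested to see whether the sibling file actually exists.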
Backup files can be a very interesting find from a security perspective. In some cases, backup files are older versions of an original file. Backup files on the Web have an interesting side effect: they have a tendency to reveal source code. Source code of a Web page is quite a find for a security practitioner, because it can contain behind-the-scenes information about the author, the code creation and revision process, authentication information, and more.
To see this concept in action, consider the directory listing. Clicking the link for index.php will display that page in your browser with all the associated graphics and text, just as the author of the page intended. If this were an HTM or HTML file, viewing the source of the page would be as easy as right-clicking the page and selecting view source. PHP files, by contrast, are first executed on the server. The results of that executed program are then sent to your browser in the form of HTML code, which your browser then displays. Performing a view source on HTML code that was generated from a PHP script will not show you the PHP source code, only the HTML. It is not possible to view the actual PHP source code unless something somewhere is misconfigured. An example of such a misconfiguration would be copying the PHP code to a filename that ends in something other than PHP, like BAK. Most Web servers do not understand what a BAK file is. Those servers, then, will display a PHP.BAK file as text. When this happens, the actual PHP source code is displayed as text in your browser. PHP source code can be quite revealing, showing things like Structured Query Language (SQL) queries that list information about the structure of the SQL database that is used to store the Web server’s data.
The easiest way to determine the names of backup files on a server is to locate a directory listing using intitle:index.of or to search for specific files with queries such as intitle:index.of index.php.bak or inurl:index.php.bak. Directory listings are fairly uncommon, especially among corporate-grade Web servers. However, remember that Google’s cache captures a snapshot of a page in time. Just because a Web server isn’t hosting a directory listing now, doesn’t mean the site never displayed a directory listing. One page was found in Google’s cache and was displayed as a directory listing because an index.php (or similar file) was missing. In this case, if you were to visit the server on the Web, it would look like a normal page because the index file has since been created. Clicking the cache link, however, shows this directory listing, leaving the list of files on the server exposed. This list of files can be used to intelligently locate files that still most likely exist on the server (via URL modification) without guessing at file extensions.
Directory listings also provide insight into the file extensions that are in use in other places on the site. If a system administrator or Web authoring program creates backup files with a .BAK extension in one directory, there’s a good chance that BAK files will exist in other directories as well.

Summary

The Google cache is a powerful tool in the hands of the advanced user. It can be used to locate old versions of pages that may expose information that normally would be unavailable to the casual user. The cache can be used to highlight terms in the cached version of a page, even if the terms were not used as part of the query to find that page. The cache can also be used to view a Web page anonymously via the &strip=1 URL parameter, and can be used as a basic transparent proxy server. An advanced Google user will always pay careful attention to the details contained in the cached page’s header, since there can be important information about the date the page was crawled, the terms that were found in the search, whether the cached page contains external images, links to the original page, and the text of the URL used to access the cached version of the page. Directory listings provide unique behind-the-scenes views of Web servers, and directory traversal techniques allow an attacker to poke around files that may not be intended for public view.

Fast track solutions

Anonymity With Caches

Clicking the cache link will not only load the page from Google’s database, it will also connect to the real server to access graphics and other non-HTML content.
Adding &strip=1 to the end of a cached URL will only show the HTML of a cached page. Accessing a cached page in this way will not connect to the real server on the Web, and could protect your anonymity if you use the cut and paste method shown in this chapter.

Locating Directory Listings

Directory listings contain a great deal of invaluable information.
The best way to home in on pages that contain directory listings is with a query such as intitle:index.of “parent directory” or intitle:index.of name size.

Locating Specific Directories in a Listing

You can easily locate specific directories in a directory listing by adding a directory name to an index.of search. For example, intitle:index.of inurl:backup could be used to find directory listings that have the word backup in the URL. If the word backup is in the URL, there’s a good chance it’s a directory name.

Locating Specific Files in a Directory Listing

You can find specific files in a directory listing by simply adding the filename to an index.of query, such as intitle:index.of ws_ftp.log.

Server Versioning With Directory Listings

Some servers, specifically Apache and Apache derivatives, add a server tag to the bottom of a directory listing. These server tags can be located by extending an index.of search, focusing on the phrase server at – for example, intitle:index.of server.at.
You can find specific versions of a Web server by extending this search with more information from a correctly formatted server tag. For example, the query intitle:index.of server.at “Apache Tomcat/” will locate servers running various versions of the Apache Tomcat server.

Directory Traversal

Once you have located a specific directory on a target Web server, you can use this technique to locate other directories or subdirectories.
An easy way to accomplish this task is via directory listings. Simply click the parent directory link, which will take you to the directory above the current directory. If this directory contains another directory listing, you can simply click links from that page to explore other directories. If the parent directory does not display a directory listing, you might have to resort to a more difficult method, guessing directory names and adding them to the end of the parent directory’s URL. Alternatively, consider using site and inurl keywords in a Google search.

Incremental Substitution

Incremental substitution is a fancy way of saying “take one number and replace it with the next higher or lower number.”
This technique can be used to explore a site that uses numbers in directory or filenames. Simply replace the number with the next higher or lower number, taking care to keep the rest of the file or directory name identical (watch those zeroes!). Alternatively, consider using site with either inurl or filetype keywords in a creative Google search.

Extension Walking

This technique can help locate files (for example, backup files) that have the same filename with a different extension.
The easiest way to perform extension walking is by replacing one extension with another in a URL – replacing html with bak, for example.
Directory listings, especially cached directory listings, are easy ways to determine whether backup files exist and what kinds of file extensions might be used on the rest of the site.