You'd have to be pretty out of touch not to have been inundated by Web and Internet information in the past few years. What you might not have realized is that most Web sites you visit are running on UNIX systems! Hence this hour; if you're learning UNIX, there's a very good chance that it's so that you can work with your Web server.
One of the most delightful aspects of the online community is the pervasive sense of cooperation I've found in the years I've been involved. This cooperation extends from the development of the UNIX system itself, through the creation and group evolution of individual utilities including Perl, Tin, Elm, Usenet, and the X Window System, to the maintenance of otherwise obsolete products.
It was out of an amalgamation of all of these that Apache was born.
Back in the ancient early days of the Internet, the National Center for Supercomputer Applications (NCSA) at the University of Illinois, Urbana-Champaign, was the hub of Web development. The Web browser they invented, NCSA Mosaic, went on to great acclaim, winning industry awards and forever changing the online world. Less well known is the seminal work the NCSA team did on the underlying engine, the Web server. By early 1995, however, the popular NCSA Web server was facing obsolescence as the developers went on to other projects.
Rather than let this valuable software die from lack of attention, the Internet community came to the rescue, creating a loose consortium of programmers and developers. Consolidating a wide variety of patches, bug fixes, and enhancement ideas, they released a new, improved server they called Apache.
Zoom forward five years and Apache is by far the most popular Web server on the World Wide Web, with over half of all Web sites running either Apache or a variant. Even better, versions of Apache are available for all major UNIX platforms and even Windows 95, 98, and NT. (Sorry, Mac folk, you'll need to use Apache through the MachTen environment rather than as a native MacOS program. I'll talk about this further in the final lesson of this book.)
If you've had a chance to visit http://www.intuitive.com/tyu24/, you've been interacting with an Apache Web server!
Apache is typically unpacked in either /src or /usr/src, though the source distribution can live anywhere on your UNIX machine. A bit of detective work and you can track down where the executable lives (the executable is the compiled program that can be run directly by the operating system—which means, yes, Apache is written in C, not Perl).
The first step in trying to find out where your Web server lives is to see whether there's a directory called /etc/httpd on your system. If there is, you're in luck—that will have the central configuration and log files.
If not, you can try using the ps (process status) command to see whether you have an HTTP daemon running (search for "httpd"): the matching process will probably reveal the name of the configuration file. On a UNIX system running Sun Microsystems' Solaris operating system, here's what I found out:
% ps -ef | grep httpd | head -1
websrv 21156 8677 0 14:51:25 ? 0:00 /opt/INXapache/bin/httpd -f /opt/INXapache/conf/httpd.conf
By contrast, when I ran the ps command on my Linux system (which expects slightly different arguments: aux instead of -ef), the output is rather different:
% ps aux | grep httpd | head -1
nobody 313 0.0 0.8 1548 1052 ? S 15:49 0:00 httpd -d /etc/httpd
In the former case, you can see that the configuration file is specified as /opt/INXapache/conf/httpd.conf, and in the latter case, the program is using the directory /etc/httpd as the starting point to search for the configuration file. Either way, we've found the mystery configuration file and can proceed.
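If the configuration path is buried in a long ps line, a little sed can dig it out for you. Here's a minimal sketch using the Solaris output shown above as canned input; on a live system you'd feed it the real ps output instead:

```shell
#!/bin/sh
# Sketch: extract the -f configuration-file argument from a ps line.
# The sample line is the Solaris output shown above; on a live system
# you'd generate it with:  ps -ef | grep httpd | head -1
line='websrv 21156  8677  0 14:51:25 ?  0:00 /opt/INXapache/bin/httpd -f /opt/INXapache/conf/httpd.conf'

# If httpd was started with -f, the word that follows it is the config file.
conf=`echo "$line" | sed -n 's/.* -f *\([^ ]*\).*/\1/p'`
echo "config file: $conf"
```

If nothing prints, the daemon wasn't started with an explicit -f flag, and you're back to looking for /etc/httpd or a similar default directory.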
Now that you've found your configuration file, I have to warn you: There are lots and lots of options that you won't want to worry about or touch. If you want to change anything in the configuration of your Apache server, study the file closely before changing a single line. This doesn't mean that you can't peek inside and find out a few things about your configuration, however!
There are three configuration files in most Apache installations and, confusingly, most of the configuration options we're interested in can appear in any of them. The best way to work, therefore, is to use grep across all three files at once (access.conf, srm.conf, and httpd.conf).
I'll start by checking to ensure that I have permission to run CGI programs, a crucial capability for creating sophisticated, interactive Web sites. The option is called ExecCGI:
% grep -n ExecCGI *.conf
access.conf:16:Options Indexes FollowSymLinks ExecCGI
access.conf:24:Options ExecCGI Indexes
access.conf:61:Options Indexes FollowSymLinks ExecCGI
On this particular server the CGI configuration elements are contained in the access.conf file. The -n flag adds line numbers, so it's easy to look at a 10-line slice centered on the first occurrence, for example:
% cat -n access.conf | head -21 | tail -10
12
13 # This should be changed to whatever you set DocumentRoot to.
14
15 <Directory /web/fryeboots.com>
16 Options Indexes FollowSymLinks ExecCGI
17 AllowOverride None
18 order allow,deny
19 allow from all
20 </Directory>
21
You can see that the directory /web/fryeboots.com is configured to access indexes (for example, to look for a default file called index.html in the directory), follow symbolic links (letting the developer have more flexibility in organizing files), and execute CGI programs.
Now the question is, “What domain is associated with the directory this block defines?”
CGI is the Common Gateway Interface, the environment through which all Web-based programs are run by the server. CGI programs are 95 percent identical to regular UNIX programs that present output on the screen, but they have a few added features that make them Web-friendly.
My friend grep comes into play again to answer this question. What we're looking for is the root of the Web site. In Apache parlance, it's the DocumentRoot, so I'll build a two-part grep to zero in on exactly the line I want:
% grep -n /web/fryeboots.com *.conf | grep DocumentRoot
httpd.conf:312:DocumentRoot /web/fryeboots.com
httpd.conf:320:DocumentRoot /web/fryeboots.com
Either one of these lines could be what I seek, so let's take a 20-line slice of the configuration file and see what's there:
% cat -n httpd.conf | head -323 | tail -20
   304  ErrorLog /web/onlinepartner.com/logs/error_log
   305  TransferLog /web/onlinepartner.com/logs/access_log
   306  </VirtualHost>
   307
   308  ## fryeboots.com
   309
   310  <VirtualHost fryeboots.com>
   311  ServerAdmin [email protected]
   312  DocumentRoot /web/fryeboots.com
   313  ServerName fryeboots.com
   314  ErrorLog /web/fryeboots.com/logs/error_log
   315  TransferLog /web/fryeboots.com/logs/access_log
   316  </VirtualHost>
   317
   318  <VirtualHost www.fryeboots.com>
   319  ServerAdmin [email protected]
   320  DocumentRoot /web/fryeboots.com
   321  ServerName fryeboots.com
   322  ErrorLog /web/fryeboots.com/logs/error_log
   323  TransferLog /web/fryeboots.com/logs/access_log
Notice the ErrorLog and TransferLog values (that's where the log files that we'll examine later are being stored), and notice that, as has already been shown, the home directory of this Web site can be found at /web/fryeboots.com.
You can also extrapolate some interesting information about an overall Apache configuration by slicing the preceding listing differently. I'll show you one example.
You can identify all the domains served by an Apache Web server by simply searching for the VirtualHost line:
% grep '<VirtualHost ' *.conf
httpd.conf:#<VirtualHost host.foo.com>
httpd.conf:<VirtualHost www.hostname.com>
httpd.conf:<VirtualHost www.p3tech.com>
httpd.conf:<VirtualHost www.rjcolt.com>
httpd.conf:<VirtualHost www.aeshoes.com>
httpd.conf:<VirtualHost www.coolmedium.com>
httpd.conf:<VirtualHost www.thisdate.com>
httpd.conf:<VirtualHost www.realintelligence.com>
httpd.conf:<VirtualHost www.marsee.net>
httpd.conf:<VirtualHost www.star-design.com>
httpd.conf:<VirtualHost www.jamesarmstrong.com>
httpd.conf:<VirtualHost www.trivial.net>
httpd.conf:<VirtualHost www.p3m.com>
httpd.conf:<VirtualHost www.touralaska.com>
httpd.conf:<VirtualHost www.nbarefs.com>
httpd.conf:<VirtualHost www.canp.org>
httpd.conf:<VirtualHost www.voices.com>
httpd.conf:<VirtualHost www.huntalaska.net>
httpd.conf:<VirtualHost www.cbhma.org>
httpd.conf:<VirtualHost www.cal-liability.com>
httpd.conf:<VirtualHost www.juliovision.com>
httpd.conf:<VirtualHost www.intuitive.com>
httpd.conf:<VirtualHost www.sagarmatha.com>
httpd.conf:<VirtualHost www.sportsstats.com>
httpd.conf:<VirtualHost www.videomac.com>
httpd.conf:<VirtualHost www.birthconnect.org>
httpd.conf:<VirtualHost www.birth.net>
httpd.conf:<VirtualHost www.baby.net>
httpd.conf:<VirtualHost www.chatter.net>
httpd.conf:<VirtualHost fryeboots.com>
httpd.conf:<VirtualHost www.fryeboots.com>
httpd.conf:<VirtualHost www.savetz.com>
httpd.conf:<VirtualHost www.atarimagazines.com>
httpd.conf:<VirtualHost www.faq.net>
I know, you're thinking, Wow! All those domains are on a single machine?
That's part of the beauty of Apache; it's very easy to have a pile of different domains all serving up their Web pages from a single shared machine.
There are many kinds of configurations for Apache server installations, so don't get too anxious if you can't seem to figure out the information just shown. Most likely, if you're on a shared server, you've been told the information we're peeking at anyway. You'll know the three key answers: where all your Web pages should live in your account (probably public_html or www in your account directory), whether you can run CGI programs from your Web space (that's what the ExecCGI indicates in the preceding configuration), and where your log files are stored (often it's just httpd_log or httpd_access.log in your own directory).
The Apache Web server demonstrates all that's wonderful about the synergy created by an easy, high-speed communications mechanism shared by thousands of bright and creative people. Because it's a cooperative development effort, each release of Apache is a significant improvement over the preceding release, and Apache keeps getting better and better. And as for the price, it's hard to argue with free!
Knowing how to pop in and look at the configuration of an Apache installation is a good way to find out what capabilities you have as a user of the system. Although thousands of companies offer Web site hosting, the number that actually give you all the information you want are few, so having your own tricks (and an extensive knowledge of UNIX, of course) is invaluable.
Now I know where the files associated with my Web site live (the DocumentRoot directory in the configuration file) and have ascertained that I have CGI execution permission (ExecCGI). It's time to exploit this information by creating a simple CGI program that's actually a shell script, to demonstrate how UNIX and the Web can work hand in hand.
First off, a very basic CGI script to demonstrate the concept:
% cat hello.cgi
#!/usr/bin/perl -w
print "Content-type: text/html\n\n";
print "<h1>Hello there!</h1>\n";
exit 0;
% chmod a+x hello.cgi
When this is invoked from within a Web browser, as with http://www.intuitive.com/tyu24/hello.cgi, the results are as shown in Figure 23.1.
Not very exciting, but it's a skeleton for developing considerably more complex programs.
Let's jump to a more complex CGI script, one that's actually a UNIX shell script. This time the layout will be quite similar, but I'm going to include the output of an ls command and then add a timestamp with date too:
% cat files.cgi
#!/bin/sh -f
echo "Content-type: text/html"
echo ""
echo "<h2>This directory contains the following files:</h2>"
echo "<PRE>"
ls -l
echo "</PRE>"
echo "<h4>output produced at"
date
echo "</h4>"
exit 0
The output is considerably more interesting, as shown in Figure 23.2.
It's useful to know that CGI scripts such as the two shown here can also be executed directly on the command line to test them while you're working in UNIX:
% ./hello.cgi
Content-type: text/html

<h1>Hello there!</h1>
% ./files.cgi
Content-type: text/html

<h2>This directory contains the following files:</h2>
<PRE>
total 84
drwxr-xr-x   2 taylor   taylor    1024 Jul 17  1997 BookGraphics
drwxr-xr-x   2 taylor   taylor    1024 Jul 15  1997 Graphics
-rw-r--r--   1 taylor   taylor    2714 Jun  9 23:16 authors.html
-rw-r--r--   1 taylor   taylor    1288 Jun  9 23:03 buy.html
drwxrwxrwx   2 taylor   taylor    1024 Sep 16 08:08 exchange
-rw-r--r--   1 taylor   taylor    2903 Jun  9 23:03 faves.html
-rwxrwxr-x   1 taylor   taylor     203 Sep 16 17:45 files.cgi
-rwxrwxr-x   1 taylor   taylor      97 Sep 16 17:50 hello.cgi
-rw-r--r--   1 taylor   taylor     671 Jun  9 23:04 index.html
-rw-r--r--   1 taylor   taylor    1328 Mar 13  1998 leftside.html
-rw-r--r--   1 taylor   taylor    1603 Jun  9 23:03 main.shtml
-rw-r--r--   1 taylor   taylor   14073 Jun  9 23:03 netiq.html
-rw-r--r--   1 taylor   taylor    2733 Jun  9 23:08 reviews.html
-rw-r--r--   1 taylor   taylor   35242 Jun  9 23:04 sample.html
-rw-r--r--   1 taylor   taylor    2424 Jun  9 23:04 toc.html
-rw-r--r--   1 taylor   taylor    8904 May 12  1997 tyu24.gif
</PRE>
<h4>output produced at
Wed Sep 16 17:50:46 PDT 1998
</h4>
%
This can be incredibly helpful with debugging your CGI programs as you develop them.
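You can even automate that sanity check. Here's a small sketch (check_cgi is my own name for it, not a standard utility) that verifies a CGI script emits a Content-type header followed by the required blank line:

```shell
#!/bin/sh
# Sketch: sanity-check a CGI script's output from the command line.
# A CGI response must begin with a Content-type header followed by a
# blank line; check_cgi (my name, not a standard tool) verifies both.
check_cgi() {
    "$1" > /tmp/cgi.out.$$ || return 1
    head -1 /tmp/cgi.out.$$ | grep '^Content-type:' > /dev/null ||
        { echo "$1: missing Content-type header"; return 1; }
    [ -z "`sed -n 2p /tmp/cgi.out.$$`" ] ||
        { echo "$1: no blank line after the header"; return 1; }
    echo "$1: header looks fine"
    rm -f /tmp/cgi.out.$$
}
```

Run it as check_cgi ./hello.cgi before you ever point a browser at the script, and you'll catch the single most common CGI mistake: a missing blank line after the header.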
Now let's peek at the new exchange.cgi program that offers the currency conversion program on a Web page.
I'm not going to include the listing here because the script has evolved into a rather extensive program (much of which deals with receiving information from the user within the Web environment through the QUERY_STRING, a topic beyond the scope of this book).
Here's a teaser, however:
% head -33 exchange.cgi
#!/usr/bin/perl
&read_exchange_rate; # read exchange rate into memory
# now let's cycle, asking the user for input...
print "Content-type: text/html\n\n";
print "<HTML><TITLE>Currency Exchange Calculator</TITLE>\n";
print "<BODY BGCOLOR=white LINK='#999999' VLINK='#999999'>\n";
print "<CENTER><h1>Foreign Currency Exchange Calculator</h1>\n";

$value = $ENV{"QUERY_STRING"};    # value from FORM

if ( $value ne undef ) {
  ($amnt,$curr) = &breakdown();
  $baseval = $amnt * (1/$rateof{$curr});

  print "<CENTER><TABLE BORDER=1 CELLPADDING=7><TR>\n";
  print "<TH>U.S. Dollars</TH><TH>French Franc</TH>\n";
  print "<TH>German Mark</TH><TH>Japanese Yen</TH>\n";
  print "<TH>British Pound</TH></TR>\n";
  print "<TR>\n";
  printf("<TD>%2.2f USD</TD>", $baseval * $rateof{'U'});
  printf("<TD>%2.2f Franc</TD>", $baseval * $rateof{'F'});
  printf("<TD>%2.2f DM</TD>", $baseval * $rateof{'D'});
  printf("<TD>%2.2f Yen</TD>", $baseval * $rateof{'Y'});
  printf("<TD>%2.2f Pound</TD></TR></TABLE>\n", $baseval * $rateof{'P'});
}
Figure 23.3 shows what the program looks like when you've entered an amount to be converted and the program is showing the results.
Check out the exchange rate Web page for yourself. Go to http://www.intuitive.com/tyu24/exchange/.
I'll be honest with you: Learning the ins and outs of CGI programming within the UNIX environment is complex, and you'll need to study for a while before you can whip off CGI scripts that perform specific tasks. Throw in a dose of interactivity, where you glean information from the user and produce a page based on their data, and it's 10 times more complex.
If you want to learn more about Web programming and the HTML markup language that you're seeing throughout these examples, I suggest you read another book of mine, a best-seller: Creating Cool HTML 4 Web Pages. You can learn more about it online (of course) at http://www.intuitive.com/coolweb/.
The capability to have CGI programs that are shell scripts, Perl programs, or even extensive C programs adds a great degree of flexibility, but sometimes you just want to add a tiny snippet to an existing HTML page. That's where server-side includes (SSI) come in super-handy.
Many sites are configured to prevent you from using the SSI capability for security reasons, so you might not be able to follow along exactly in this portion of the lesson. Those sites that do allow SSI almost always insist on a .shtml suffix, rather than the usual .html suffix. That tells the Web server to parse each line as the page is sent to the user.
Here's a common task you might have encountered in a Web project: letting visitors know the date and time when each page was last modified. Sure, you can add the information explicitly on the bottom of each page and then hope you'll remember to update it every time you change the page, but it's more likely that this will be just another annoying detail to maintain.
Instead, let's automate things by using a server-side include in all the HTML files, a single line that automatically checks and displays the last modified time for the specific file. This is a perfect use for a small C program and a call to a library function called strftime.
Here's the C source to lastmod.c, in its entirety:
/** lastmod.c
 ** This outputs the last-modified time of the specified file in a
 ** succinct, readable format. From the book "Teach Yourself UNIX
 ** in 24 Hours" by Dave Taylor
 **/

#include <sys/stat.h>
#include <unistd.h>
#include <time.h>
#include <stdio.h>
#include <stdlib.h>     /* for getenv() and exit() */

int
main()
{
    struct stat stbuf;
    struct tm *t;
    char buffer[32];

    if (stat(getenv("SCRIPT_FILENAME"), &stbuf) == 0) {
        t = localtime(&stbuf.st_mtime);
        strftime(buffer, 32, "%a %b %d, %y at %I:%M %p", t);
        printf("<font size=2 color=\"#999999\">This page last modified %s</font>\n",
            buffer);
    }
    exit(0);
}
The environment variable SCRIPT_FILENAME is quite helpful for SSI programming; it always contains the full path of the HTML file that contains the reference. That is, if you have a page called resume.html in the directory /home/joanne/public_html, SCRIPT_FILENAME contains /home/joanne/public_html/resume.html.
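If you'd rather not compile C, the same idea can be roughed out as a shell script. This is only a sketch: it depends on SCRIPT_FILENAME being set by the server just as lastmod.c does, and it parses ls -l for the date rather than calling strftime, so the date format is whatever your ls emits.

```shell
#!/bin/sh
# Sketch: a shell stand-in for lastmod.c. Assumes the Web server sets
# SCRIPT_FILENAME; the fallback below is only for command-line testing.
: ${SCRIPT_FILENAME:=/etc/hosts}

# ls -l fields 6-8 are the modification date (format varies by system).
moddate=`ls -l "$SCRIPT_FILENAME" | awk '{print $6, $7, $8}'`
echo "<font size=2 color=\"#999999\">This page last modified $moddate</font>"
```

The C version is still the better citizen for a busy server, but a sketch like this is handy for experimenting before you reach for the compiler.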
Server-side includes are always referenced by their full filename, and the general format is to embed the reference in an HTML comment:
<!--#exec cmd="/web/bin/lastmod" -->
That's all that's needed in the HTML source file to automatically add a “last modified” entry to a Web page.
For an example of how to use this, here are the last 14 lines of the main HTML page for Teach Yourself UNIX in 24 Hours:
% tail -14 main.shtml
<P>
<hr width=40% noshade size=1>
<font size=2>
You're visitor
<!--#exec cmd="/web/bin/counter .counter"-->
to this site
</font>
<P>
<!--#exec cmd="/web/bin/lastmod"-->
</CENTER>
<P>
</BODY>
</HTML>
Notice that the filename suffix is .shtml to ensure that the Web server scans the file for any possible server-side includes. Notice too that the counter at the bottom of the page is also a server-side include!
Figure 23.4 shows how the bottom of this page looks when it's delivered up within a browser. Notice that the SSI lastmod sequence generates the entire phrase last modified, as well as the date.
Between server-side includes and CGI programs, you can start to see how really knowing the ins and outs of your Web server can reap great benefits in terms of creating a dramatic and smart Web site.
There's one more area to consider with Web server interaction before this lesson is over, however: analyzing traffic logs and extracting traffic data.
Whether or not you choose to explore the capabilities of the Apache Web server for delivering custom pages through CGI programming or server-side includes, you'll undoubtedly want to dig through your Web site log files to see what has been going on. The good news is that you're running the best environment for accomplishing the task: UNIX.
The wide set of tools and commands available within UNIX makes analyzing the otherwise cryptic and usually quite long log files a snap.
On most shared Web servers, you have a file located in your home directory called httpd.log, httpd_access.log, or something similar. If you're lucky, the log file is actually broken down into its two components: errors and fulfilled requests.
On my system, that's exactly how it works. I'm curious to learn more about the traffic that was seen on the www.trivial.net Web site recently.
Once I move into the Trivial Net area on the Web server, I know that the traffic is all kept in a directory called logs. Within that directory, here are the files available:
% ls -l
total 19337
-rw-r--r-- 1 taylor taylor 9850706 Sep 1 00:08 access_log
-rw-r--r-- 1 taylor taylor 4608036 Sep 1 00:08 agent_log
-rw-r--r-- 1 taylor taylor 5292 Sep 1 00:08 error_log
-rw-r--r-- 1 taylor taylor 5253261 Sep 1 00:08 referer_log
As you can see, not only is there an access_log and an error_log, which contain the fulfilled and failed access requests, respectively, but there also are two additional logs. The agent_log file records the Web browsers that visitors used when visiting the site, and referer_log records the page that users were viewing immediately prior to visiting the Trivial Net site.
Interested in computer trivia? Then Trivial Net is a great site to visit, at www.trivial.net. Even better, the site is built around both C- and Perl-based CGI programs, just as you've seen in this lesson.
These are pretty huge files, as you can see. The access_log file is 9.8 megabytes! A quick invocation of the wc command and you can see the number of lines in each file instead:
% wc -l *
104883 access_log
104881 agent_log
52 error_log
102047 referer_log
311863 total
The good thing to notice is that there are only 52 lines of error messages encountered during the month of use, against 104,883 “hits” to the site.
Let's start by peeking at the error_log to see what exactly was going on:
$ head -10 error_log
[Sat Aug  1 02:28:55 1998] lingering close lost connection to client hertzelia-204-51.access.net.il
[Sat Aug  1 03:52:22 1998] access to /web/trivial.net/robots.txt failed for heavymetal.fireball.de, reason: File does not exist
[Sat Aug  1 06:50:04 1998] lingering close lost connection to client po90.snet.ne.jp
[Sun Aug  2 15:39:14 1998] access to /web/trivial.net/robots.txt failed for barn.farm.gol.net, reason: File does not exist
[Sun Aug  2 17:16:34 1998] access to /web/trivial.net/robots.txt failed for nahuel.ufasta.com.ar, reason: File does not exist
[Sun Aug  2 19:20:55 1998] access to /web/trivial.net/robots.txt failed for c605d216.infoseek.com, reason: File does not exist
[Sun Aug  2 21:13:12 1998] access to /web/trivial.net/playgame.cgi' failed for saturn.tacorp.com, reason: File does not exist
[Sun Aug  2 21:58:05 1998] access to /web/trivial.net/robots.txt failed for as016.cland.net, reason: File does not exist
[Mon Aug  3 03:21:54 1998] access to /web/trivial.net/robots.txt failed for mail.bull-ingenierie.fr, reason: File does not exist
[Mon Aug  3 08:58:26 1998] access to /web/trivial.net/robots.txt failed for nahuel.ufasta.com.ar, reason: File does not exist
The information in the [] indicates the exact time and date of the failed request, and the last portion indicates the actual error encountered.
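Because each entry starts with a fixed-width timestamp, cut can also tally errors per day. A quick sketch, with a few canned lines standing in for the real error_log:

```shell
#!/bin/sh
# Sketch: count error_log entries per day. Characters 2-11 of each
# line are the "Sat Aug  1" portion of the timestamp; uniq -c then
# counts runs of identical dates. Canned sample data stands in for
# a real error_log.
cat > /tmp/err.$$ <<'EOF'
[Sat Aug  1 02:28:55 1998] lingering close lost connection to client somewhere.net
[Sat Aug  1 06:50:04 1998] lingering close lost connection to client po90.snet.ne.jp
[Sun Aug  2 15:39:14 1998] access to /web/trivial.net/robots.txt failed for barn.farm.gol.net, reason: File does not exist
EOF
cut -c2-11 /tmp/err.$$ | uniq -c
rm -f /tmp/err.$$
```

Because log entries arrive in chronological order, the dates are already grouped and no sort is needed before uniq.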
Because we're using UNIX, it's easy to extract and count the failed access requests to ensure that there are no surprises in the other 42 lines of the file:
% grep 'File does not exist' error_log | cut -d' ' -f9
/web/trivial.net/robots.txt
/web/trivial.net/robots.txt
/web/trivial.net/robots.txt
/web/trivial.net/robots.txt
/web/trivial.net/playgame.cgi'
/web/trivial.net/robots.txt
/web/trivial.net/robots.txt
/web/trivial.net/robots.txt
/web/trivial.net/robots.txt
/web/trivial.net/robots.txt
/web/trivial.net/robots.txt
/web/trivial.net/robots.txt
/web/trivial.net/robots.txt
/web/trivial.net/picasso
/web/trivial.net/robots.txt
/web/trivial.net/robots.txt
/web/trivial.net/robots.txt
/web/trivial.net/robots.txt
By using the ever-helpful cut command to extract just the ninth word of each line, coupled with grep to select only the 'File does not exist' messages, I've extracted just the list of bad filenames. Now for a standard pipeline for this kind of work: Sort the list, pipe it to the uniq program (with -c prefacing each unique line with a count of its occurrences), then pipe that to another invocation of sort, this time using a reverse numeric sort (-rn) to list the most commonly failed filenames first:
% grep 'File does not exist' error_log | cut -d' ' -f9 | sort | uniq -c | sort -rn
16 /web/trivial.net/robots.txt
1 /web/trivial.net/playgame.cgi'
1 /web/trivial.net/picasso
You could easily make this an alias or drop it into a shell script to avoid having to type all this each time.
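Here's one way that might look as a little function; badfiles is my own name for it (not a standard command), and it assumes the error_log format shown above:

```shell
#!/bin/sh
# Sketch: the failed-file pipeline wrapped as a reusable function.
# badfiles is my name for it, not a standard utility; pass it the
# error_log you want summarized.
badfiles() {
    grep 'File does not exist' "$1" |
        cut -d' ' -f9 | sort | uniq -c | sort -rn
}
```

Drop the function into a script or your shell startup file, and badfiles error_log gives you the summary with no retyping.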
Let's get to the more interesting log file, however, the access_log that contains all the successful traffic events:
% head -10 access_log
ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:01 -0700] "GET / HTTP/1.0" 200 2976
ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:02 -0700] "GET /animated-banner.gif HTTP/1.0" 200 3131
ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:02 -0700] "GET /play-the-game.gif HTTP/1.0" 200 1807
ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:02 -0700] "GET /signup-now.gif HTTP/1.0" 200 2294
ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:03 -0700] "GET /buy-the-book.gif HTTP/1.0" 200 1541
ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:04 -0700] "GET /kudos.gif HTTP/1.0" 200 804
ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:05 -0700] "GET /intsys.gif HTTP/1.0" 200 1272
204.116.54.6 - - [01/Aug/1998:00:18:04 -0700] "GET /animated-banner.gif HTTP/1.0" 200 3131
204.116.54.6 - - [01/Aug/1998:00:18:04 -0700] "GET /play-the-game.gif HTTP/1.0" 200 1807
204.116.54.6 - - [01/Aug/1998:00:18:04 -0700] "GET /signup-now.gif HTTP/1.0" 200 2294
You can see the basic layout of the information, though a bit of explanation will prove very useful. There are some fields you probably won't care about, so focus on the first (which indicates what domain the user came from), the fourth (the time and date of the access), the seventh (the requested file), and the tenth (the size, in bytes, of the resultant transfer).
For example, the first line tells you that someone from pacbell.net requested the home page of the site (that's what the / means in the request) at 12 minutes after midnight on the 1st of August, and the server sent back 2,976 bytes of information (the file index.html, as it turns out).
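An awk one-liner makes it easy to see just those interesting fields. A small sketch, using the first log line above as canned input:

```shell
#!/bin/sh
# Sketch: print just the host, timestamp, requested file, and bytes
# sent (fields 1, 4, 7, and 10 when awk splits on whitespace). The
# sample line is the first access_log record shown above.
line='ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:01 -0700] "GET / HTTP/1.0" 200 2976'
echo "$line" | awk '{print $1, $4, $7, $10}'
```

Swap the echo for cat access_log and you get a four-column version of the whole log, which is a much friendlier starting point for further slicing.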
Every time an element of a Web page is accessed, a record of that event is added to the access_log file. To count the number of page views, therefore, simply use the -v (exclusion) feature of egrep (the regular expression brother of grep) to subtract the GIF and JPEG image references, and then count the remainder:
% cat access_log | egrep -v '(\.gif|\.GIF|\.jpg|\.JPG|\.jpeg|\.JPEG)' | wc -l
43211
This means that of the 104,883 hits to the site (see the output from wc earlier in this lesson), all but 43,211 were requests for graphics. In the parlance of Web traffic experts, the site had 104,883 hits, of which 43,211 were actual page views.
This is where the cut program is going to prove so critical, with its capability to extract a specific field or word of each line. First, let's quickly look at the specific files accessed and identify the top 15 requested files, whether they be graphics, HTML, or something else.
% cat access_log | cut -d' ' -f7 | sort | uniq -c | sort -rn | head -15
39040 /playgame.cgi
13255 /intsys.gif
12282 /banner.gif
7002 /wrong.gif
4464 /right.gif
3998 /Adverts/computerbowl.gif
3506 /
2736 /play-the-game.gif
2692 /animated-banner.gif
2660 /signup-now.gif
2652 /buy-the-book.gif
2624 /kudos.gif
1623 /final-score.gif
621 /forget.gif
411 /Results/3.gif
Interesting results—the playgame.cgi program was by far the most commonly accessed file, with the Intuitive Systems graphic a distant second place. More entertainingly, notice that people were wrong almost twice as often as they were right in answering questions (7,002 for wrong.gif versus 4,464 for right.gif).
One more calculation. This time, I'm going to use another UNIX utility that is worth knowing: awk. In many ways, awk is a precursor to Perl that isn't as flexible or robust. However, in this case it's perfect for what we want: a summary of the total number of bytes transferred for the entire month. Because awk can easily slice a line into individual fields, this is a piece of cake:
% cat access_log | awk '{ sum += $10 } END { print sum }'
325805962
%
The quoted argument to awk is a tiny program that adds the value of field 10 to the variable sum for each line seen and, when it's done with all the lines, prints the value of the variable sum.
To put this number in context, it's 325,805,962 bytes, or about 300 megabytes of information in 30 days (or 10 megabytes/day).
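The same awk program is easy to extend to do the unit conversion itself. A sketch, with three canned log lines standing in for the real access_log and a 30-day divisor assumed, as in the text:

```shell
#!/bin/sh
# Sketch: sum field 10 (bytes transferred) and report megabytes plus
# a per-day average, assuming a 30-day log as in the text. The canned
# lines below stand in for a real access_log.
cat > /tmp/al.$$ <<'EOF'
host - - [01/Aug/1998:00:12:01 -0700] "GET / HTTP/1.0" 200 2976
host - - [01/Aug/1998:00:12:02 -0700] "GET /a.gif HTTP/1.0" 200 3131
host - - [01/Aug/1998:00:12:02 -0700] "GET /b.gif HTTP/1.0" 200 1807
EOF
awk '{ sum += $10 }
     END { printf "%d bytes = %.2f MB (%.3f MB/day)\n",
           sum, sum/1048576, sum/1048576/30 }' /tmp/al.$$
rm -f /tmp/al.$$
```

Point it at the real access_log instead and you get the 300-megabyte monthly figure directly, already converted into readable units.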
The previous examples of using various UNIX tools to extract and analyze information contained in the Apache log files demonstrate not only that UNIX is powerful and capable, but also that knowing the right tools to use can make some complex tasks incredibly easy. In the last example, the awk program did all the work, and it took only a second or two to add up a list of numbers that would overwhelm even the most accomplished accountant.