Working with a UNIX Web Server

You'd have to be pretty out of touch not to have been inundated by Web and Internet information in the past few years. What you might not have realized is that most Web sites you visit are running on UNIX systems! Hence this hour; if you're learning UNIX, there's a very good chance that it's so that you can work with your Web server.

Task 23.1: Exploring Apache Configuration Files

One of the most delightful aspects of the online community is the sense of cooperation that I've found pervasive in the years I've been involved. This cooperation extends from the development of the UNIX system itself, through the creation and group evolution of individual utilities and services including Perl, Tin, Elm, Usenet, and the X Window System, to the maintenance of otherwise obsolete products.


It was out of an amalgamation of all of these that Apache was born.

Back in the ancient early days of the Internet, the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign was the hub of Web development. The Web browser they invented, NCSA Mosaic, went on to great acclaim, winning industry awards and forever changing the online world. Less well known is the seminal work the NCSA team did on the underlying engine, the Web server. By early 1995, however, the popular NCSA Web server was facing obsolescence as the developers went on to other projects.

Rather than let this valuable software die from lack of attention, the Internet community came to the rescue, creating a loose consortium of programmers and developers. Consolidating a wide variety of patches, bug fixes, and enhancement ideas, they released a new, improved server they called Apache.

Zoom forward five years and Apache is by far the most popular Web server on the World Wide Web, with over half of all Web sites running either Apache or a variant. Even better, versions of Apache are available for all major UNIX platforms and even Windows 95, 98, and NT. (Sorry, Mac folk, you'll need to use Apache through the MachTen environment rather than as a native MacOS program. I'll talk about this further in the final lesson of this book.)

If you've had a chance to visit http://www.intuitive.com/tyu24/, you've been interacting with an Apache Web server!


Apache is typically unpacked in either /src or /usr/src, though the source distribution can live anywhere on your UNIX machine. A bit of detective work and you can track down where the executable lives (the executable is the compiled program that can be run directly by the operating system—which means, yes, Apache is written in C, not Perl).

  1. The first step in trying to find out where your Web server lives is to see whether there's a directory called /etc/httpd on your system. If there is, you're in luck—that will have the central configuration and log files.

    If not, you can try using the ps (process status) command to see whether you have an HTTP daemon (search for “httpd”) running: Its command line will probably reveal the name of the configuration file. On a UNIX system running Sun Microsystems' Solaris operating system, here's what I found out:

    % ps -ef | grep httpd | head -1
      websrv 21156  8677  0 14:51:25 ?   0:00 /opt/INXapache/bin/httpd -f /opt/INXapache/conf/httpd.conf
    

    By contrast, when I ran the ps command on my Linux system (with the slightly different arguments expected: It uses aux instead of -ef), the output is rather different:

    % ps aux | grep httpd | head -1
    nobody     313  0.0  0.8  1548  1052  ?  S   15:49   0:00 httpd -d /etc/httpd
    

    In the former case, you can see that the configuration file is specified as /opt/INXapache/conf/httpd.conf, and in the latter case, the program is using the directory /etc/httpd as the starting point to search for the configuration file. Either way, we've found the mystery configuration file and can proceed.
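
    If neither of these tricks pans out on your system, a brute-force search will turn up the file too. Here's a minimal sketch (the starting directories are just common guesses; add any others you suspect, and ignore any harmless “permission denied” complaints along the way):

    % find /etc /opt /usr/local -name httpd.conf -print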

  2. Now that you've found your configuration file, I have to warn you: There are lots and lots of options you don't want to even worry about or touch. If you want to change anything with the configuration of your Apache server, you'll want to study it closely before changing a single line. This doesn't mean that you can't peek inside and find out a few things about your configuration, however!

    There are three configuration files in most Apache installations and, confusingly, most of the configuration options we're interested in can appear in any of them. The best way to work, therefore, is to use grep across all three files at once (access.conf, srm.conf and httpd.conf).

    I'll start by checking to ensure that I have permission to run CGI programs, a crucial capability for creating sophisticated, interactive Web sites. The option is called ExecCGI:

    % grep -n ExecCGI *.conf
    access.conf:16:Options Indexes FollowSymLinks ExecCGI
    access.conf:24:Options   ExecCGI  Indexes
    access.conf:61:Options Indexes FollowSymLinks ExecCGI
    

    On this particular server the CGI configuration elements are contained in the access.conf file. The -n flag adds line numbers, so it's easy to look at a 10-line slice centered on the first occurrence: head -21 keeps everything up through line 21, and tail -10 then trims the result down to lines 12 through 21:

    %  cat -n access.conf | head -21 | tail -10
        12
        13  # This should be changed to whatever you set DocumentRoot to.
        14
        15  <Directory /web/fryeboots.com>
        16  Options Indexes FollowSymLinks ExecCGI
        17  AllowOverride None
        18  order allow,deny
        19  allow from all
        20  </Directory>
        21
    

    You can see that the directory /web/fryeboots.com is configured to access indexes (for example, to look for a default file called index.html in the directory), follow symbolic links (letting the developer have more flexibility in organizing files), and execute CGI programs.

    Now the question is, “What domain is associated with the directory this <Directory> block defines?”

CGI is the Common Gateway Interface, the environment through which all Web-based programs are run by the server. CGI programs are 95 percent identical to regular UNIX programs that present output on the screen, but they have a few added features that make them Web-friendly.


  1. My friend grep comes into play again to answer this question. What we're looking for is the root of the Web site. In Apache parlance, it's the DocumentRoot, so I'll build a two-part grep to zero in on exactly the line I want:

    % grep -n /web/fryeboots.com *.conf | grep DocumentRoot
    httpd.conf:312:DocumentRoot /web/fryeboots.com
    httpd.conf:320:DocumentRoot /web/fryeboots.com
    
  2. Either one of these lines could be what I seek, so let's take a 20-line slice of the configuration file and see what's there:

    % cat -n httpd.conf | head -323 | tail -20
       304  ErrorLog /web/onlinepartner.com/logs/error_log
       305  TransferLog /web/onlinepartner.com/logs/access_log
       306  </VirtualHost>
       307
       308  ## fryeboots.com
       309
       310  <VirtualHost fryeboots.com>
       311  ServerAdmin [email protected]
       312  DocumentRoot /web/fryeboots.com
       313  ServerName fryeboots.com
       314  ErrorLog /web/fryeboots.com/logs/error_log
       315  TransferLog /web/fryeboots.com/logs/access_log
       316  </VirtualHost>
       317
       318  <VirtualHost www.fryeboots.com>
       319  ServerAdmin [email protected]
       320  DocumentRoot /web/fryeboots.com
       321  ServerName fryeboots.com
       322  ErrorLog /web/fryeboots.com/logs/error_log
       323  TransferLog /web/fryeboots.com/logs/access_log
    

    Notice the ErrorLog and TransferLog values (that's where the log files that we'll examine later are being stored), and notice that, as has already been shown, the home directory of this Web site can be found at /web/fryeboots.com.
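
    Incidentally, the same grep trick pulls out the log locations for every site on the server at once. A quick sketch:

    % egrep '(ErrorLog|TransferLog)' *.conf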

  3. You can also extrapolate some interesting information about an overall Apache configuration by searching the configuration files in other ways. I'll show you one example.

    You can identify all the domains served by an Apache Web server by simply searching for the VirtualHost line:

    % grep '<VirtualHost ' *.conf
    httpd.conf:#<VirtualHost host.foo.com>
    httpd.conf:<VirtualHost www.hostname.com>
    httpd.conf:<VirtualHost www.p3tech.com>
    httpd.conf:<VirtualHost www.rjcolt.com>
    httpd.conf:<VirtualHost www.aeshoes.com>
    httpd.conf:<VirtualHost www.coolmedium.com>
    httpd.conf:<VirtualHost www.thisdate.com>
    httpd.conf:<VirtualHost www.realintelligence.com>
    httpd.conf:<VirtualHost www.marsee.net>
    httpd.conf:<VirtualHost www.star-design.com>
    httpd.conf:<VirtualHost www.jamesarmstrong.com>
    httpd.conf:<VirtualHost www.trivial.net>
    httpd.conf:<VirtualHost www.p3m.com>
    httpd.conf:<VirtualHost www.touralaska.com>
    httpd.conf:<VirtualHost www.nbarefs.com>
    httpd.conf:<VirtualHost www.canp.org>
    httpd.conf:<VirtualHost www.voices.com>
    httpd.conf:<VirtualHost www.huntalaska.net>
    httpd.conf:<VirtualHost www.cbhma.org>
    httpd.conf:<VirtualHost www.cal-liability.com>
    httpd.conf:<VirtualHost www.juliovision.com>
    httpd.conf:<VirtualHost www.intuitive.com>
    httpd.conf:<VirtualHost www.sagarmatha.com>
    httpd.conf:<VirtualHost www.sportsstats.com>
    httpd.conf:<VirtualHost www.videomac.com>
    httpd.conf:<VirtualHost www.birthconnect.org>
    httpd.conf:<VirtualHost www.birth.net>
    httpd.conf:<VirtualHost www.baby.net>
    httpd.conf:<VirtualHost www.chatter.net>
    httpd.conf:<VirtualHost fryeboots.com>
    httpd.conf:<VirtualHost www.fryeboots.com>
    httpd.conf:<VirtualHost www.savetz.com>
    httpd.conf:<VirtualHost www.atarimagazines.com>
    httpd.conf:<VirtualHost www.faq.net>
    

    I know, you're thinking, Wow! All those domains are on a single machine?

    That's part of the beauty of Apache; it's very easy to have a pile of different domains all serving up their Web pages from a single shared machine.
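
    If you just want a quick count of the domains rather than the full list, tack wc onto the pipeline, with a second grep to skip the commented-out example entry. A minimal sketch:

    % grep '<VirtualHost ' *.conf | grep -v '#' | wc -l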

There are many kinds of configurations for Apache server installations, so don't get too anxious if you can't seem to figure out the information just shown. Most likely, if you're on a shared server, you've been told the information we're peeking at anyway. You'll know the three key answers: where all your Web pages should live in your account (probably public_html or www in your account directory), whether you can run CGI programs from your Web space (that's what the ExecCGI indicates in the preceding configuration), and where your log files are stored (often it's just httpd_log or httpd_access.log in your own directory).
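
If you're curious whether your server expects public_html, www, or something else entirely, the answer is usually in the configuration files too: Apache's UserDir directive controls the per-user Web directory. A quick sketch using the same grep approach:

    % grep UserDir *.conf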


The Apache Web server demonstrates all that's wonderful about the synergy created by an easy, high-speed communications mechanism shared by thousands of bright and creative people. Thanks to this cooperative development effort, each release of Apache is a significant improvement over the preceding one, and the program just keeps getting better. And as for the price, it's hard to argue with free!


Knowing how to pop in and look at the configuration of an Apache installation is a good way to find out what capabilities you have as a user of the system. Although thousands of companies offer Web site hosting, few actually give you all the information you want, so having your own tricks (and an extensive knowledge of UNIX, of course) is invaluable.

Task 23.2: Creating a Simple CGI Program

Now I know where the files associated with my Web site live (the DocumentRoot directory in the configuration file) and have ascertained that I have CGI execution permission (ExecCGI). It's time to exploit this information by creating a few simple CGI programs, including one that's actually a shell script, to demonstrate how UNIX and the Web can work hand in hand.


  1. First off, a very basic CGI script to demonstrate the concept:

    % cat hello.cgi
    #!/usr/bin/perl -w
    print "Content-type: text/html\n\n";
    print "<h1>Hello there!</h1>\n";
    exit 0;
    % chmod a+x hello.cgi

    When this is invoked from within a Web browser, as with http://www.intuitive.com/tyu24/hello.cgi, the results are as shown in Figure 23.1.

    Figure 23.1. Hello there—our first CGI program.

    Not very exciting, but it's a skeleton for developing considerably more complex programs.

  2. Let's jump to a more complex CGI script, one that's actually a UNIX shell script. This time the layout will be quite similar, but I'm going to include the output of an ls command and then add a timestamp with date too:

    % cat files.cgi
    #!/bin/sh -f
    echo "Content-type: text/html"
    echo ""     # the blank line marks the end of the HTTP header
    echo "<h2>This directory contains the following files:</h2>"
    echo "<PRE>"
    ls -l
    echo "</PRE>"
    echo "<h4>output produced at"
    date
    echo "</h4>"
    exit 0
    

    The output is considerably more interesting, as shown in Figure 23.2.

  3. It's useful to know that CGI scripts such as the two shown here can also be executed directly on the command line to test them while you're working in UNIX:

    % ./hello.cgi
    Content-type: text/html
    
    <h1>Hello there!</h1>
    
    % ./files.cgi
    Content-type: text/html

    <h2>This directory contains the following files:</h2>
    <PRE>
    total 84
    drwxr-xr-x   2 taylor   taylor       1024 Jul 17  1997 BookGraphics
    drwxr-xr-x   2 taylor   taylor       1024 Jul 15  1997 Graphics
    -rw-r--r--   1 taylor   taylor       2714 Jun  9 23:16 authors.html
    -rw-r--r--   1 taylor   taylor       1288 Jun  9 23:03 buy.html
    drwxrwxrwx   2 taylor   taylor       1024 Sep 16 08:08 exchange
    -rw-r--r--   1 taylor   taylor       2903 Jun  9 23:03 faves.html
    -rwxrwxr-x   1 taylor   taylor        203 Sep 16 17:45 files.cgi
    -rwxrwxr-x   1 taylor   taylor         97 Sep 16 17:50 hello.cgi
    -rw-r--r--   1 taylor   taylor        671 Jun  9 23:04 index.html
    -rw-r--r--   1 taylor   taylor       1328 Mar 13  1998 leftside.html
    -rw-r--r--   1 taylor   taylor       1603 Jun  9 23:03 main.shtml
    -rw-r--r--   1 taylor   taylor      14073 Jun  9 23:03 netiq.html
    -rw-r--r--   1 taylor   taylor       2733 Jun  9 23:08 reviews.html
    -rw-r--r--   1 taylor   taylor      35242 Jun  9 23:04 sample.html
    -rw-r--r--   1 taylor   taylor       2424 Jun  9 23:04 toc.html
    -rw-r--r--   1 taylor   taylor       8904 May 12  1997 tyu24.gif
    </PRE>
    <h4>output produced at
    Wed Sep 16 17:50:46 PDT 1998
    </h4>
    %
    

    Figure 23.2. A file listing and the current date and time.

    This can be incredibly helpful with debugging your CGI programs as you develop them.

  4. Now let's peek at the new exchange.cgi program, which offers a currency conversion utility on a Web page.

    I'm not going to include the listing here because the script has evolved into a rather extensive program (much of which deals with receiving information from the user within the Web environment through the QUERY_STRING, a topic beyond the scope of this book).

    Here's a teaser, however:

    % head -33 exchange.cgi
    #!/usr/bin/perl

    &read_exchange_rate;     # read exchange rate into memory

    # now let's cycle, asking the user for input...

    print "Content-type: text/html\n\n";

    print "<HTML><TITLE>Currency Exchange Calculator</TITLE>\n";
    print "<BODY BGCOLOR=white LINK='#999999' VLINK='#999999'>\n";
    print "<CENTER><h1>Foreign Currency Exchange Calculator</h1>\n";

    $value = $ENV{"QUERY_STRING"};          # value from FORM

    if ( $value ne undef ) {

      ($amnt,$curr) = &breakdown();

      $baseval = $amnt * (1/$rateof{$curr});

      print "<CENTER><TABLE BORDER=1 CELLPADDING=7><TR>\n";
      print "<TH>U.S. Dollars</TH><TH>French Franc</TH>\n";
      print "<TH>German Mark</TH><TH>Japanese Yen</TH>\n";
      print "<TH>British Pound</TH></TR>\n";
      print "<TR>\n";

      printf("<TD>%2.2f USD</TD>", $baseval * $rateof{'U'});
      printf("<TD>%2.2f Franc</TD>", $baseval * $rateof{'F'});
      printf("<TD>%2.2f DM</TD>", $baseval * $rateof{'D'});
      printf("<TD>%2.2f Yen</TD>", $baseval * $rateof{'Y'});
      printf("<TD>%2.2f Pound</TD></TR></TABLE>\n", $baseval * $rateof{'P'});
    }
    

    Figure 23.3 shows what the program looks like when you've entered an amount to be converted and the program is showing the results.

    Figure 23.3. 100 Yen isn't what it used to be.
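
    Because the form input arrives through the QUERY_STRING environment variable, you can even test this script from the command line by setting the variable yourself. A sketch (the value shown is purely hypothetical; the exact format depends on what the &breakdown subroutine expects):

    % env QUERY_STRING='100Y' ./exchange.cgi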

Check out the exchange rate Web page for yourself. Go to http://www.intuitive.com/tyu24/exchange/.


I'll be honest with you. Learning the ins and outs of CGI programming within the UNIX environment is complex, and you'll need to study for a while before you can whip off CGI scripts that perform specific tasks. Throw in a dose of interactivity, where you glean information from the user and produce a page based on their data, and it's 10 times more complex.


If you want to learn more about Web programming and the HTML markup language that you're seeing throughout these examples, I suggest you read another book of mine, a best-seller: Creating Cool HTML 4 Web Pages. You can learn more about it online (of course) at http://www.intuitive.com/coolweb/.

Task 23.3: A Server-Side Include Program

The capability to have CGI programs that are shell scripts, Perl programs, or even extensive C programs adds a great degree of flexibility, but sometimes you want to be able to just add a tiny snippet to an existing HTML page. That's where server-side includes (SSI) can come in super-handy.


Many sites are configured to prevent you from using the SSI capability for security reasons, so you might not be able to follow along exactly in this portion of the lesson. Those sites that do allow SSI almost always insist on a .shtml suffix, rather than the usual .html suffix. That tells the Web server to parse each line as the page is sent to the user.
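
You can find out whether your own server is set up this way with the same grep trick from Task 23.1: Look for the Includes option and an .shtml handler in the configuration files. A quick sketch (directive spellings vary a bit across Apache versions):

    % egrep '(Includes|shtml)' *.conf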

  1. Here's a common task you might have encountered in a Web project: letting visitors know the date and time when each page was last modified. Sure, you can add the information explicitly on the bottom of each page and then hope you'll remember to update it every time you change the page, but it's more likely that this will be just another annoying detail to maintain.

    Instead, let's automate things by using a server-side include in all the HTML files, a single line that automatically checks and displays the last modified time for the specific file. This is a perfect use for a small C program and a call to a library function called strftime.

    Here's the C source to lastmod.c, in its entirety:

    /**                     lastmod.c                       **/

    /** This outputs the last-modified-time of the specified file in
        a succinct, readable format.

        From the book "Teach Yourself UNIX in 24 Hours" by Dave Taylor
    **/

    #include <sys/stat.h>
    #include <unistd.h>
    #include <time.h>
    #include <stdio.h>
    #include <stdlib.h>     /* for getenv() and exit() */

    int main(void)
    {
            struct stat stbuf;
            struct tm   *t;
            char   buffer[32];
            char   *filename = getenv("SCRIPT_FILENAME");

            /* quietly output nothing if the variable is unset
               or the file can't be examined */
            if (filename != NULL && stat(filename, &stbuf) == 0) {
              t = localtime(&stbuf.st_mtime);
              strftime(buffer, 32, "%a %b %d, %y at %I:%M %p", t);
              printf(
                "<font size=2 color=\"#999999\">This page last modified %s</font>\n",
                buffer);
            }
            exit(0);
    }
    

    The environment variable SCRIPT_FILENAME is quite helpful for SSI programming; it always contains the full path of the HTML file that contains the reference. That is, if you have a page called resume.html in the directory /home/joanne/public_html, SCRIPT_FILENAME contains /home/joanne/public_html/resume.html.

  2. Server-side includes are always referenced by their full filename, and the general format is to embed the reference in an HTML comment:

    <!--#exec cmd="/web/bin/lastmod" -->
    

    That's all that's needed in the HTML source file to automatically add a “last modified” entry to a Web page.
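
    Of course, the lastmod binary has to exist at that path before the reference will work. Compiling and installing it might look like this (a sketch; /web/bin matches the reference above, but your server's layout may differ):

    % cc -o lastmod lastmod.c
    % cp lastmod /web/bin/lastmod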

  3. For an example of how to use this, here are the last 14 lines of the main HTML page for Teach Yourself UNIX in 24 Hours:

    % tail -14 main.shtml
    <P>
    <hr width=40% noshade size=1>
    <font size=2>
    You're visitor
    <!--#exec cmd="/web/bin/counter .counter"-->
    to this site
    </font>
    <P>
    <!--#exec cmd="/web/bin/lastmod"-->
    </CENTER>
    <P>
    </BODY>
    
    </HTML>
    

    Notice that the filename suffix is .shtml to ensure that the Web server scans the file for any possible server-side includes. Also notice that the counter on the bottom of the page is also a server-side include!

    Figure 23.4 shows how the bottom of this page looks when it's delivered up within a browser. Notice that the SSI lastmod sequence generates the entire phrase last modified, as well as the date.
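
    Just as with the CGI scripts earlier in this lesson, you can test lastmod from the command line before wiring it into a page by supplying the environment variable by hand. A sketch, assuming the compiled binary is in the current directory:

    % env SCRIPT_FILENAME=main.shtml ./lastmod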

Between server-side includes and CGI programs, you can start to see how really knowing the ins and outs of your Web server can reap great benefits in terms of creating a dramatic and smart Web site.


There's one more area to consider with Web server interaction before this lesson is over, however: analyzing traffic logs and extracting traffic data.

Task 23.4: Understanding Apache Log Files

Whether or not you choose to explore the capabilities of the Apache Web server for delivering custom pages through CGI programming or server-side includes, you'll undoubtedly want to dig through your Web site log files to see what has been going on. The good news is that you're running the best environment for accomplishing the task: UNIX.


Figure 23.4. Server-side includes enhance Web pages.


The wide set of tools and commands available within UNIX makes analyzing the otherwise cryptic and usually quite long log files a snap.

  1. On most shared Web servers, you have a file located in your home directory called httpd.log, httpd_access.log, or something similar. If you're lucky, the log file is actually broken down into its two components: errors and fulfilled requests.

    On my system, that's exactly how it works. I'm curious to learn more about the traffic that was seen on the www.trivial.net Web site recently.

    Once I move into the Trivial Net area on the Web server, I know that the traffic is all kept in a directory called logs. Within that directory, here are the files available:

    % ls -l
    total 19337
    -rw-r--r--   1 taylor   taylor    9850706 Sep  1 00:08 access_log
    -rw-r--r--   1 taylor   taylor    4608036 Sep  1 00:08 agent_log
    -rw-r--r--   1 taylor   taylor       5292 Sep  1 00:08 error_log
    -rw-r--r--   1 taylor   taylor    5253261 Sep  1 00:08 referer_log
    

    As you can see, not only are there an access_log and an error_log, which contain the fulfilled and failed access requests, respectively, but there are also two additional logs. The agent_log file records the Web browsers that visitors used when visiting the site, and referer_log records the page each user was viewing immediately before visiting the Trivial Net site.

Interested in computer trivia? Then Trivial Net is a great site to visit, at www.trivial.net. Even better, the site is built around both C- and Perl-based CGI programs, just as you've seen in this lesson.


  1. These are pretty huge files, as you can see. The access_log file is 9.8 megabytes! A quick invocation of the wc command and you can see the number of lines in each file instead:

    % wc -l *
     104883 access_log
     104881 agent_log
         52 error_log
     102047 referer_log
     311863 total
    

    The good news is that only 52 lines of error messages were logged during the month of use, against 104,883 “hits” to the site.

  2. Let's start by peeking at the error_log to see what exactly was going on:

    % head -10 error_log
    [Sat Aug  1 02:28:55 1998] lingering close lost connection to client
    hertzelia-204-51.access.net.il
    [Sat Aug  1 03:52:22 1998] access to /web/trivial.net/robots.txt
    failed for heavymetal.fireball.de, reason: File does not exist
    [Sat Aug  1 06:50:04 1998] lingering close lost connection to client
    po90.snet.ne.jp
    [Sun Aug  2 15:39:14 1998] access to /web/trivial.net/robots.txt
    failed for barn.farm.gol.net, reason: File does not exist
    [Sun Aug  2 17:16:34 1998] access to /web/trivial.net/robots.txt
    failed for nahuel.ufasta.com.ar, reason: File does not exist
    [Sun Aug  2 19:20:55 1998] access to /web/trivial.net/robots.txt
    failed for c605d216.infoseek.com, reason: File does not exist
    [Sun Aug  2 21:13:12 1998] access to /web/trivial.net/playgame.cgi'
    failed for saturn.tacorp.com, reason: File does not exist
    [Sun Aug  2 21:58:05 1998] access to /web/trivial.net/robots.txt
    failed for as016.cland.net, reason: File does not exist
    [Mon Aug  3 03:21:54 1998] access to /web/trivial.net/robots.txt
    failed for mail.bull-ingenierie.fr, reason: File does not exist
    [Mon Aug  3 08:58:26 1998] access to /web/trivial.net/robots.txt
    failed for nahuel.ufasta.com.ar, reason: File does not exist
    

    The information in the [] indicates the exact time and date of the failed request, and the last portion indicates the actual error encountered.

  3. Because we're using UNIX, it's easy to extract and count the failed access requests to ensure that there are no surprises in the other 42 lines of the file:

    % grep 'File does not exist' error_log | cut -d' ' -f9
    /web/trivial.net/robots.txt
    /web/trivial.net/robots.txt
    /web/trivial.net/robots.txt
    /web/trivial.net/robots.txt
    /web/trivial.net/playgame.cgi'
    /web/trivial.net/robots.txt
    /web/trivial.net/robots.txt
    /web/trivial.net/robots.txt
    /web/trivial.net/robots.txt
    /web/trivial.net/robots.txt
    /web/trivial.net/robots.txt
    /web/trivial.net/robots.txt
    /web/trivial.net/robots.txt
    /web/trivial.net/picasso
    /web/trivial.net/robots.txt
    /web/trivial.net/robots.txt
    /web/trivial.net/robots.txt
    /web/trivial.net/robots.txt
    

    By using the ever-helpful cut command to extract just the ninth word of each line, coupled with a grep command that keeps only the 'File does not exist' error messages, I've extracted just the list of bad filenames. Now for a standard pipeline for this kind of work: Sort the list, pipe it to the uniq program (the -c flag prefixes each unique line with a count of its occurrences), and pipe the result to another invocation of sort, this time a reverse numeric sort (-rn), to list the most commonly failed filenames first:

    % grep 'File does not exist' error_log | cut -d' ' -f9 | sort | uniq -c | sort -rn
         16 /web/trivial.net/robots.txt
          1 /web/trivial.net/playgame.cgi'
          1 /web/trivial.net/picasso
    

    You could easily make this an alias or drop it into a shell script to avoid having to type all this each time.
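
    Here's what such a script might look like; I'll call it badfiles.sh, though the name is arbitrary:

    #!/bin/sh
    # badfiles.sh: summarize "File does not exist" errors in an Apache
    # error_log, most frequently requested missing files first.
    log=${1:-error_log}
    grep 'File does not exist' "$log" | cut -d' ' -f9 | sort | uniq -c | sort -rn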

  4. Let's get to the more interesting log file, however, the access_log that contains all the successful traffic events:

    % head -10 access_log
    ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:01 -0700]
    "GET / HTTP/1.0" 200 2976
    ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:02 -0700]
    "GET /animated-banner.gif HTTP/1.0" 200 3131
    ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:02 -0700]
    "GET /play-the-game.gif HTTP/1.0" 200 1807
    ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:02 -0700]
    "GET /signup-now.gif HTTP/1.0" 200 2294
    ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:03 -0700]
    "GET /buy-the-book.gif HTTP/1.0" 200 1541
    ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:04 -0700]
    "GET /kudos.gif HTTP/1.0" 200 804
    ppp-206-170-29-29.wnck11.pacbell.net - - [01/Aug/1998:00:12:05 -0700]
    "GET /intsys.gif HTTP/1.0" 200 1272
    204.116.54.6 - - [01/Aug/1998:00:18:04 -0700] "GET /animated-banner.gif
    HTTP/1.0" 200 3131
    204.116.54.6 - - [01/Aug/1998:00:18:04 -0700] "GET /play-the-game.gif
    HTTP/1.0" 200 1807
    204.116.54.6 - - [01/Aug/1998:00:18:04 -0700] "GET /signup-now.gif
    HTTP/1.0" 200 2294
    

    You can see the basic layout of the information, though a bit of explanation will prove very useful. There are some fields you probably won't care about, so focus on the first (which indicates what domain the user came from), the fourth (the time and date of the access), the seventh (the requested file), and the tenth (the size, in bytes, of the resultant transfer).

    For example, the first line tells you that someone from pacbell.net requested the home page of the site (that's what the / means in the request) at 12 minutes after midnight on the 1st of August, and the server sent back 2,976 bytes of information (the file index.html, as it turns out).

  5. Every time an element of a Web page is accessed, a record of that event is added to the access_log file. To count the number of page views, therefore, simply use the -v (exclusion) feature of egrep (the regular expression brother of grep) to subtract the GIF and JPEG image references, and then count the remainder:

    % cat access_log | egrep -v '\.(gif|GIF|jpg|JPG|jpeg|JPEG)' | wc -l
      43211
    

    This means that of the 104,883 hits to the site (see the output from wc earlier in this lesson), all but 43,211 were requests for graphics. In the parlance of Web traffic experts, the site had 104,883 hits, of which 43,211 were actual page views.

  6. This is where the cut program is going to prove so critical, with its capability to extract a specific field or word of each line. First, let's quickly look at the specific files accessed and identify the top 15 requested files, whether they be graphics, HTML, or something else.

    % cat access_log | cut -d' ' -f7 | sort | uniq -c | sort -rn | head -15
      39040 /playgame.cgi
      13255 /intsys.gif
      12282 /banner.gif
       7002 /wrong.gif
       4464 /right.gif
       3998 /Adverts/computerbowl.gif
       3506 /
       2736 /play-the-game.gif
       2692 /animated-banner.gif
       2660 /signup-now.gif
       2652 /buy-the-book.gif
       2624 /kudos.gif
       1623 /final-score.gif
        621 /forget.gif
        411 /Results/3.gif
    

    Interesting results: The playgame.cgi program was by far the most commonly accessed file, with the Intuitive Systems graphic a distant second. More entertainingly, notice that visitors answered questions incorrectly more than one and a half times as often as they answered correctly (7,002 requests for wrong.gif versus 4,464 for right.gif).

  7. One more calculation. This time, I'm going to use another UNIX utility that is worth knowing: awk. In many ways, awk is a precursor to Perl that isn't as flexible or robust. However, in this case it's perfect for what we want: a summary of the total number of bytes transferred for the entire month. Because awk can easily slice a line into individual fields, this is a piece of cake:

    % cat access_log | awk '{ sum += $10 } END { print sum }'
    325805962
    %
    

    The quoted argument to awk is a tiny program that adds the value of field 10 to the variable sum for each line seen and, when it's done with all the lines, prints the value of the variable sum.

    To put this number in context, it's 325,805,962 bytes, or more than 300 megabytes of information in a month (roughly 10 megabytes a day).
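
    A small variation on the same awk program does the arithmetic for you and reports the total in megabytes directly (a sketch; this counts a megabyte as an even million bytes):

    % cat access_log | awk '{ sum += $10 } END { printf "%.1f megabytes\n", sum/1000000 }'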

The previous examples of using various UNIX tools to extract and analyze information contained in the Apache log files demonstrate not only that UNIX is powerful and capable, but also that knowing the right tools can make some complex tasks incredibly easy. In the last example, the awk program did all the work, and it took only a second or two to add up a column of numbers that would overwhelm even the most accomplished accountant.

