You don’t want to use MySQL to log accesses for just a few pages, as shown in Web Page Access Logging. You want to log all page accesses, and you don’t want to have to put logging code in each page explicitly.
The uses for MySQL in a web context aren’t limited just to page generation and processing. You can use it to help you run the web server itself. For example, most Apache servers are set up to log a record of page requests to a file. But it’s also possible to send log records to a program instead, from which you can write the records wherever you like—such as to a database. With log records in a database rather than a flat file, the log becomes more highly structured and you can apply SQL analysis techniques to it. Logfile analysis tools may be written to provide some flexibility, but often this is a matter of deciding which summaries to display and which to suppress. It’s more difficult to tell a tool to display information it wasn’t built to provide. With log entries in a table, you gain additional flexibility. Want to see a particular report? Write the SQL statements that produce it. To display the report in a specific format, issue the statements from within an API and take advantage of your language’s output production capabilities.
By handling log entry generation and storage using separate processes, you gain some additional flexibility. Some of the possibilities are to send logs from multiple web servers to the same MySQL server, or to send different logs generated by a given web server to different MySQL servers.
This recipe shows how to set up web request logging from Apache into MySQL and demonstrates some summary queries you may find useful.
Apache logging is controlled by directives in the httpd.conf configuration file. For
example, a typical logging setup uses LogFormat
and CustomLog
directives that look like
this:
LogFormat "%h %l %u %t "%r" %>s %b" common CustomLog /usr/local/apache/logs/access_log common
The
LogFormat
line
defines a format for log records and gives it the nickname common
. The CustomLog
directive indicates that lines
should be written in that format to the access_log file in Apache’s logs directory. To set up logging to
MySQL instead, use the following procedure:[21]
Decide what values you want to record and set up a table that contains the appropriate columns.
Write a program to read log lines from Apache and write them into the database.
Set up a LogFormat
line
that defines how to write log lines in the format the program
expects, and a CustomLog
directive that tells Apache to write to the program rather than
to a file.
Suppose that you want to record the date and time of each request, the host that issued the request, the request method and URL pathname, the status code, the number of bytes transferred, the referring page, and the user agent (typically a browser or spider name). A table that includes columns for these values can be created as follows:
CREATE TABLE httpdlog ( dt DATETIME NOT NULL, # request date host VARCHAR(255) NOT NULL, # client host method VARCHAR(4) NOT NULL, # request method (GET, PUT, etc.) url VARCHAR(255) # URL path CHARACTER SET latin1 COLLATE latin1_general_cs NOT NULL, status INT NOT NULL, # request status size INT, # number of bytes transferred referer VARCHAR(255), # referring page agent VARCHAR(255) # user agent );
Most of the string columns use VARCHAR
and are not case-sensitive. The
exception, url
, is declared with
a case-sensitive collation as is appropriate for a server running on
a system with case-sensitive filenames. See Web Page Access Counting for notes on choosing
the character set and collation for the path
column.
The httpdlog
table
definition shown here doesn’t include any indexes. If you plan to
run summary queries, you should add appropriate indexes to the
table. Otherwise, the summaries will slow down dramatically as the
table becomes large. The choice of which columns to index will be
based on the types of statements you intend to run to analyze the
table contents. For example, statements to analyze the distribution
of client host values will benefit from an index on the host
column.
Next, you need a program to process log lines produced by
Apache and insert them into the httpdlog
table. The following script,
httpdlog.pl, opens a connection
to the MySQL server, and then loops to read input lines. It parses
each line into column values and inserts the result into the
database. When Apache exits, it closes the pipe to the logging
program. That causes httpdlog.pl
to see end of file on its input, terminate the loop, disconnect from
MySQL, and exit.
#!/usr/bin/perl # httpdlog.pl - Log Apache requests to httpdlog table # path to directory containing Cookbook.pm (CHANGE AS NECESSARY) use lib qw(/usr/local/lib/mcb); use strict; use warnings; use Cookbook; my $dbh = Cookbook::connect (); my $sth = $dbh->prepare (qq{ INSERT DELAYED INTO httpdlog (dt,host,method,url,status,size,referer,agent) VALUES (?,?,?,?,?,?,?,?) }); while (<>) # loop reading input { chomp; my ($dt, $host, $method, $url, $status, $size, $refer, $agent) = split (/ /, $_); # map "-" to NULL for some columns $size = undef if $size eq "-"; $agent = undef if $agent eq "-"; $sth->execute ($dt, $host, $method, $url, $status, $size, $refer, $agent); } $dbh->disconnect ();
Install the httpdlog.pl
script where you want Apache to look for it. On my system, the
Apache root directory is /usr/local/apache, so /usr/local/apache/bin is a reasonable
installation directory. The path to this directory will be needed
shortly for constructing the CustomLog
directive that instructs Apache
to log to the script.
The purpose of including the use
lib
line is so that Perl can find the Cookbook.pm module. This line will be
necessary if the environment of scripts invoked by Apache for
logging does not enable Perl to find the module. Change the path as
necessary for your system.
The script uses INSERT
DELAYED
rather than INSERT
. The advantage of using DELAYED
is that the MySQL server buffers
the row in memory and then later inserts a batch of rows at a time,
which is more efficient. This also enables the client to continue
immediately rather than having to wait if the table is busy. The
disadvantage is that if the MySQL server crashes, any rows buffered
in memory at the time are lost. I figure that this is not very
likely, and that the loss of a few log records is not a serious
problem. If you disagree, just remove DELAYED
from the statement.
httpdlog.pl assumes that
input lines contain httpdlog
column values delimited by tabs (to make it easy to break apart
input lines), so Apache must write log entries in a matching format.
The LogFormat
field specifiers to
produce the appropriate values are shown in the following
table.
Specifier | Meaning |
---|---|
%{%Y-%m-%d %H:%M:%S}t
| The date and time of the
request, in MySQL’s DATETIME format |
%h
| The host from which the request originated |
%m
| The request method
(get , post , and so forth) |
%U
| The URL path |
%>s
| The status code |
%b
| The number of bytes transferred |
%{Referer}i
| The referring page |
%{User-Agent}i
| The user agent |
To define a logging format named mysql
that produces these values with tabs
in between, add the following LogFormat
directive to your httpd.conf file:
LogFormat "%{%Y-%m-%d %H:%M:%S}t %h %m %U %>s %b %{Referer}i %{User-Agent}i" mysql
Most of the pieces are in place now. We have a log table, a
program that writes to it, and a mysql
format for producing log entries.
All that remains is to tell Apache to write the entries to the
httpdlog.pl script. However,
until you know that the output format really is correct and that the
program can process log entries properly, it’s premature to tell
Apache to log directly to the program. To make testing and debugging
a bit easier, have Apache log mysql
-format entries to a file instead.
That way, you can look at the file to check the output format, and
you can use it as input to httpdlog.pl to verify that the program
works correctly. To instruct Apache to log lines in mysql
format to the file test_log in Apache’s log directory, use
this CustomLog
directive:
CustomLog /usr/local/apache/logs/test_log mysql
Then restart Apache to enable the new logging directives. After your web server receives a few requests, take a look at the test_log file. Verify that the contents are as you expect, and then feed the file to httpdlog.pl:
%/usr/local/apache/bin/httpdlog.pl test_log
After httpdlog.pl finishes,
take a look at the httpdlog
table
to make sure that it looks correct. Once you’re satisfied, tell
Apache to send log entries directly to httpdlog.pl by modifying the CustomLog
directive as follows:
CustomLog "|/usr/local/apache/bin/httpdlog.pl" mysql
The |
character at the
beginning of the pathname tells Apache that httpdlog.pl is a program, not a file.
Restart Apache and new entries should appear in the httpdlog
table as visitors request pages
from your site.
Nothing you have done to this point changes any logging you
may have been doing originally. For example, if you were logging to
an access_log file before, you
still are now. Thus, Apache will be sending entries both to the
original logfile and to MySQL. If that’s what you want, fine. Apache
doesn’t care if you log to multiple destinations. But you’ll use
more disk space if you do. To disable file logging, comment out your
original CustomLog
directive by
placing a #
character in front of
it, and then restart Apache.
Now that you have Apache logging into the database, what can you do with the information? That depends on what you want to know. Here are some examples that show the kinds of questions you can use MySQL to answer easily:
How many rows are in the request log?
SELECT COUNT(*) FROM httpdlog;
How many different client hosts have sent requests?
SELECT COUNT(DISTINCT host) FROM httpdlog;
How many different pages have clients requested?
SELECT COUNT(DISTINCT url) FROM httpdlog;
What are the 10 most popular pages?
SELECT url, COUNT(*) AS count FROM httpdlog GROUP BY url ORDER BY count DESC LIMIT 10;
How many requests have been received for those favicon.ico files that certain browsers like to check for?
SELECT COUNT(*) FROM httpdlog WHERE url LIKE '%/favicon.ico%';
What is the range of dates spanned by the log?
SELECT MIN(dt), MAX(dt) FROM httpdlog;
How many requests have been received each day?
SELECT DATE(dt) AS day, COUNT(*) FROM httpdlog GROUP BY day;
Answering this question requires stripping off the
time-of-day part from the dt
values so that requests received on a given date can be grouped.
The statement does this using the DATE()
function to convert
DATETIME
values to DATE
values. However, if you intend to
run a lot of statements that use just the date part of the
dt
values, it would be more
efficient to create the httpdlog
table with separate DATE
and TIME
columns, change the LogFormat
directive to produce the
date and time as separate output values, and modify httpdlog.pl accordingly. Then you can
operate on the request dates directly without stripping off the
time, and you can index the date column for even better
performance.
What is the hour-of-the-day request histogram?
SELECT HOUR(dt) AS hour, COUNT(*) FROM httpdlog GROUP BY hour;
What is the average number of requests received each day?
SELECT COUNT(*)/(DATEDIFF(MAX(dt), MIN(dt)) + 1) FROM httpdlog;
The numerator is the total number of requests in the table. The denominator is the number of days spanned by the records.
What is the longest URL recorded in the table?
SELECT MAX(LENGTH(url)) FROM httpdlog;
If the url
column is
defined as VARCHAR(255)
and
this statement produces a value of 255, it’s likely that some
URL values were too long to fit in the column and were truncated
at the end. To avoid this, change the column definition to allow
more characters. For example, to allow up to 5,000 characters,
modify the url
column as
follows:
ALTER TABLE httpdlog MODIFY url VARCHAR(5000) CHARACTER SET latin1 COLLATE latin1_general_cs NOT NULL;
What is the total number of bytes served and the average bytes per request?
SELECT COUNT(size) AS requests, SUM(size) AS bytes, AVG(size) AS 'bytes/request' FROM httpdlog;
The statement uses COUNT(size)
rather than COUNT(*)
to count only those requests
with a non-NULL
size
value. If a client requests a
page twice, the server may respond to the second request by
sending a header indicating that the page hasn’t changed rather
than by sending content. In this case, the log entry for the
request will have NULL
in the
size
column.
How much traffic has there been for each kind of file (based on filename extension such as .html, .jpg, or .php)?
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(url,'?',1),'.',-1) AS extension, COUNT(size) AS requests, SUM(size) AS bytes, AVG(size) AS 'bytes/request' FROM httpdlog WHERE url LIKE '%.%' GROUP BY extension;
The WHERE
clause
selects only url
values that
have a period in them, to eliminate pathnames that refer to
files that have no extension. To extract the extension values
for the output column list, the inner SUBSTRING_INDEX()
call strips
off any parameter string at the right end of the URL and leaves
the rest. (This turns a value like /cgi-bin/script.pl?id=43
into
/cgi-bin/script.pl
. If the
value has no parameter part, SUBSTRING_INDEX()
returns the
entire string.) The outer SUBSTRING_INDEX()
call strips
everything up to and including the rightmost period from the
result, leaving only the extension.
The preceding discussion shows a simple method for hooking
Apache to MySQL, which involves writing a short script that
communicates with MySQL and then telling Apache to write to the
script rather than to a file. This works well if you log all
requests to a single file, but certainly won’t be appropriate for
every possible configuration that Apache is capable of. For example,
if you have virtual servers defined in your httpd.conf file, you might have separate
CustomLog
directives defined for
each of them. To log them all to MySQL, you can change each
directive to write to httpdlog.pl, but then you’ll have a
separate logging process running for each virtual server. That
brings up two issues:
How do you associate log records with the proper virtual
server? One solution is to create a separate log table for each
server and modify httpdlog.pl
to take an argument that indicates which table to use. Another
is to use a table that has a vhost
column, an Apache log format
that includes the %v
virtual
host format specifier, and a logging script that uses the
vhost
value when it generates
INSERT
statements. The
apache/httpdlog directory
of the recipes
distribution
contains information about doing this.
Do you really want a lot of httpdlog.pl processes running? If you have many virtual servers, you may want to consider using a logging module that installs directly into Apache. Some of these can multiplex logging for multiple virtual hosts through a single connection to the database server, reducing resource consumption for logging activity.
Logging to a database rather than to a file enables you to bring the full power of MySQL to bear on log analysis, but it doesn’t eliminate the need to think about space management. Web servers can generate a lot of activity, and log records use space regardless of whether you write them to a file or to a database. One way to save space is to expire records now and then. For example, to remove log records that are more than a year old, run the following statement periodically:
DELETE FROM httpdlog WHERE dt < NOW() - INTERVAL 1 YEAR;
If you have MySQL 5.1 or higher, you can set up an event that
runs the DELETE
statement on a
scheduled basis (Using Events to Schedule Database Actions).
With respect to disk space consumed by web logging activity,
be aware that if you have query logging enabled for the MySQL
server, each request will be written to the httpdlog
table and also to the query
logfile. Thus, you may find disk space disappearing more quickly
than you expect, so it’s a good idea to have some kind of logfile
rotation or expiration set up for the MySQL server.
[21] If you’re using logging directives such as TransferLog
rather than LogFormat
and CustomLog
, you’ll need to adapt the
instructions in this section.
18.188.135.58