Implement a hit counter, keyed to the page you want to count. This can be used to display a counter in the page. The same technique can be used to record other types of information as well, such as the number of times each of a set of banner ads has been served.
This recipe discusses access counting, using hit counters for the examples. Counters that display the number of times a web page has been accessed are not such a big thing as they used to be, presumably because page authors now realize that most visitors don’t really care how popular a page is. Still, the general concept has application in several contexts. For example, if you’re displaying banner ads in your pages (Recipe 18.8), you may be charging vendors by the number of times you serve their ads. To do so, you need to count the number of accesses for each one. You can adapt the technique shown in this section for such purposes.
There are several methods for writing a page that displays a count of the number of times it has been accessed. The most basic is to maintain the count in a file. When the page is requested, open the file, read the count, increment it, and write the new count back to the file and display it in the page. This has the advantage of being easy to implement and the disadvantage that it requires a counter file for each page that includes a hit count. It also doesn’t work properly if two clients access the page at the same time, unless you implement some kind of locking protocol in the file access procedure. It’s possible to reduce counter file litter by keeping multiple counts in a single file, but that makes it more difficult to access particular values within the file, and it doesn’t solve the simultaneous-access problem. In fact, it makes it worse, because a multiple-counter file has a higher likelihood of being accessed by multiple clients simultaneously than does a single-counter file. So you end up implementing storage and retrieval methods for processing the file contents, and locking protocols to keep multiple processes from interfering with each other. Hmm... those sound suspiciously like the problems that a database management system such as MySQL already takes care of! Keeping the counts in the database centralizes them into a single table, SQL provides the storage and retrieval interface, and the locking problem goes away because MySQL serializes access to the table so that clients can’t interfere with each other. Furthermore, depending on how you manage the counters, you might be able to update the counter and retrieve the new sequence value using a single statement.
I’ll assume that you want to log hits for more than one page. To
do that, create a table that has one row for each page to be counted.
This means that it’s necessary to have a unique identifier for each
page, so that counters for different pages don’t get mixed up. You
could assign identifiers somehow, but it’s easier just to use the
page’s path within your web tree. Web programming languages typically
make this path easy to obtain; in fact, we’ve already discussed how to
do so in Writing Scripts That Generate Web Forms. On that
basis, you can create a hitcount
table as follows:
CREATE TABLE hitcount ( path VARCHAR(255) CHARACTER SET latin1 COLLATE latin1_general_cs NOT NULL, hits BIGINT UNSIGNED NOT NULL, PRIMARY KEY (path) );
This table definition involves some assumptions:
The path
column that
stores page pathnames has a character set of latin1
and a case-sensitive collation of
latin1_general_cs
. Use of a
case-sensitive collation is appropriate for a web platform where
pathnames are case-sensitive, such as most versions of Unix. For
Windows or for HFS+ filesystems under Mac OS X, filenames are not
case-sensitive, so you would choose a collation that is not
case-sensitive, such as latin1_swedish_ci
. If your filesystem is
set up to use pathnames in a different character set, you should
change the character set and collation.
The path
column has a
maximum length of 255 characters, which limits you to page paths
no longer than that.
The path
column is
indexed as a PRIMARY
KEY
to require unique values. Either a
PRIMARY
KEY
or UNIQUE
index is required because we will
implement the hit-counting algorithm using an INSERT
statement with an ON
DUPLICATE
KEY
UPDATE
clause to insert a row if none
exists for the page or update the row if it does exist. (Using Sequence Generators as Counters provides background that
further explains ON
DUPLICATE
KEY
UPDATE
.)
The table is set up to count page hits for a single document
tree, such as when your web server is used to serve pages for a
single domain. If you institute a hit count mechanism on a host
that serves multiple virtual domains, you may want to add a column
for the domain name. This value is available in the SERVER_NAME
value that Apache puts into
your script’s environment. In this case, the hitcount
table index should include both
the hostname and the page path.
The general logic involved in hit counter maintenance is to
increment the hits
column of the
row for a page, and then retrieve the updated counter value:
UPDATE hitcount SET hits = hits + 1 WHERE path = 'page path
'; SELECT hits FROM hitcount WHERE path = 'page path
';
Unfortunately, if you use that approach, you may often not get
the correct value. If several clients request the same page
simultaneously, several UPDATE
statements may be issued in close temporal proximity. The following
SELECT
statements then wouldn’t
necessarily get the corresponding hits
value. This can be avoided by using a
transaction or by locking the hitcount
table, but that slows down hit
counting. MySQL provides a solution that enables each client to
retrieve its own count, no matter how many updates happen at the same
time:
UPDATE hitcount SET hits = LAST_INSERT_ID(hits+1) WHERE path = 'page path
';
SELECT LAST_INSERT_ID();
The basis for updating the count here is LAST_INSERT_ID(
expr
)
, which is discussed in Using Sequence Generators as Counters. The UPDATE
statement finds the relevant row and
increments its counter value. The use of LAST_INSERT_ID(hits+1)
rather than just
hits+1
tells MySQL to treat the
value as though it were an AUTO_INCREMENT
value. This allows it to be
retrieved in the second statement using LAST_INSERT_ID()
. The LAST_INSERT_ID()
function returns a
connection-specific value, so you always get back the value
corresponding to the UPDATE
issued
on the same connection. In addition, the SELECT
statement doesn’t need to query a
table, so it’s very fast.
However, there’s still a problem here. What if the page isn’t
listed in the hitcount
table? In
that case, the UPDATE
statement
finds no row to modify and you get a counter value of zero. You could
deal with this problem by requiring that any page that includes a hit
counter must be registered in the hitcount
table before the page goes online.
An easier approach is to use MySQL’s INSERT
... ON
DUPLICATE
KEY
UPDATE
syntax, which inserts a row with a
count of 1 if it does not exist, and updates its counter if it does
exist:
INSERT INTO hitcount (path,hits) VALUES('some path',LAST_INSERT_ID(1)) ON DUPLICATE KEY UPDATE hits = LAST_INSERT_ID(hits+1);
The counter value then can be retrieved as the value of
the
LAST_INSERT_ID()
function:
SELECT LAST_INSERT_ID();
The first time you request a count for a page, the statement
inserts a row because the page won’t be listed in the table yet. The
statement creates a new counter and returns a value of one. For each
request thereafter, the statement updates the existing row for the
page with the new count. That way, web page designers can include
counters in pages with no advance preparation required to initialize
the hitcount
table with a row for
the page.
A further efficiency can be gained by eliminating the SELECT
statement altogether, which is possible if your API provides a
means for direct retrieval of the most recent sequence number. For
example, in Perl, you can update the count and get the new value with
only one SQL statement like this:
$dbh->do ("INSERT INTO hitcount (path,hits) VALUES(?,LAST_INSERT_ID(1)) ON DUPLICATE KEY UPDATE hits = LAST_INSERT_ID(hits+1)", undef, $page_path); $count = $dbh->{mysql_insertid};
To make the counter mechanism easier to use, put the code in a utility function that takes a page path as an argument and returns the count. In Perl, a hit-counting function might look like this, in which the arguments are a database handle and the page path:
sub get_hit_count { my ($dbh, $page_path) = @_; my $count; $dbh->do ("INSERT INTO hitcount (path,hits) VALUES(?,LAST_INSERT_ID(1)) ON DUPLICATE KEY UPDATE hits = LAST_INSERT_ID(hits+1)", undef, $page_path); $count = $dbh->{mysql_insertid}; return $count }
The CGI.pm script_name()
function returns the local part of the URL, so you use
get_hit_count()
like
this:
my $count = get_hit_count ($dbh, script_name ()); print p ("This page has been accessed $count times.");
The counter-update mechanism involves a single SQL statement, so it is unnecessary to use transactions or explicit table locking to prevent race conditions that might result if multiple clients simultaneously request a page.
A Ruby version of the hit counter looks like this:
def get_hit_count(dbh, page_path) dbh.do("INSERT INTO hitcount (path,hits) VALUES(?,LAST_INSERT_ID(1)) ON DUPLICATE KEY UPDATE hits = LAST_INSERT_ID(hits+1)", page_path) return dbh.func(:insert_id) end
Use the counter method as follows:
self_path = ENV["SCRIPT_NAME"] count = get_hit_count(dbh, self_path) page << cgi.p { "This page has been accessed " + count.to_s + " times." }
In Python, the counting function looks like this:
def get_hit_count (conn, page_path): cursor = conn.cursor () cursor.execute (""" INSERT INTO hitcount (path,hits) VALUES(%s,LAST_INSERT_ID(1)) ON DUPLICATE KEY UPDATE hits = LAST_INSERT_ID(hits+1) """, (page_path,)) cursor.close () return (conn.insert_id ())
Use the function as follows:
self_path = os.environ["SCRIPT_NAME"] count = get_hit_count (conn, self_path) print "<p>This page has been accessed %d times.</p>" % count
The recipes
distribution
includes demonstration hit counter scripts for Perl, Ruby, PHP, and
Python under the apache/hits
directory. A JSP version is under the tomcat/mcb directory. Install any of these
in your web tree, invoke it from your browser a few times, and watch
the count increase. First, you’ll need to create the hitcount
table, as well as the hitlog
table described in Web Page Access Logging. (The hit-counting scripts
show a count and also a log of the most recent hits. Web Page Access Logging discusses the logging
mechanism.) Both tables can be created using the hits.sql script provided in the tables directory.
52.14.240.252