So, I went and got the following little snippet of code off of the Internet that will parse an Apache access log line into a bunch of fields:
import re

format_pat = re.compile(
    r"(?P<host>[\d.]+)\s"
    r"(?P<identity>\S*)\s"
    r"(?P<user>\S*)\s"
    r"\[(?P<time>.*?)\]\s"
    r'"(?P<request>.*?)"\s'
    r"(?P<status>\d+)\s"
    r"(?P<bytes>\S*)\s"
    r'"(?P<referer>.*?)"\s'
    r'"(?P<user_agent>.*?)"\s*'
)
This pattern captures fields like the host, the user, the time, the actual page request, the status, the referrer, and the user_agent (meaning which browser was actually used to view the page). It builds up what's called a regular expression, and we're using the re library to compile it. That's basically a very powerful language for doing pattern matching on a large string. So, we can apply this regular expression to each line of our access log, and automatically group the bits of information in each line into these different fields. Let's go ahead and run this.
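Before we do, a quick sanity check on what the pattern gives us. Here's what matching a single made-up log line might look like (the sample line and its values below are purely illustrative, not from our actual log):

sample = '54.165.199.171 - - [29/Nov/2017:13:52:59 +0000] "GET /robots.txt HTTP/1.1" 200 55 "-" "Mozilla/5.0"'
m = format_pat.match(sample)
if m:
    # groupdict() hands back each named field from the pattern
    print(m.groupdict()['request'])   # GET /robots.txt HTTP/1.1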
The obvious thing to do here is to just whip up a little script that keeps track of every URL that gets requested, and counts how many times each one was requested. Then we can sort that list and get our top pages, right? Sounds simple enough!
So, we're going to construct a little Python dictionary called URLCounts. We're going to open up our log file, and for each line, we're going to apply our regular expression. If it actually comes back with a successful match, we'll say, "Okay, this looks like a decent line in our access log."
Let's extract the request field out of it, which is the actual HTTP request: the page being requested by the browser. We're going to split that up into its three components: an action, like GET or POST; the actual URL being requested; and the protocol being used. With that information split out, we can then just see if that URL already exists in our dictionary. If so, we'll increment the count of how many times that URL has been encountered by 1; otherwise, we'll introduce a new dictionary entry for that URL and initialize it to the value of 1. We do that for every line in the log, sort the results in reverse numerical order, and print them out:
URLCounts = {}

with open(logPath, "r") as f:
    for line in (l.rstrip() for l in f):
        match = format_pat.match(line)
        if match:
            access = match.groupdict()
            request = access['request']
            (action, URL, protocol) = request.split()
            if URL in URLCounts:
                URLCounts[URL] = URLCounts[URL] + 1
            else:
                URLCounts[URL] = 1

results = sorted(URLCounts, key=lambda i: int(URLCounts[i]), reverse=True)

for result in results[:20]:
    print(result + ": " + str(URLCounts[result]))
So, let's go ahead and run that:
Oops! We end up with this big old error here. It's telling us that there aren't enough values to unpack: the split was supposed to give us three fields, but it only got one. So apparently, we're getting some request fields that don't contain an action, a URL, and a protocol; they contain something else.
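Just to make the mechanism concrete: if a request field splits into fewer than three pieces, say it's an empty string (a purely hypothetical value at this point), the unpacking has nothing to assign:

request = ""                          # hypothetical malformed request field
fields = request.split()
print(fields)                         # [] - fewer than three pieces
(action, URL, protocol) = fields      # raises ValueError: not enough values to unpack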
Let's see what's going on there! If we print out all the request fields that don't contain three items, we'll see what's actually showing up. So, what we're going to do here is a similar little snippet of code, but instead of unpacking, we'll do that split on the request field and print out the cases where we don't get the expected three fields.
URLCounts = {}

with open(logPath, "r") as f:
    for line in (l.rstrip() for l in f):
        match = format_pat.match(line)
        if match:
            access = match.groupdict()
            request = access['request']
            fields = request.split()
            if (len(fields) != 3):
                print(fields)
Let's see what's actually in there:
So, we have a bunch of empty fields. That's our first problem. But then we have this first field that's just full of garbage. Who knows where that came from, but it's clearly erroneous data. Okay, fine, let's modify our script.
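The shape of that modification is simple enough: before unpacking, check that the request actually split into three fields, and skip it otherwise. Here's a sketch of just that guard (count_request is a hypothetical helper used for illustration, not the actual script we'll end up with):

def count_request(request, URLCounts):
    # Hypothetical helper: only count requests that split cleanly into
    # an action, a URL, and a protocol; silently skip anything else.
    fields = request.split()
    if len(fields) == 3:
        (action, URL, protocol) = fields
        URLCounts[URL] = URLCounts.get(URL, 0) + 1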