I can also apply some knowledge about my site: I happen to know that all the legitimate pages on my site end with a slash in their URL. So, let's go ahead and modify this again, to strip out anything that doesn't end with a slash:
URLCounts = {}

with open(logPath, "r") as f:
    for line in (l.rstrip() for l in f):
        match = format_pat.match(line)
        if match:
            access = match.groupdict()
            agent = access['user_agent']
            if (not('bot' in agent or 'spider' in agent or
                    'Bot' in agent or 'Spider' in agent or
                    'W3 Total Cache' in agent or agent == '-')):
                request = access['request']
                fields = request.split()
                if (len(fields) == 3):
                    (action, URL, protocol) = fields
                    if (URL.endswith("/")):
                        if (action == 'GET'):
                            if URL in URLCounts:
                                URLCounts[URL] = URLCounts[URL] + 1
                            else:
                                URLCounts[URL] = 1

results = sorted(URLCounts, key=lambda i: int(URLCounts[i]), reverse=True)

for result in results[:20]:
    print(result + ": " + str(URLCounts[result]))
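As a quick aside, the increment-or-initialize pattern in that innermost if/else is exactly what Python's collections.Counter does for you, and its most_common() method can replace the sorted() call as well. Here's a minimal, standalone sketch of that alternative, with a few made-up URLs standing in for the parsed log entries:

from collections import Counter

# Counter does the "increment or initialize" bookkeeping automatically,
# so the innermost if/else above collapses to a single line.
URLCounts = Counter()
for URL in ["/", "/orlando-headlines/", "/", "/world/"]:  # stand-in URLs for illustration
    URLCounts[URL] += 1

# most_common() replaces the sorted(...) call for getting the top pages:
for URL, count in URLCounts.most_common(20):
    print(URL + ": " + str(count))

Either way works; the explicit if/else just makes the counting logic more obvious when you're walking through it step by step.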
Let's run that!
Finally, we're getting some results that seem to make sense! So, it looks like the top page requested by actual human beings on my little No-Hate News site is the homepage, followed by orlando-headlines, followed by world news, followed by the comics, then the weather, and the about screen. So, this is starting to look more legitimate.
If you were to dig even deeper, though, you'd see that there are still problems with this analysis. For example, those feed pages are still coming from robots just trying to get RSS data from my website. So, this is a great parable of how a seemingly simple analysis requires a huge amount of pre-processing and cleaning of the source data before you get results that make any sense.
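If you wanted to keep going, one natural next cut would be to also exclude those feed URLs from the count. The sketch below assumes the RSS endpoints on the site all contain 'feed' somewhere in their path, which is a guess you'd want to verify against the actual log lines before trusting it:

def is_human_page_request(action, URL):
    # Same filters as the loop above (GET requests for URLs ending in a slash),
    # plus one more: skip anything containing 'feed', on the assumption that
    # those hits come from RSS readers and crawlers rather than people.
    return (action == 'GET'
            and URL.endswith("/")
            and 'feed' not in URL)

You would swap this check in for the URL.endswith("/") and action == 'GET' tests in the loop above and re-run the count to see how much of that traffic goes away.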
Again, make sure the things you're doing to clean your data along the way are principled, and that you're not just cherry-picking away the problems that don't match your preconceived notions. So, always question your results, always look at your source data, and look for weird things that are in it.