Filtering the activity of spiders/robots

Alright, so this gets a little bit tricky. There's no really good way of identifying spiders or robots based on the user agent string alone. But we can at least take a legitimate crack at it and filter out anything that has the word "bot" or "spider" in it, or anything from my caching plugin that might be requesting pages in advance as well. We'll also strip out our friend the single dash. So, we will once again refine our script to, in addition to everything else, strip out any user agents that look fishy:

URLCounts = {}

with open(logPath, "r") as f:
    for line in (l.rstrip() for l in f):
        match = format_pat.match(line)
        if match:
            access = match.groupdict()
            agent = access['user_agent']
            # Skip bots, spiders, the caching plugin, and the bare dash
            if (not('bot' in agent or 'spider' in agent or
                    'Bot' in agent or 'Spider' in agent or
                    'W3 Total Cache' in agent or agent == '-')):
                request = access['request']
                fields = request.split()
                if (len(fields) == 3):
                    (action, URL, protocol) = fields
                    if (action == 'GET'):
                        # Count the hit, starting from 0 for a URL not seen before
                        URLCounts[URL] = URLCounts.get(URL, 0) + 1

# URLs ordered from most to least requested
results = sorted(URLCounts, key=lambda i: int(URLCounts[i]), reverse=True)

for result in results[:20]:
    print(result + ": " + str(URLCounts[result]))

# Same filter as above, additionally keeping only URLs that end with "/"
URLCounts = {}

with open(logPath, "r") as f:
    for line in (l.rstrip() for l in f):
        match = format_pat.match(line)
        if match:
            access = match.groupdict()
            agent = access['user_agent']
            # Skip bots, spiders, the caching plugin, and the bare dash
            if (not('bot' in agent or 'spider' in agent or
                    'Bot' in agent or 'Spider' in agent or
                    'W3 Total Cache' in agent or agent == '-')):
                request = access['request']
                fields = request.split()
                if (len(fields) == 3):
                    (action, URL, protocol) = fields
                    if (URL.endswith("/")):
                        if (action == 'GET'):
                            # Count the hit, starting from 0 for a URL not seen before
                            URLCounts[URL] = URLCounts.get(URL, 0) + 1

# URLs ordered from most to least requested
results = sorted(URLCounts, key=lambda i: int(URLCounts[i]), reverse=True)

for result in results[:20]:
    print(result + ": " + str(URLCounts[result]))

What do we get?

Alright, so here we go! This is starting to look more reasonable for the first two entries: the homepage is most popular, which would be expected. Orlando headlines is also popular, because I use this website more than anybody else, and I live in Orlando. But after that, we get a bunch of stuff that isn't web pages at all: a bunch of scripts and a bunch of CSS files.
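
If you want to try the user agent filter without the real log file, here is a minimal, self-contained sketch. Everything in it that is not in the script above is an assumption made for illustration: the LOG_PATTERN regular expression (a guess at the Apache combined log format, which may differ from the format_pat defined earlier), the helper names looks_automated and count_human_gets, and the made-up sample lines. It also compares in lowercase, which is slightly broader than the exact 'bot'/'Bot' checks in the script.

```python
import re
from collections import Counter

# ASSUMPTION: a pattern for the Apache "combined" log format. The format_pat
# used in the script above is defined earlier and may differ from this one.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

# Substrings treated as signs of automated traffic, compared in lowercase.
BOT_MARKERS = ("bot", "spider", "w3 total cache")


def looks_automated(agent):
    """Return True for a bare dash or an agent containing a marker substring."""
    lowered = agent.lower()
    return agent == "-" or any(marker in lowered for marker in BOT_MARKERS)


def count_human_gets(lines):
    """Count GET requests per URL, skipping agents that look automated."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.rstrip())
        if not match:
            continue
        access = match.groupdict()
        if looks_automated(access["user_agent"]):
            continue
        fields = access["request"].split()
        if len(fields) == 3 and fields[0] == "GET":
            counts[fields[1]] += 1
    return counts


# Made-up sample lines (hypothetical hosts, times, and agents).
SAMPLE_LINES = [
    '10.0.0.1 - - [29/Nov/2015:03:50:05 +0000] "GET / HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (Windows NT 6.1)"',
    '10.0.0.2 - - [29/Nov/2015:03:50:06 +0000] "GET / HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '10.0.0.3 - - [29/Nov/2015:03:50:07 +0000] "GET /orlando-headlines/ HTTP/1.1" 200 999 "-" "-"',
    '10.0.0.4 - - [29/Nov/2015:03:50:08 +0000] "GET /style.css HTTP/1.1" 200 50 "-" "Mozilla/5.0 (Macintosh)"',
    '10.0.0.5 - - [29/Nov/2015:03:50:09 +0000] "POST /xmlrpc.php HTTP/1.1" 200 50 "-" "Mozilla/5.0 (Macintosh)"',
    '10.0.0.6 - - [29/Nov/2015:03:50:10 +0000] "GET / HTTP/1.1" 200 1234 "-" "W3 Total Cache/0.9.4"',
    '10.0.0.7 - - [29/Nov/2015:03:50:11 +0000] "GET / HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

url_counts = count_human_gets(SAMPLE_LINES)
for url, hits in url_counts.most_common(20):
    print(url + ": " + str(hits))
```

Even on this tiny sample, the stylesheet request survives the filter, because it comes from an ordinary browser user agent. Filtering on user agent alone does nothing about requests for scripts and CSS files.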
