We'll simply throw out any lines that don't have the expected three fields in the request. That seems like a legitimate thing to do, because such lines contain completely useless data, so we're not missing out on anything by dropping them. So, we'll modify our script to do that. We've introduced an if (len(fields) == 3) check before the script actually tries to process the request. We'll run that:
URLCounts = {}
with open(logPath, "r") as f:
    for line in (l.rstrip() for l in f):
        match = format_pat.match(line)
        if match:
            access = match.groupdict()
            request = access['request']
            fields = request.split()
            # Only process requests with all three fields: action, URL, protocol
            if len(fields) == 3:
                URL = fields[1]
                if URL in URLCounts:
                    URLCounts[URL] = URLCounts[URL] + 1
                else:
                    URLCounts[URL] = 1

# Sort the URLs by hit count, highest first, and print the top 20
results = sorted(URLCounts, key=lambda i: int(URLCounts[i]), reverse=True)
for result in results[:20]:
    print(result + ": " + str(URLCounts[result]))
Hey, we got a result!
But this doesn't really look like the top pages on my website. Remember, this is a news site. Instead, we're getting a bunch of hits on PHP files, which are scripts, not articles. What's going on there? Our top result is this xmlrpc.php script, and then wp-login.php, followed by the homepage. So, not very useful. Then there is robots.txt, then a bunch of XML files.
When I looked into this later on, it turned out that my site was actually under a malicious attack; someone was trying to break into it. The xmlrpc.php script was the way they were trying to guess at my passwords, and they were also trying to log in using the login script. Fortunately, I shut them down before they could actually get through to this website.
This was an example of malicious data being introduced into my data stream that I have to filter out. Looking more closely, we can see that the attack wasn't only hitting PHP files, it was also trying to execute stuff. It wasn't just doing a GET request, it was doing a POST request on the script to actually try to execute code on my website.
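One way to see that kind of thing for yourself is to tally requests by their action, which is the first of the three fields. Here's a minimal sketch of that idea. The count_by_method function and the sample request strings are made up for illustration; in the real script, each request would come from access['request'] as in the loop we just ran.

```python
from collections import Counter

def count_by_method(requests):
    """Tally well-formed request strings by HTTP action (the first field)."""
    counts = Counter()
    for request in requests:
        fields = request.split()
        # Same sanity check as before: action, URL, protocol
        if len(fields) == 3:
            counts[fields[0]] += 1
    return counts

# Hypothetical sample request strings, not taken from the real log
sample_requests = [
    "POST /xmlrpc.php HTTP/1.0",
    "POST /xmlrpc.php HTTP/1.0",
    "GET / HTTP/1.1",
    "GET /robots.txt HTTP/1.1",
    "-",  # malformed entry, dropped by the three-field check
]

method_counts = count_by_method(sample_requests)
print(method_counts)
```

If a large share of the hits on a script turn out to be POST requests, that's a hint that something other than ordinary page views is going on.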