Modification two - filtering post requests

Now, I know that the data that I care about, you know in the spirit of the thing I'm trying to figure out is, people getting web pages from my website. So, a legitimate thing for me to do is to filter out anything that's not a get request, out of these logs. So, let's do that next. We're going to check again if we have three fields in our request field, and then we're also going to check if the action is get. If it's not, we're just going to ignore that line entirely:

URLCounts = {}

with open(logPath, "r") as f:
for line in (l.rstrip() for l in f):
match= format_pat.match(line)
if match:
access = match.groupdict()
request = access['request']
fields = request.split()
if (len(fields) == 3):
(action, URL, protocol) = fields
if (action == 'GET'):
if URLCounts.has_key(URL):
URLCounts[URL] = URLCounts[URL] + 1
else:
URLCounts[URL] = 1

results = sorted(URLCounts, key=lambda i: int(URLCounts[i]), reverse=True)

for result in results[:20]:
print result + ": " + str(URLCounts[result])

We should be getting closer to what we want now, the following is the output of the preceding code:

Yeah, this is starting to look more reasonable. But, it still doesn't really pass a sanity check. This is a news website; people go to it to read news. Are they really reading my little blog on it that just has a couple of articles? I don't think so! That seems a little bit fishy. So, let's dive in a little bit, and see who's actually looking at those blog pages. If you were to actually go into that file and examine it by hand, you would see that a lot of these blog requests don't actually have any user agent on them. They just have a user agent of -, which is highly unusual:

If a real human being with a real browser was trying to get this page, it would say something like Mozilla, or Internet Explorer, or Chrome or something like that. So, it seems that these requests are coming from some sort of a scraper. Again, potentially malicious traffic that's not identifying who it is.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.142.98.240