Chapter 11. Summarizing Logs with Awk

One of the tasks that awk is really good at is filtering data from log files. These log files may be very long, perhaps 250,000 lines or more; I have worked with data of over a million lines. Awk can process these lines quickly and effectively. As an example, we will work with a web server access log of 30,000 lines to show how effective well-written awk code can be. As we work our way through the chapter, we will also see different log files and review some of the techniques that we can employ with the awk command and the awk programming language to help with the reporting and administration of our services. In this chapter we will cover the following topics:

  • HTTPD log file format
  • Displaying data from web server logs
  • Summarizing HTTP access codes
  • Displaying the highest ranking client IP addresses
  • Listing browser data
  • Working with e-mail logs

The HTTPD log file format

When working with any file, the first task is to become familiar with the file schema. In simple terms, we need to know what is represented by each field and what is used to delimit the fields. We will be working with the access log file from an Apache HTTPD web server. The location of the log file can be controlled from the httpd.conf file. The default log file location on a Debian-based system is /var/log/apache2/access.log; other systems may use the httpd directory in place of apache2.

To demonstrate the layout of the file, I have installed a brand new instance of Apache2 on an Ubuntu 15.10 system. Once the web server was installed, I made a single request to the server from the Firefox browser on the local host.

Using the tail command, we can display the content of the log file, although, to be fair, cat will do just as well with this file as it has just a few lines:

# tail /var/log/apache2/access.log

The output of the command and the contents of the file are shown in the following screenshot:

[Screenshot: The HTTPD log file format]

The output does wrap a little onto new lines, but we do get a feel for the layout of the log. We can also see that even though it feels as if we accessed just one web page, we in fact accessed two items: index.html and ubuntu-logo.png. We also failed to access the favicon.ico file. We can see that the file is space separated. The meaning of each of the fields is laid out in the following table:

Field   Purpose
1       Client IP address.
2       Client identity as defined by RFC 1413 and the identd client. This is not read unless IdentityCheck is enabled. If it is not read, the value will be a hyphen.
3       The user ID from user authentication, if enabled. If authentication is not enabled, the value will be a hyphen.
4       The date and time of the request in the format day/month/year:hour:minute:second offset.
5       The actual request and method.
6       The return status code, such as 200 or 404.
7       File size in bytes.
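
To see how the fields in the table line up with awk's own field numbers, we can ask awk to number every whitespace-delimited field on a line. The following is a minimal sketch only: the log entry is a hypothetical one made up for illustration, not a line taken from the server above, and the variable names line and fields are placeholders of my own.

```shell
# Hypothetical access-log entry, invented for illustration only.
line='127.0.0.1 - - [10/Dec/2015:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 3525'

# Print each whitespace-delimited field prefixed with its awk field number.
fields=$(printf '%s\n' "$line" | awk '{ for (i = 1; i <= NF; i++) print i ": " $i }')
printf '%s\n' "$fields"
```

With this sample line, the seven logical fields from the table become ten awk fields, because awk's default field separator splits on the spaces inside the bracketed timestamp and inside the quoted request.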

Even though these fields are defined by Apache, we have to be careful. The date, time, and time zone form a single field enclosed in square brackets; however, there is a space inside the field between the date and time and the time zone offset, so awk's default whitespace splitting treats it as two fields. To ensure that we print the complete time field if required, we need to print both $4 and $5. This is shown in the following command example:

# awk ' { print $4,$5 } ' /var/log/apache2/access.log

We can view the command and the output it produces in the following screenshot:

[Screenshot: The HTTPD log file format]
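
Printing $4 and $5 leaves the square brackets in the output. One possible alternative, shown here only as a sketch and not as the approach used in the rest of the chapter, is to tell awk to split on the bracket characters themselves so that the complete timestamp arrives as a single field. The log entry is again a hypothetical line made up for illustration, and the variable names line and stamp are my own.

```shell
# Hypothetical access-log entry, invented for illustration only.
line='127.0.0.1 - - [10/Dec/2015:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 3525'

# Use a field separator that matches either "[" or "]".
# Everything between the two brackets then becomes $2.
stamp=$(printf '%s\n' "$line" | awk -F'[][]' '{ print $2 }')
printf '%s\n' "$stamp"
```

The trade-off is that the other fields no longer have their usual numbers under this separator, so this form suits cases where only the timestamp is needed.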