One of the tasks that awk is really good at is filtering data from log files. These log files may be many lines in length, perhaps 250,000 or more. I have worked with data with over a million lines. Awk can process these lines quickly and effectively. As an example, we will work with a web server access log with 30,000 lines to show how effective well-written awk code can be. As we work our way through the chapter, we will also see different log files and review some of the techniques that we can employ with the awk command and the awk programming language to help with the reporting and administration of our services. In this chapter we will cover the following topics:
When working with any file, the first task is to become familiar with the file schema. In simple terms, we need to know what is represented by each field and what is used to delimit the fields. We will be working with the access log file from an Apache HTTPD web server. The location of the log file can be controlled from the httpd.conf file. The default log file location on a Debian-based system is /var/log/apache2/access.log; other systems may use the httpd directory in place of apache2.
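Because the location differs between distributions, a script can probe a few likely paths before handing the file to awk. The following is a minimal sketch only: the find_access_log function name is hypothetical, and the Red Hat-style path /var/log/httpd/access_log is an assumption that may not match your configuration.

```shell
# Print the first readable file among the candidate paths passed as arguments.
# Returns non-zero if none of them can be read.
find_access_log() {
    for candidate in "$@"; do
        if [ -r "$candidate" ]; then
            printf '%s\n' "$candidate"
            return 0
        fi
    done
    return 1
}

# Try the Debian-style path first, then a Red Hat-style path (assumed).
find_access_log /var/log/apache2/access.log /var/log/httpd/access_log ||
    echo "no access log found in the usual places" >&2
```

If the log location has been changed in the server configuration, pass that path as the first argument instead.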
To demonstrate the layout of the file, I have installed a brand new instance of Apache2 on an Ubuntu 15.10 system. Once the web server was installed, I made a single request to the server from the Firefox browser on the local host.
Using the tail command, we can display the content of the log file. To be fair, cat would do just as well with this file, as it has just a few lines:
# tail /var/log/apache2/access.log
The output of the command and the contents of the file are shown in the following screenshot:
The output wraps a little onto new lines, but we do get a feel for the layout of the log. We can also see that even though it feels as if we accessed just one web page, we in fact requested two items: index.html and ubuntu-logo.png. We also failed to access the favicon.ico file. We can see that the file is space separated. The meaning of each of the fields is laid out in the following table:
Even though these fields are defined by Apache, we have to be careful. The date, time, and time zone form a single field enclosed in square brackets; however, there is a space inside the field between the date and time and the time zone. Because awk splits on whitespace by default, printing the complete time field requires both $4 and $5. This is shown in the following command example:
# awk ' { print $4,$5 } ' /var/log/apache2/access.log
We can view the command and the output it produces in the following screenshot:
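To try the same idea without a live server, here is a minimal sketch that feeds awk a single made-up entry. The sample line is hypothetical, not real data, and assumes Apache's combined log format. The second command is an alternative not used in the text above: it sets the field separator to either square bracket, so the whole timestamp arrives as one field.

```shell
# Hypothetical sample entry in Apache's combined log format (not real data).
sample='127.0.0.1 - - [10/Dec/2015:14:05:21 +0000] "GET / HTTP/1.1" 200 3525 "-" "Mozilla/5.0"'

# Default whitespace splitting: the timestamp is spread over $4 and $5.
time_default=$(printf '%s\n' "$sample" | awk '{ print $4, $5 }')

# Alternative: split on either square bracket, so $2 holds the whole timestamp.
time_bracket=$(printf '%s\n' "$sample" | awk -F'[][]' '{ print $2 }')

printf '%s\n%s\n' "$time_default" "$time_bracket"
```

With the default separator the brackets remain part of the output. With the bracket separator they are consumed as delimiters, which may be more convenient if the timestamp is to be processed further.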