Chapter 8. Logging and Real-time Analytics with MongoDB

Throughout this book, we revisited many concepts you already knew and showed how to use them in conjunction with the techniques and tools that MongoDB offers. The goal of this chapter is to apply these techniques in a real-life example.

The real-life example we will develop in this chapter shows how to use MongoDB as persistent storage for a web server's log data, specifically the data from an Nginx web server. By doing this, we will be able to analyze the traffic data of a web application.

We will start this chapter by analyzing the Nginx log format to identify the information that will be useful for our experiment. After this, we will define the type of analysis we want to perform in MongoDB. Finally, we will design our database schema and write code that reads and writes data in MongoDB.

For this chapter, we will assume that each host that generates these log events also consumes them and sends them to MongoDB. Our focus will not be on the application's architecture or on the code we will produce in our example. So, kind reader, if you do not agree with the code snippets shown here, please feel free to modify them or write new ones yourself.

That said, this chapter will cover:

  • Log data analysis
  • What we are looking for
  • Designing the schema

Log data analysis

The access log is often ignored by developers, system administrators, or anyone who runs services on the web. But it is a powerful tool when we need prompt feedback on what is happening with each request on our web server.

The access log keeps information about the server's activity and performance and also tells us about potential problems. Two of the most common web servers nowadays are Apache HTTPD and Nginx. By default, these web servers have two log types: error logs and access logs.

Error logs

As the name suggests, the error log is where the web server will store errors found during the processing of a received request. In general, this type of log is configurable and will write the messages according to the predefined severity level.

Access logs

The access log is where all the received and processed requests are stored. This will be the main object of our study.

The events written in the file are recorded in a predefined layout that can be formatted according to the wishes of those who are managing the server. By default, both Apache HTTPD and Nginx have a format known as combined. An example of a log generated in this format is presented as follows:

191.32.254.162 - - [29/Mar/2015:16:04:08 -0400] "GET /admin HTTP/1.1" 200 2529 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36"

At first sight, it can be a little frightening to see so much information in a single log line. However, if we examine the pattern that is applied to generate this line, we will see that it is not so difficult to understand.

The pattern that generates this line on the Nginx web server is presented as follows:

$remote_addr - $remote_user [$time_local] "$request" $status $body_bytes_sent "$http_referer" "$http_user_agent"

We will describe each part of this pattern so you get a better understanding:

  • $remote_addr: This is the IP address of the client that performed the request on the web server. In our example, this value corresponds to 191.32.254.162.
  • $remote_user: This is the authenticated user, if it exists. When an authenticated user is not identified, this field will be filled with a hyphen. In our example, the value is -.
  • [$time_local]: This is the time that the request was received on the web server in the format [day/month/year:hour:minute:second zone].
  • "$request": This is the client request itself, also known as the request line. To get a better understanding, we will analyze the request line in our example: "GET /admin HTTP/1.1". First, we have the HTTP verb used by the client, which in this case is GET. Next, we have the resource accessed by the client, which in this case is /admin. Last, we have the protocol used by the client, which in this case is HTTP/1.1.
  • $status: This is the HTTP status code that the web server returned to the client. The possible values for this field are defined in RFC 2616 (since superseded by RFC 7231 and, later, RFC 9110). In our example, the web server returns the status code 200 to the client.

    Note

    To learn more about RFC 2616, you can visit http://www.w3.org/Protocols/rfc2616/rfc2616.txt.

  • $body_bytes_sent: This is the length in bytes of the response body sent to the client. Note that this value does not include the response headers. In Nginx, the value is 0 when the response has no body (the equivalent Apache HTTPD field, %b, writes a hyphen instead). In our example, the value is 2,529 bytes.
  • "$http_referer": This is the value of the Referer header of the client's request. It identifies the page from which the requested resource was referenced. When the resource is accessed directly, this field is filled with a hyphen. In our example, the value is -.
  • "$http_user_agent": This is the information that the client sends in the User-Agent header. Normally, it is in this header that we can identify the web browser used on the request. In our example, the value for this field is "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2272.104 Safari/537.36".
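
To make the field-by-field description more concrete, the following is a minimal Python sketch that splits a combined-format line into the fields described in the preceding list. It is only an illustration, not the code used in this chapter: the regular expression, the group names, the parse_combined function, and the extra http_verb, resource, and protocol keys are our own assumptions. It also assumes an English locale, so that the month abbreviation in $time_local (Mar) can be parsed.

```python
import re
from datetime import datetime

# Regular expression mirroring the combined pattern described above.
# The group names are our own choice; Nginx does not define them.
COMBINED_RE = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) '
    r'\[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<body_bytes_sent>\d+|-) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)


def parse_combined(line):
    """Return a dict with the fields of one combined-format line, or None."""
    match = COMBINED_RE.match(line)
    if match is None:
        return None
    event = match.groupdict()

    # Split the request line into HTTP verb, resource, and protocol.
    parts = event["request"].split(" ")
    if len(parts) == 3:
        event["http_verb"], event["resource"], event["protocol"] = parts

    # Convert textual values into richer Python types.
    # %b (month abbreviation) is locale dependent; English is assumed.
    event["time_local"] = datetime.strptime(
        event["time_local"], "%d/%b/%Y:%H:%M:%S %z"
    )
    event["status"] = int(event["status"])
    sent = event["body_bytes_sent"]
    event["body_bytes_sent"] = 0 if sent == "-" else int(sent)
    return event


# The example line from this chapter.
line = (
    '191.32.254.162 - - [29/Mar/2015:16:04:08 -0400] '
    '"GET /admin HTTP/1.1" 200 2529 "-" '
    '"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_5) '
    'AppleWebKit/537.36 (KHTML, like Gecko) '
    'Chrome/41.0.2272.104 Safari/537.36"'
)
event = parse_combined(line)
print(event["remote_addr"], event["http_verb"], event["resource"], event["status"])
```

The hyphen placeholders for $remote_user and $http_referer are kept as they are in this sketch; a real consumer might prefer to turn them into empty values before storing the event.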

Besides the variables used in the combined pattern, Nginx offers more variables that we can use to create new log formats. Among these, we highlight:

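  • $request_time: This indicates the request processing time in seconds, with millisecond resolution, measured from the moment the first bytes are read from the client
  • $request_length: This indicates the total length of the client request, including the request line, the headers, and the request body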
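
As a rough illustration of how these two variables could be consumed, the following Python sketch assumes a hypothetical log format that appends $request_time and $request_length, separated by single spaces, to the end of the combined pattern. The function name and the sample values (0.005 seconds and 312 bytes) are invented for this example and do not come from a real server.

```python
def split_timing_fields(line):
    """Split off the trailing "$request_time $request_length" fields.

    Assumes a hypothetical format in which these two variables are
    appended, space separated, to the end of the combined pattern.
    Returns the remaining combined-format text and the two values.
    """
    rest, request_time, request_length = line.rsplit(" ", 2)
    # $request_time is written in seconds (for example, 0.005),
    # and $request_length is a whole number of bytes.
    return rest, float(request_time), int(request_length)


# Shortened version of the chapter's example line, with two invented
# values appended for illustration only.
sample = (
    '191.32.254.162 - - [29/Mar/2015:16:04:08 -0400] '
    '"GET /admin HTTP/1.1" 200 2529 "-" "Mozilla/5.0" 0.005 312'
)
rest, request_time, request_length = split_timing_fields(sample)
print(request_time, request_length)
```

Splitting from the right keeps the sketch independent of the spaces inside the quoted fields, at the cost of assuming the two extra values always come last.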

Now that we are familiar with the web server access log, we will define what we want to analyze so that we know which information needs to be logged.
