12

Logging

One of the basic elements of monitoring and observability is logs. Logs allow us to detect actions that are happening in a running system. That information can be used to analyze the behavior of the system, especially any errors or bugs that may arise, giving us useful insight into what is actually going on.

Using logs correctly is deceptively difficult, though. It's easy to collect too much or too little information, or to log the wrong information. In this chapter, we will see some of the key elements of what to collect, and the general strategy to follow to ensure that logs are used to their best effect.

In this chapter, we'll cover the following topics:

  • Log basics
  • Producing logs in Python
  • Detecting problems through logs
  • Log strategies
  • Adding logs while developing
  • Log limitations

Let's start with the basic principles of logging.

Log basics

Logs are basically messages produced by the system as it runs. These messages are produced by specific pieces of code as they are executed, allowing us to track actions happening in the code.

Logs can be completely generic, like "Function X is called" or can include some context of the specifics of the execution, like "Function X is called with parameter Y."

Normally, logs are generated as plaintext messages. While there are other options, pure plaintext is very easy to deal with, can be read easily, is flexible in its format, and can be searched with pure text tools like grep. These tools are normally very fast and most developers and sysadmins know how to use them.

As well as the main message text, each log contains some metadata about what system produced the log, what time the log was created, and so on. If the log is in text format, this is normally attached to the start of the line.

A standard and consistent log format helps you with searching, filtering, and sorting the messages. Ensure that you use consistent formats across your different systems.

Another important metadata value is the severity of the log. This allows us to categorize the different logs by their relative importance. The standard severity levels, in order from least to most important, are DEBUG, INFO, WARNING, and ERROR.

The CRITICAL level is less used, but it's useful to show catastrophic errors.

It's important to categorize the logs with their proper severity and filter out unimportant messages to focus on the more important ones. Each logging facility can be configured to only produce logs of a given severity level or higher.

It's possible to add custom log levels instead of the predefined ones. This is generally a bad idea and should be avoided in most cases, as the standard levels are well understood by all tools and engineers. We will describe later in this chapter how to define a strategy per level to make the best use of each one.

In a system serving requests, either as request-response or asynchronously, most of the logs will be generated as part of dealing with a request, which will produce several logs indicating what the request is doing. Because more than one request will normally be processed at the same time, the logs of different requests will be intermixed. For example, consider the following logs:

Sept 16 20:42:04.130 10.1.0.34 INFO web: REQUEST GET /login
Sept 16 20:42:04.170 10.1.0.37 INFO api: REQUEST GET /api/login
Sept 16 20:42:04.250 10.1.0.37 INFO api: REQUEST TIME 80 ms
Sept 16 20:42:04.270 10.1.0.37 INFO api: REQUEST STATUS 200
Sept 16 20:42:04.360 10.1.0.34 INFO web: REQUEST TIME 230 ms
Sept 16 20:42:04.370 10.1.0.34 INFO web: REQUEST STATUS 200

The preceding logs show two different services, as indicated by the different IP addresses (10.1.0.34 and 10.1.0.37) and the two different service types (web and api). Though this can be enough to separate the requests, it's a good idea to create a single request ID to be able to group the requests in the following way:

Sept 16 20:42:04.130 10.1.0.34 INFO web: [4246953f8] REQUEST GET /login
Sept 16 20:42:04.170 10.1.0.37 INFO api: [fea9f04f3] REQUEST GET /api/login
Sept 16 20:42:04.250 10.1.0.37 INFO api: [fea9f04f3] REQUEST TIME 80 ms
Sept 16 20:42:04.270 10.1.0.37 INFO api: [fea9f04f3] REQUEST STATUS 200
Sept 16 20:42:04.360 10.1.0.34 INFO web: [4246953f8] REQUEST TIME 230 ms
Sept 16 20:42:04.370 10.1.0.34 INFO web: [4246953f8] REQUEST STATUS 200

In microservices environments, requests will flow from one service to the other, so it's a good idea to create a request ID that's shared across services so the full cross-service flow can be understood. To do that, the request ID needs to be created by the first service and then transmitted to the next, typically as a header in an HTTP request.
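
As an illustration, here is a minimal sketch of how a service could receive or create a request ID and pass it along. The X-Request-ID header name and the helper functions are conventions chosen for this example, not fixed standards:

import logging
import uuid

import requests

logger = logging.getLogger(__name__)

HEADER = 'X-Request-ID'


def get_request_id(headers):
    # Reuse the ID created by the first service in the chain, or
    # create a new one if this service is the entry point
    return headers.get(HEADER) or uuid.uuid4().hex[:9]


def call_next_service(request_id, url):
    # Transmit the same ID so logs across services can be correlated
    logger.info('[%s] REQUEST GET %s', request_id, url)
    return requests.get(url, headers={HEADER: request_id})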

As we saw in Chapter 5, The Twelve-Factor App Methodology, logs should be treated as an event stream. This means that the application itself should not be concerned with the storage and treatment of logs. Instead, the logs should be directed to stdout. From there, while developing the application, the developer can extract the information while it's running.

In production environments, stdout should be captured by other tools and routed, joining the different sources into a single stream, and then stored or indexed for later consultation. These tools should be configured in the production environment, not in the app itself.

Possible tools for this rerouting include Fluentd (https://github.com/fluent/fluentd) or the old favorite combination of the logger Linux command to create system logs, sending them to a configured rsyslog (https://www.rsyslog.com/) server that can forward and aggregate them.

No matter how we collect logs, a typical system will produce a lot of them, and they need to be stored somewhere. While each individual log is small, aggregating thousands of them uses a significant amount of space. Any log system should be configured with a policy on how much data it accepts, to avoid growing indefinitely. In general, a retention policy based on time (such as keeping logs from the last 15 days) is the best approach, as it is easy to understand. Finding the balance between how far back in the past you need to be able to look and the amount of space the system uses is important.

Be sure to check the retention policy when enabling any new log service, be it local or cloud-based, to make sure it's compatible with your defined retention period. You won't be able to analyze anything that happened before the time window. Double-check that the rate of log creation is as expected and that space consumption is not making the effective time window in which you can collect logs smaller. You don't want to find out that you unexpectedly went over quota while you were tracking a bug.

Generating log entries is easy, as we will see in the next section, Producing logs in Python.

Producing logs in Python

Python includes a standard module to produce logs. This module is easy to use, with a very flexible configuration, but it can be confusing if you don't understand the way it operates.

A basic program to create logs looks like this. This is available as basic_logging.py on GitHub at https://github.com/PacktPublishing/Python-Architecture-Patterns/tree/main/chapter_12_logging:

import logging
# Generate two logs with different severity levels
logging.warning('This is a warning message')
logging.info('This is an info message')

The .warning and .info methods create logs with the corresponding severity; the message is a plain text string.

When executed, it shows the following:

$ python3 basic_logging.py
WARNING:root:This is a warning message

The logs are, by default, routed to stderr rather than stdout, and the default configuration doesn't display INFO logs. The format of the logs is also the default, which doesn't include a timestamp.

To add all this information, we need to understand the three basic elements used for logging in Python:

  • A formatter, which describes how the full log is going to be presented, attaching metadata like the timestamp or the severity.
  • A handler, which decides how the logs are propagated. It sets the format of the logs through the formatter, as defined above.
  • A logger, which produces the logs. It has one or more handlers that describe how the logs are propagated.

With this information, we can configure the logs to specify all the details we want:

import sys
import logging
# Define the format
FORMAT = '%(asctime)s.%(msecs)dZ:APP:%(name)s:%(levelname)s:%(message)s'
formatter = logging.Formatter(FORMAT, datefmt="%Y-%m-%dT%H:%M:%S")
# Create a handler that sends the logs to stdout
handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(formatter)
# Create a logger with name 'mylogger', adding the handler and setting
# the level to INFO
logger = logging.getLogger('mylogger')
logger.addHandler(handler)
logger.setLevel(logging.INFO)
# Generate three logs
logger.warning('This is a warning message')
logger.info('This is an info message')
logger.debug('This is a debug message, not to be displayed')

We define the three elements in the same order that we saw before. First the formatter, then the handler, which sets the formatter, and finally the logger, which adds the handler.

The formatter has the following format:

FORMAT = '%(asctime)s.%(msecs)dZ:APP:%(name)s:%(levelname)s:%(message)s'
formatter = logging.Formatter(FORMAT, datefmt="%Y-%m-%dT%H:%M:%S")

FORMAT is composed using Python's %-style formatting, an older way of templating strings. Most elements are described as %(name)s, where the final s character indicates string formatting. Here's a description of each element:

  • asctime sets the timestamp in a human-readable format. We describe it in the datefmt argument to follow the ISO 8601 format, then append the milliseconds and a Z to get the timestamp in full ISO 8601 form. The d at the end of %(msecs)d means that the value is printed as an integer, limiting it to milliseconds without showing any extra resolution, which is available as a fractional value.
  • name is the name of the logger, as we will describe later. We also add APP to differentiate between different applications.
  • levelname is the severity of the log, such as INFO, WARNING, or ERROR.
  • message, finally, is the log message.

Once we have defined the formatter, we can move to the handler:

handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(formatter)

The handler is a StreamHandler, and we set the destination of the stream to be sys.stdout, which is the Python-defined variable that points to stdout.

There are more handlers available, like FileHandler to send the logs to a file, SysLogHandler to send logs to a syslog destination, and even more advanced cases like TimedRotatingFileHandler, which rotates the log file based on time, keeping the latest file current and archiving older versions. You can see more information on all the available handlers in the documentation at https://docs.python.org/3/howto/logging.html#useful-handlers.
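
For example, a possible setup for a time-rotated file, with the filename and retention values chosen just for illustration:

import logging
from logging.handlers import TimedRotatingFileHandler

# Rotate the file at midnight and keep the last 15 days of files,
# producing app.log, app.log.2021-09-17, app.log.2021-09-16, and so on
handler = TimedRotatingFileHandler('app.log', when='midnight',
                                   backupCount=15)
handler.setFormatter(logging.Formatter('%(asctime)s:%(levelname)s:%(message)s'))

logger = logging.getLogger('rotated')
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info('This message is stored in app.log')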

Once the handler is defined, we can create the logger:

logger = logging.getLogger('mylogger')
logger.addHandler(handler)
logger.setLevel(logging.INFO)

The first thing to do is to create a name for the logger, which here we define as mylogger. This allows us to divide the logs of the application into subsections. We append the handler using .addHandler.

Finally, we define the level to log as INFO using the .setLevel method. This will display all logs of level INFO and higher; lower-level logs won't be displayed.

If we run the file, we see the whole configuration coming together:

$ python3 configured_logging.py
2021-09-18T23:15:24.563Z:APP:mylogger:WARNING:This is a warning message
2021-09-18T23:15:24.563Z:APP:mylogger:INFO:This is an info message

We can see that:

  • The time is defined in ISO 8601 format as 2021-09-18T23:15:24.563Z. This is a combination of the asctime and msecs parameters.
  • The APP and mylogger parameters allow us to filter by application and submodule.
  • The severity is displayed. Note that there's a DEBUG message that isn't displayed, as the minimum level configured is INFO.

The logging module in Python is capable of high levels of configuration. Check the official documentation for more information at https://docs.python.org/3/library/logging.html.
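
For example, the same configuration we built above can be expressed declaratively with logging.config.dictConfig. This sketch is equivalent to the handler, formatter, and logger wiring we created manually:

import logging
from logging.config import dictConfig

LOGGING = {
    'version': 1,
    'formatters': {
        'standard': {
            'format': '%(asctime)s.%(msecs)dZ:APP:%(name)s:%(levelname)s:%(message)s',
            'datefmt': '%Y-%m-%dT%H:%M:%S',
        },
    },
    'handlers': {
        'stdout': {
            'class': 'logging.StreamHandler',
            'stream': 'ext://sys.stdout',
            'formatter': 'standard',
        },
    },
    'loggers': {
        'mylogger': {
            'handlers': ['stdout'],
            'level': 'INFO',
        },
    },
}

dictConfig(LOGGING)
logging.getLogger('mylogger').info('Configured through dictConfig')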

Detecting problems through logs

For any problem in a running system, there are two kinds of errors that can occur: expected and unexpected. In this section, we will see the differences between them in terms of logs and how to handle them.

Detecting expected errors

Expected errors are errors that are detected explicitly by creating an ERROR log in the code. For example, the following code produces an ERROR log when the accessed URL returns a status code different from 200 OK:

import logging
import requests
URL = 'https://httpbin.org/status/500'
response = requests.get(URL)
status_code = response.status_code
if status_code != 200:
    logging.error(f'Error accessing {URL} status code {status_code}')

This code, when executed, triggers an ERROR log:

$ python3 expected_error.py
ERROR:root:Error accessing https://httpbin.org/status/500 status code 500

This is a common pattern to access an external URL and validate that it has been accessed correctly. The block where the log is generated could perform some remediation or a retry, among other things.
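
For instance, the remediation could be a simple retry. Here is a minimal sketch, with the retry count chosen arbitrarily for the example:

import logging

import requests

logger = logging.getLogger(__name__)

URL = 'https://httpbin.org/status/500'
MAX_RETRIES = 3


def get_with_retries(url):
    for attempt in range(1, MAX_RETRIES + 1):
        response = requests.get(url)
        if response.status_code == 200:
            return response
        # The error is handled automatically by retrying, so a WARNING
        # is enough at this point
        logger.warning('Attempt %s: %s returned %s',
                       attempt, url, response.status_code)
    # All retries exhausted; now it's a real error
    logger.error('Error accessing %s after %s attempts', url, MAX_RETRIES)
    return response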

Here, we use the https://httpbin.org service, a simple HTTP request and response service that can be used to test code. In particular, the https://httpbin.org/status/<code> endpoint returns the specified status code, making it easy to generate errors.

This is an example of an expected error. We planned in advance for something that we didn't want to happen, but understood that there's a possibility of it happening. By planning in advance, the code is ready to process the error and capture it adequately.

In this case, we can describe the situation clearly enough, and provide context to understand what is happening. The problem is obvious, even if the solution may not be.

These kinds of errors are relatively easy to deal with since they describe foreseen problems.

For example, the site may be unavailable, there could be an authentication problem, or perhaps the base URL is misconfigured.

Keep in mind that in some cases, it's possible for the code to deal with a certain situation without failing, but for it still to be considered an error. For example, maybe you want to detect if an old authentication system is still in use by someone. This method of adding ERROR or WARNING logs when deprecated actions are detected can enable you to take actions to remedy the situation.
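
A possible sketch of such a detection, with a purely hypothetical legacy-token check:

import logging

logger = logging.getLogger(__name__)


def authenticate(headers):
    # Hypothetical check for a deprecated authentication method
    if 'legacy-token' in headers:
        # The request still works, but we log it to track remaining
        # users of the old system and plan its removal
        logger.warning('Deprecated legacy-token auth used by %s',
                       headers.get('user', 'unknown'))
        return True
    return 'oauth-token' in headers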

Other examples of this type of error include connections to databases and data being stored in a deprecated format.

Capturing unexpected errors

But expected errors are not the only ones that can occur. Unfortunately, any running system will surprise you with all kinds of unexpected behavior that breaks the code in creative ways. Unexpected errors in Python are normally produced by an exception being raised at some point in the code and never being captured.

For example, imagine that when making a small change to some code, we introduce a typo:

import logging
import requests
URL = 'https://httpbin.org/status/500'
logging.info(f'GET {URL}')
response = requests.ge(URL)
status_code = response.status_code
if status_code != 200:
    logging.error(f'Error accessing {URL} status code {status_code}')

Note that in line 8, we introduced a typo:

response = requests.ge(URL)

The correct .get call has been replaced by .ge. When we run it, it produces the following error:

$ python3 unexpected_error.py
Traceback (most recent call last):
  File "./unexpected_error.py", line 8, in <module>
    response = requests.ge(URL)
AttributeError: module 'requests' has no attribute 'ge'

By default, Python will print the error and stack trace to stderr. When the code is executed as part of a web server, this can be enough to send these messages as ERROR logs, depending on how the configuration is set up.

A web server will capture these messages, route them to the logs, and generate a proper 500 status code, indicating that there has been an unexpected error. The server will still be available for the next request.

If you need to create a script that must run endlessly and be protected against any unexpected errors, be sure to use a broad try..except block, so that any possible exception is captured and handled.

Any Python exception that's properly captured with a specific except block can be considered an expected error. Some of them may require ERROR messages to be generated, but others may be captured and handled without requiring such information.
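
As an illustration, a possible sketch that captures specific, foreseen network exceptions from the requests library:

import logging

import requests

logger = logging.getLogger(__name__)


def get_url(url):
    try:
        return requests.get(url, timeout=5)
    except requests.exceptions.Timeout:
        # Foreseen: the remote service can be slow. Handled, so no
        # ERROR is required
        logger.warning('Timeout accessing %s', url)
    except requests.exceptions.ConnectionError:
        # Foreseen: the remote service may be down
        logger.warning('Connection error accessing %s', url)
    return None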

For example, let's adjust the code to make a request every few seconds. The code is available in GitHub at https://github.com/PacktPublishing/Python-Architecture-Patterns/tree/main/chapter_12_logging:

import logging
import requests
from time import sleep

# Configure the root logger to display INFO logs and above
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger()

while True:
    try:
        sleep(3)
        logger.info('--- New request ---')
        URL = 'https://httpbin.org/status/500'
        logger.info(f'GET {URL}')
        response = requests.ge(URL)
        scode = response.status_code
        if scode != 200:
            logger.error(f'Error accessing {URL} status code {scode}')
    except Exception as err:
        logger.exception(f'ERROR {err}')

The key element is the following endless loop:

while True:
    try:
        code
    except Exception as err:
        logger.exception(f'ERROR {err}')

The try..except block is inside the loop, so even if there's an error, the loop will be uninterrupted. If there's any error, except Exception will capture it, no matter what the exception is.

This is sometimes referred to as Pokemon exception handling, as in "Gotta catch 'em all." This should be restricted to a kind of "last-resort safety net." In general, not being precise with the exceptions to be captured is a bad idea, as you can hide errors by handling them incorrectly. Errors should never pass silently.

To be sure that not only the error but also the full stack trace is logged, we log it using .exception instead of .error. This adds the stack trace to the single text message while still logging it with ERROR severity.

When we run the command, we get these logs. Be sure to stop it by pressing Ctrl + C:

$ python3 protected_errors.py
INFO:root:--- New request ---
INFO:root:GET https://httpbin.org/status/500
ERROR:root:ERROR module 'requests' has no attribute 'ge'
Traceback (most recent call last):
  File "./protected_errors.py", line 18, in <module>
    response = requests.ge(URL)
AttributeError: module 'requests' has no attribute 'ge'
INFO:root:--- New request ---
INFO:root:GET https://httpbin.org/status/500
ERROR:root:ERROR module 'requests' has no attribute 'ge'
Traceback (most recent call last):
  File "./protected_errors.py", line 18, in <module>
    response = requests.ge(URL)
AttributeError: module 'requests' has no attribute 'ge'
^C
...
KeyboardInterrupt

As you can see, the logs include Traceback, which allows us to detect a specific problem by adding information about where the exception was produced.

Any unexpected error should be logged as ERROR. Ideally, it should also be analyzed and the code changed to fix the bug, or at least to transform it into an expected error. Sometimes this is not feasible due to other pressing issues or a low occurrence of the problem, but some strategy should be implemented to ensure consistency in the handling of bugs.

A great tool to handle unexpected errors is Sentry (https://sentry.io/). This tool automatically captures errors on a lot of common platforms, including Python (and Django), Ruby on Rails, Node, JavaScript, C#, iOS, and Android. It aggregates the detected errors and allows us to work with them more strategically, which is difficult when we only have access to the logs.

Sometimes, unexpected errors will present themselves with enough information about what the problem is, which could be related to an external problem like a network issue or a database problem. The solution may be located outside the realm of the service itself.

Log strategies

A common problem when dealing with logs is deciding on the appropriate severity for each of the individual messages. Is this message a WARNING or an ERROR? Should this statement be added as an INFO message or not?

Most log severity descriptions come with definitions such as "the program shows a potentially harmful situation" or "the application highlights the progress of the request." These are vague and difficult to act on in a real-life situation. Instead of using these vague definitions, try to define each level in relation to the follow-up action that should be taken when the issue is encountered, asking questions like: "Do I want to be informed each and every time this situation happens?" This helps clarify to developers what to do when a given log is found.

The following table shows some examples of the different severity levels and what action could be taken:

Log level | Action to take | Comments
DEBUG | None. | Not tracked. Only useful while developing.
INFO | None. | INFO logs show generic information about the flow of the actions in the app to help track systems.
WARNING | Track the number of logs. Alert on rising levels. | WARNING logs track errors that are automatically fixed, like retries to connect to an external service, or fixable format errors in a database. A sudden increase may require investigation.
ERROR | Track the number of logs. Alert on rising levels. Review all errors. | ERROR logs track errors that can't be recovered. A sudden increase may require immediate action. All of them should be periodically reviewed to fix common occurrences and mitigate them, perhaps moving them to WARNING level.
CRITICAL | Immediate response. | CRITICAL logs indicate a catastrophic failure in the application. A single one indicates the system is not working at all and can't recover.

This sets clear expectations on how to respond. Note this is an example, and you may need to make tweaks and adjustments to adapt this to the needs of your specific organization.

The hierarchy of different severities is very clear, and in our example, there's an acceptance that there'll be a certain number of ERROR logs generated. For the development team's sanity, not everything needs to be fixed immediately, but a certain order and prioritization should be enforced.

In production situations, ERROR logs will typically be categorized from "we're doomed" to "meh." Development teams should actively either fix "meh" logs or stop the issue from being logged to remove noise from the monitoring tools. That may include lowering the level of logs if they aren't worth checking. You want as few ERROR logs as possible, so all of them can be meaningful.

Remember that ERROR logs will include unexpected errors that typically require a fix to either resolve the issue completely, or explicitly capture it and reduce its severity if it is not important.

This follow-up is definitely a challenge as applications grow, as the number of ERROR logs will increase significantly. It requires time to be spent on proactive maintenance. If this is not taken seriously and it is too often dropped for other tasks, it will compromise the reliability of the application in the medium term.

WARNING logs are indications that something may not be working as smoothly as expected, but things are under control, unless there's a sudden increase in the number of logs of this kind. INFO logs are just there to give context in the event of a problem, but can be ignored otherwise.

A common mistake is to generate ERROR logs in actions where there are incorrect input parameters, such as in web requests when a 400 BAD REQUEST status code is returned. Some developers will argue that a customer sending a malformed request is an error. But there's nothing that the developer team should do if the request is properly detected and returned. It's business as usual, and the only action may be to return a meaningful message to the requester so they can fix their request.

If this behavior persists in certain critical requests, like repeatedly sending a bad password, a WARNING log can be created. There's no point in creating an ERROR log when the application is behaving as expected.

In web applications, as a rule of thumb, ERROR logs should only be created when the status code is one of the 50X variants (like 500, 502, and 503). Remember that the 40X errors mean that the sender has a problem, while 50X means that the application has the problem, and it's your team's responsibility to fix it.
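
Following this rule of thumb, a small sketch of a helper that picks the severity from the status code might look like this:

import logging

logger = logging.getLogger(__name__)


def log_request_status(path, status_code):
    if status_code >= 500:
        # A 50X means the application has the problem: log an ERROR
        logger.error('REQUEST %s STATUS %s', path, status_code)
    else:
        # 40X errors are the sender's problem, and 2XX/3XX are normal
        # operation: INFO gives context without raising alerts
        logger.info('REQUEST %s STATUS %s', path, status_code)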

With common and shared definitions of log levels across the team, all engineers will have a shared understanding of error severity that will help shape meaningful actions toward improving the code.

Allow time for tweaking and adjusting any definition. It's also likely that you'll have to deal with logs created before the definition, which can require work. One of the biggest challenges in legacy systems is creating a proper logging system to categorize problems, as they'll likely be very noisy, making it difficult to distinguish the real problems from annoyances and even non-problems.

Adding logs while developing

Any test runner will capture logs and display them as part of the trace while running tests.

pytest, which we introduced in Chapter 10, Testing and TDD, will display logs as part of the result of a failing test.

This is a good opportunity to check that the expected logs are being generated while the feature is still in the development phase, especially if it's done in a TDD process where failing tests and errors are produced routinely as part of the process. Any test that checks an error should also verify that the corresponding logs are being produced while developing the feature.

You can explicitly add a check to the test to validate that the log is being generated. pytest provides this through its caplog fixture, which absorbed the functionality of the older pytest-catchlog plugin (https://pypi.org/project/pytest-catchlog/).
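
For example, a possible test using the caplog fixture, assuming a hypothetical process function that logs an ERROR on invalid input:

import logging

logger = logging.getLogger(__name__)


def process(value):
    if value < 0:
        logger.error('Invalid value %s', value)
        return None
    return value * 2


def test_error_log_is_generated(caplog):
    caplog.set_level(logging.INFO)
    assert process(-1) is None
    assert 'Invalid value -1' in caplog.text
    assert any(record.levelname == 'ERROR' for record in caplog.records)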

Typically, though, we just take a bit of care and incorporate checking the logs into the TDD practice of initially verifying that the test fails. However, be sure that the developers understand why it's useful to have logs while developing, to make the habit stick.

While developing, DEBUG logs can be used to add extra information about the code flow that would be excessive for production. In development, this extra information can help fill in the gaps between INFO logs and help developers to solidify the habit of adding logs. A DEBUG log may be promoted to INFO if, during tests, it's found to be useful in tracking problems in production.

Additionally, for special occasions, DEBUG logs can be enabled in production in controlled cases to track certain problems that are difficult to understand. Note that this has big implications on the number of generated logs, which can lead to storage problems. Be very cautious here.
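
One possible way to allow this is to read the level from the environment, so DEBUG can be enabled temporarily without a code change. The LOG_LEVEL variable name here is an arbitrary choice for the example:

import logging
import os

# Hypothetical environment variable controlling verbosity; defaults
# to INFO so production stays quiet unless explicitly overridden
logging.basicConfig(level=os.environ.get('LOG_LEVEL', 'INFO'))

logger = logging.getLogger('myapp')
logger.debug('Only displayed when LOG_LEVEL=DEBUG is set')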

Be sensible about the information displayed in logs of INFO and higher severity: avoid sensitive data such as passwords, secret keys, credit card numbers, and personal information.
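
One possible approach to enforce this is a logging filter that masks sensitive values before the message is emitted. The pattern here only covers a simple password=value case and is purely illustrative:

import logging
import re


class RedactPasswords(logging.Filter):
    PATTERN = re.compile(r'(password=)\S+', re.IGNORECASE)

    def filter(self, record):
        # Mask the value, but keep the record (returning True emits it)
        record.msg = self.PATTERN.sub(r'\1[REDACTED]', str(record.msg))
        return True


handler = logging.StreamHandler()
handler.addFilter(RedactPasswords())

logger = logging.getLogger('safe')
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.warning('Login failed with password=hunter2')
# Displays: Login failed with password=[REDACTED]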

In production, keep an eye on any size limitations and on how quickly logs are generated. Systems may experience a log explosion when new features are added, when the number of requests grows, or when the number of workers is increased; all three situations are common when systems undergo growth.

It's always a good idea to double-check that the logs are being properly captured and available in different environments. All the configuration to ensure that the logs are properly captured may take a bit of time, so it's better to do this beforehand. This involves capturing unexpected errors and other logs in production and checking that all the plumbing is done correctly. The alternative is to discover that it's not working correctly only after stumbling into a real problem.

Log limitations

Logs are very useful to understand what's happening in a running system, but they have certain limitations that are important to understand:

  • Logs are only as good as their messages. A good, descriptive message is critical in making logs useful. Reviewing the log messages with a critical eye, and correcting them when needed, is important to save precious time on production problems.
  • Have an appropriate number of logs. Too many logs can confuse a flow, and too few may not include enough information to allow us to understand the problem. Large numbers of logs also create problems with storage.
  • Logs should work as an indication of the context of the problem, but likely won't pinpoint it. Trying to generate specific logs that fully explain a bug will be an impossible task. Instead, focus on showing the general flow and surrounding context of the action, so it can be replicated locally and debugged. For example, for a request, make sure to log both the request and its parameters so the situation can be replicated.
  • Logs allow us to follow the execution of a single instance. When grouped together using a request ID or similar, logs can be grouped by execution, allowing us to follow the flow of a request or task. However, logs don't directly display aggregated information. Logs answer the question, "what happened in this task?", but not "what is going on in the system?" For that kind of info, it's better to use metrics.

    There are tools available to create metrics based on logs. We will talk more about metrics in Chapter 13, Metrics.

  • Logs only work retrospectively. When a problem in a task is detected, logs can only show information that was prepared in advance. That's why it's important to analyze critically and refine the information, removing logs that are not useful and adding others with relevant contextual information to help replicate the problem.

Logs are a fantastic tool, but they need to be maintained to ensure that they can be used to detect bugs and problems and allow us to take action as efficiently as possible.

Summary

In this chapter, we started by presenting the basic elements of logs. We defined how logs contain messages plus some metadata like a timestamp, and considered the different severity levels. We also described the need to define request IDs to group logs related to the same task. We also discussed how, in the Twelve-Factor App methodology, logs should be sent to stdout to detach log generation from the process of handling and routing them to the proper destination to allow the collection of all logs in the system.

We then showed how to produce logs in Python using the standard logging module, describing the three key elements of the logger, the handler, and the formatter. Next, we showed the two different errors that can be produced in a system: expected, understood as errors that were foreseen as possible and are handled; and unexpected, meaning those that were not foreseen and occurred out of our control. We then went through the different strategies and cases for these.

We described the different severities and how to generate a strategy for what actions should be taken when a log of a certain severity is detected, instead of categorizing the logs in terms of "how critical they are", which ends up generating vague guidelines and not being very useful.

We discussed several habits to improve the usefulness of logs by including them in development in a TDD workflow. This allows developers to consider the information presented in logs while writing tests and producing errors, which presents the perfect opportunity to ensure that the generated logs work correctly.

Finally, we discussed the limitations of logs and how we can deal with them.

In the next chapter, we will look at how to work with aggregated information to find out the general state of the system through the usage of metrics.

Join our book’s Discord space

Join the book’s Discord workspace for a monthly Ask me Anything session with the authors:

https://packt.link/PythonArchitechture
