12

Observing Application Behavior and Performance

With every new version of our software, we feel that release thrill. Did we manage to finally fix all those pesky problems we've been working on lately? Will it work or will it break? Will users be satisfied, or will they complain about new bugs or performance issues?

We usually employ various quality assurance techniques and automated testing methodologies to increase our confidence in software quality and validity. But these techniques and methodologies only increase our expectation that things will run smoothly with each new release. So how can you make sure that an application really is running smoothly for your users? Or conversely, how can you know if something is going wrong?

In this chapter we will discuss the topic of application observability. Observability is a property of a software system that allows you to explain and understand the application's state based on its outputs. If you know the state of the system and understand how it got there, you will know if the state is correct. By employing various observability techniques, we will learn about:

  • Capturing errors and logs
  • Instrumenting code with custom metrics
  • Distributed application tracing

Most observability techniques can be applied to both desktop applications installed on users' own computers and distributed systems running on remote servers or cloud services. However, in the case of desktop applications, observability options are often limited due to privacy concerns. That's why we will focus mainly on observing the behavior and performance of code running on your own infrastructure.

Great observability cannot be achieved without proper tools, so let's first consider the technical requirements for this chapter.

Technical requirements

The following Python packages, which are mentioned in this chapter, can be downloaded from PyPI:

  • freezegun
  • sentry-sdk
  • prometheus-client
  • jaeger-client
  • Flask-OpenTracing
  • redis_opentracing

Information on how to install packages is included in Chapter 2, Modern Python Development Environments.

The code files for this chapter can be found at https://github.com/PacktPublishing/Expert-Python-Programming-Fourth-Edition/tree/main/Chapter%2012.

Capturing errors and logs

Standard output is the cornerstone of observability. That's because one of the simplest things every application can do is to read information from standard input and print information to standard output. That's why the first thing every programmer learns is usually how to print "Hello world!"

Despite the fact that standard input and output are so important, users of modern software rarely know anything about their existence. Desktop applications usually aren't invoked from the terminal and users often interact with them using graphical interfaces instead.

Web-based software usually runs on remote servers and users interact with it using web browsers or custom client software. In both cases, standard input and output are hidden from the user.

But although users don't see standard output, that doesn't mean it does not exist. Standard output is often used for logging detailed information about internal application state, warnings, and errors that happen during program execution. Moreover, standard output can easily be redirected to a file for storage and later processing. The simplicity and versatility of standard output make it one of the most flexible observability enablers. It also provides the most basic way to capture and inspect information about error details.

Although you could handle application output using bare print() function calls, good logging requires some consistent structure and formatting. Python comes with a built-in logging module that provides a basic yet powerful logging system. Before we dive deeper into good practices for capturing logs and errors, let's get to grips with some of the basics of the Python logging system.

Python logging essentials

The usage of the logging module is pretty straightforward. The first thing to do is to create a named logger instance and configure the logging system:

import logging
logger = logging.getLogger("my_logger")
logging.basicConfig()

A common idiom when defining loggers is to use the module name as the logger's name:

logger = logging.getLogger(__name__)

This pattern helps in managing logger configuration in a hierarchical way. See the Logging configuration section for more details on configuring loggers.

Every logger is named. If you don't provide the name argument, logging.getLogger() will return a special "root" logger that serves as a basis for other loggers. logging.basicConfig() allows the specification of additional logging options like logging capture level, message formatters, and log handlers.

Using your logger instance, you can now record log messages at a specified log level using either the log() method:

logger.log(logging.CRITICAL, "this is critical message")

Or various convenience methods associated with specific log levels:

logger.error("This is error message")
logger.warning("This is warning message")

The log level is a positive integer value where greater values correspond to messages of higher importance. If a logger is configured with a specific level, it will ignore messages of lower value. Python logging provides the following preconfigured levels with their respective integer values:

  • CRITICAL and FATAL (50): Reserved for messages indicating errors due to which the program is unlikely to be able to continue its operation. A practical example would be exhaustion of resources (like disk space) or an inability to connect with a crucial backing service (a database, for instance).
  • ERROR (40): Reserved for messages indicating serious errors that make the program unable to perform a specific task or function. A practical example would be an inability to parse user input or an intermediary network timeout.
  • WARNING or WARN (30): Reserved for messages indicating abnormal situations that the program has recovered from or situations that may lead to more serious problems in the near future. A practical example would be fixing malformed user input using a fallback value or an indication of low disk space.
  • INFO (20): Reserved for messages confirming that the program works as expected. It can be used, for instance, to output details about successful operations performed during program execution.
  • DEBUG (10): Reserved for very detailed debugging messages, allowing us to trace application behavior during debugging sessions.
  • NOTSET (0): A pseudo-level that captures all possible log levels.

As you can see, the default logging level values are defined in intervals of 10. This allows you to define custom levels in between existing levels if you need finer granularity.
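
For instance, if you needed a level between DEBUG and INFO, you could register one yourself. The following is a minimal sketch; the VERBOSE name and the value of 15 are arbitrary choices rather than logging built-ins:

import logging

VERBOSE = 15  # sits between DEBUG (10) and INFO (20)
logging.addLevelName(VERBOSE, "VERBOSE")

logging.basicConfig(level=VERBOSE)
logger = logging.getLogger("my_logger")

# custom levels have no convenience method, so use log() directly
logger.log(VERBOSE, "this is a verbose message")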

Every predefined log level except NOTSET has a dedicated convenience method in the logger instance:

  • critical() for the CRITICAL and FATAL levels
  • error() or exception() for the ERROR level (the latter automatically prints a traceback of the currently captured exception)
  • warning() for the WARNING and WARN levels
  • info() for the INFO level
  • debug() for the DEBUG level

The default configuration captures messages at or above the logging.WARNING level and outputs them to the standard error stream. The default format includes a textual representation of the severity level, the logger name, and a message, as in the following example:

ERROR:my_logger:This is error message
WARNING:my_logger:This is warning message
CRITICAL:my_logger:This is critical message

The format and output of loggers can be modified by specifying two types of logging components: log handlers and formatters.

Logging system components

The Python logging system consists of four main components:

  • Loggers: These are the entry points of the logging system. Application code uses loggers to produce messages for the logging system.
  • Handlers: These are the receivers of the logging system. Handlers are attached to loggers and are supposed to emit information to the desired destination (usually outside of the application). One logger can have multiple handlers.
  • Filters: These allow loggers or handlers to reject specific messages based on their contents. Filters can be attached to either loggers or handlers (see the sketch after this list).
  • Formatters: These transform raw logging messages into the desired format. Formatters can produce either human- or machine-readable messages. Formatters are attached to message handlers.
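
To see where filters fit in, here is a minimal sketch of a filter that rejects messages based on their content; filtering out messages that contain the word "password" is just an illustrative assumption:

import logging

class SecretsFilter(logging.Filter):
    def filter(self, record):
        # returning False tells the logging system to drop the record
        return "password" not in record.getMessage().lower()

logging.basicConfig()
logger = logging.getLogger("my_logger")
logger.addFilter(SecretsFilter())

logger.error("user password is hunter2")  # dropped by the filter
logger.error("user logged in")            # emitted as usual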

In essence, Python log messages travel in one direction: from the application, through loggers and handlers, to the desired destination (see Figure 12.1). Both loggers and handlers can reject messages through the mechanism of filters, and a logger can additionally stop propagation to its parent by setting its propagate attribute to False.

Figure 12.1: Topology of the Python logging system

The basic configuration of the logging system enabled with an empty logging.basicConfig() sets up the following hierarchy of components:

  • A root logger with the logging.WARNING level: All messages with a severity level lower than logging.WARNING will be ignored by default.
  • A single console handler attached to the root logger: This emits log messages to the standard error stream.
  • A simple formatter with "%(levelname)s:%(name)s:%(message)s" style: Every message emitted to the given output will contain the severity level name, the name of the logger, and some plain message text separated by a colon.

These default settings are a good starting point. The simple format allows for easy parsing and the standard console handler will work well with every console application. However, some applications require more structured message representation and/or the handling of custom message destinations. Such customization can be achieved by overriding the log handlers and formatters.

The standard logging library comes with three built-in logging handlers:

  • NullHandler: This handler does nothing. It can be used similarly to the /dev/null device on POSIX systems to discard all messages from the logger. It is often used as a default handler in libraries that use the Python logging system. In such a case, the library user is responsible for overriding the default NullHandler using their own logging configuration (see the sketch after this list).
  • StreamHandler: This handler emits messages to a given stream (a file-like object). If no stream is provided, StreamHandler will by default emit messages to the sys.stderr stream. After every message, the stream object will be flushed if it supports the flush() method.
  • FileHandler: This handler is a subclass of StreamHandler that emits messages to the file specified by the filename argument. It will take care of opening and closing the file. By default, it uses the append mode of a file and default system encoding.
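
The NullHandler pattern mentioned above usually boils down to a couple of lines in a library's top-level module; a minimal sketch with a hypothetical acme package:

# acme/__init__.py
import logging

logger = logging.getLogger(__name__)
# stay silent unless the application configures handlers of its own;
# this avoids "last resort" output from an unconfigured logging system
logger.addHandler(logging.NullHandler())

def do_work():
    logger.info("doing some library work")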

This is a very modest set of log handlers, but there's a logging.handlers module that offers over a dozen more advanced handlers. The following are the most important ones:

  • RotatingFileHandler: This is a file handler that can rotate the logfiles whenever the current logfile exceeds a certain size limit. This is a good extension of the basic FileHandler that is especially useful for verbose applications that can produce a lot of log messages in a short period of time. RotatingFileHandler can be configured to keep only a specified number of past log backup files and thus reduces the risk of logfiles overflowing the disk.
  • TimedRotatingFileHandler: This is similar to RotatingFileHandler but performs logfile rollovers at specified time intervals instead of monitoring the logfile size. It can be used to avoid disk overflow for applications that produce log messages at a rather constant and predictable pace. The advantage of TimedRotatingFileHandler is that past logfiles can be easily browsed by the date of their creation. The default interval of TimedRotatingFileHandler is one hour.
  • SysLogHandler: This handler emits logs to the syslog server. Syslog is a popular logging standard and many Linux distributions come with a syslog server running locally by default. Many applications and services support logging to the syslog server, so SysLogHandler can be used to unify log collections across many programs running on the same host. Also, using syslog allows you to offload the responsibility of logfile rotation and compaction to a single system logging facility.
  • SMTPHandler: This log handler emits a single SMTP email for every log message. It is commonly used together with the logging.ERROR severity level to deliver information about errors and exceptions to a specified email address. Such errors can be later reviewed at any time in the recipient's inbox. Using SMTPHandler is one of the simplest methods of error monitoring and allows programmers to be notified about issues even if they don't have direct access to the application runtime environment.
  • SocketHandler, DatagramHandler, and HTTPHandler: These are simple network handlers that allow you to deliver log messages over the network. SocketHandler emits log messages over a socket connection, DatagramHandler emits log messages as UDP datagrams, and HTTPHandler emits log messages as HTTP requests. These handlers can be used to build your own custom distributed log delivery mechanism, although it will work well only for low volumes of logs that do not require good delivery guarantees. For larger log volumes, it is advised to use a specialized log delivery mechanism such as syslog or another modern distributed log system.

We will discuss an example of a modern distributed log system in the Distributed logging section.

To see how one of those specialized log handlers works, let's assume we are building a small desktop application. Let's pretend our program runs with a graphical user interface and the user usually does not run it from the shell. We will be logging information about warnings and errors discovered during program execution. If there is a problem, the user will be able to find the logfile and send us a reference to it.

We don't know how many messages our application will produce. In order to avoid disk overflow, we will provide a rotating file handler. Files will be rotated daily, and we will keep a history of the last 30 days. Our logging handler configuration can be as simple as this:

import logging.handlers
from datetime import timedelta, datetime
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
root_logger.addHandler(
    logging.handlers.TimedRotatingFileHandler(
        filename="application.log",
        when="D",
        backupCount=30,
    )
)

The logging.getLogger() call (without a specific logger name) allows us to obtain a special root logger. The purpose of this logger is to provide default configuration and handlers for other loggers that do not have specific handlers attached. If the logger does not have its own handlers, its messages will be automatically propagated to the parent logger.

We will learn more about logger hierarchy and log message propagation in the Logging configuration section.

Once we have access to the root logger, we are able to provide the default configuration. The root_logger.setLevel(logging.INFO) call makes sure that the logger will emit only messages with a severity level greater than or equal to logging.INFO. That's quite a verbose setting. If you don't use custom log levels, the only more verbose setting would be logging.DEBUG.

We didn't configure the logging system using the logging.basicConfig() function so our root logger doesn't have a default handler. We add a TimedRotatingFileHandler() instance to the root logger using the addHandler() method. The when="D" argument specifies a daily rollover strategy and backupCount specifies the number of logfile backup files we want to keep on disk.

You probably don't have a full month to run the application just to see how rotated logfiles pile up on the disk. To speed things up, we will employ the freezegun package to trick Python into thinking that time is passing at an increased pace.

We used the freezegun package previously to test time-dependent code in Chapter 10, Testing and Quality Automation.

The following code simulates an application that produces one message every hour but speeds up the passage of time by a factor of 36,000:

from datetime import timedelta, datetime
import time
import logging
import freezegun
logger = logging.getLogger()  # the root logger configured with the rotating file handler above
def main():
    with freezegun.freeze_time() as frozen:
        while True:
            frozen.tick(timedelta(hours=1))
            time.sleep(0.1)
            logger.info(f"Something has happened at {datetime.now()}")
if __name__ == "__main__":
    main()

If you run the application in your shell, you won't see any output. But if you list all the files in the current directory, you will see that after a few seconds, new logfiles start appearing:

$ ls -al
total 264
drwxr-xr-x  35 swistakm  staff  1120  8 kwi 00:22 .
drwxr-xr-x   4 swistakm  staff   128  7 kwi 23:00 ..
-rw-r--r--   1 swistakm  staff   583  8 kwi 00:22 application.log
-rw-r--r--   1 swistakm  staff  2491  8 kwi 00:21 application.log.2021-04-07
-rw-r--r--   1 swistakm  staff  1272  7 kwi 23:59 application.log.2021-04-08

The application.log file is the current logfile and files ending with a date are historical backup files. If you let your program spin for a little longer, you will see that the number of historical backup files never exceeds 30. After some time, TimedRotatingFileHandler will start replacing old backups with new ones.

In our example, we have configured the log handler without any formatter. In such a case, the handler will emit messages as they are, without any extra information. Here's a sample taken from the most recent logfile:

Something has happened at 2021-04-08 17:31:54.085117
Something has happened at 2021-04-07 23:32:04.347385
Something has happened at 2021-04-08 00:32:04.347385
Something has happened at 2021-04-08 01:32:04.347385

It is missing some important context. We see the date of the message only because we included it in the log message, but we are missing information about the logger from which messages are originating and information about message severity. We can customize the output from a specific log handler by attaching a custom message formatter.

Formatters are attached to log handlers using the setFormatter() method of the handler object. The formatter should be a logging.Formatter() instance that accepts four initialization arguments:

  • fmt: This is a string formatting pattern for the output message. A string formatting pattern can reference any of the attributes of the logging.LogRecord class. It defaults to None, which is a plain text log message without any formatting.
  • datefmt: This is a date formatting pattern for representing the message timestamp. It accepts the same formatting directives as the time.strftime() function. It defaults to None, which translates to an ISO8601-like format.
  • style: This specifies the string formatting style used for the fmt argument. It can be either '%' (percent formatting), '{' (str.format() formatting), or '$' (string.Template formatting). It defaults to '%'.
  • validate: This specifies whether to validate the fmt formatting argument against the style argument. It defaults to True.

We can, for instance, provide a custom formatter that includes the message time, severity level name, logger name, and exact line of the log call along with the emitted message. Our logging setup would then look as follows:

import logging
import logging.handlers
root_logger = logging.getLogger()
root_logger.setLevel(logging.INFO)
formatter = logging.Formatter(
    fmt=(
        "%(asctime)s | %(levelname)s | "
        "%(name)s | %(filename)s:%(lineno)d | "
        "%(message)s"
    )
)
handler = logging.handlers.TimedRotatingFileHandler(
    filename="application.log",
    when="D",
    backupCount=30,
)
handler.setFormatter(formatter)
root_logger.addHandler(handler)

You can learn more about the available LogRecord attributes that can be referenced with custom formatters at https://docs.python.org/3/library/logging.html#logrecord-attributes.

With such a configuration, a message in our logfile would look as follows:

2021-04-08 02:03:50,780 | INFO | __main__ | logging_handlers.py:34 | Something has happened at 2021-04-08 00:03:50.780432

The choice of whether to use a specialized log handler usually depends on multiple factors, like the target operating system, the existing system logging infrastructure in your organization, the expected volume of logs, and the deployment methodology. Standalone applications, system services, or Docker containers usually log differently, and you need to take that into account. For instance, having historical logfiles easily accessible is definitely a practical thing, but managing a collection of thousands of logfiles in a large distributed system consisting of tens or hundreds of hosts can be a real hurdle. On the other hand, you can't expect end users of standalone applications to run a distributed logging system on their own personal computer just to use your application.

The same applies to log message formatting. Some log collection and processing systems can take advantage of structured message formats such as JSON, msgpack, or avro. Others will be able to parse and extract semantic elements of the message using customized parsing rules. Plaintext messages are easier to inspect and understand for humans but are harder to process using specialized log analysis software. Structured log messages are easier to process by machines but are harder to read with the naked eye.

Regardless of your current needs, you can be sure that your logging choices and preferences will change over time. That's why providing a logging setup by manually creating handlers and formatters is rarely a convenient method.

We've already learned about the logging.basicConfig() function. It sets up a reasonable logging default but also allows you to provide some shortcuts for defining the message format or the default handler selection. Unfortunately, it works only at the root logger level and doesn't allow you to target other loggers. In the next section we will learn about alternative logging configuration methods that allow you to define complex logging rules for applications of any size.
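
For reference, the rotating-file setup from the previous sections could be condensed into a single basicConfig() call; a minimal sketch, assuming the same format and handler as before:

import logging
import logging.handlers

logging.basicConfig(
    level=logging.INFO,
    format=(
        "%(asctime)s | %(levelname)s | "
        "%(name)s | %(filename)s:%(lineno)d | "
        "%(message)s"
    ),
    handlers=[
        logging.handlers.TimedRotatingFileHandler(
            filename="application.log", when="D", backupCount=30
        )
    ],
)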

Logging configuration

Using the root logger is a convenient method of defining the top-level configuration of your application logging system. Unfortunately, you will quickly learn that some libraries often use the Python logging module to emit information about important events happening inside them. At some point, you may find it necessary to fine-tune the logging behavior of those libraries as well.

Usually, libraries use their own logger names. The common pattern for naming loggers is to use a special __name__ attribute that contains the fully qualified name of a module:

import logging
logger = logging.getLogger(__name__)

For instance, the __name__ attribute inside the utils sub-module of the package acme will be acme.utils. If acme.utils defines its logger as logging.getLogger(__name__) then the name of that logger will be 'acme.utils'.

When you know the name of the logger, you can obtain it at any time and provide a custom configuration. Common use cases are:

  • Silencing the logger: Sometimes you are not interested in messages coming from a library at all. In your main application file, you can find a logger and silence it with the disabled attribute as in the following example:
    acme_logger = logging.getLogger("acme.utils")
    acme_logger.disabled = True
    
  • Overriding handlers: Libraries should not define their own handlers; that should be the responsibility of the library user. However, not every programmer is aware of good logging practices, and it sometimes happens that a useful third-party package comes with a logger that already has a handler attached. Such a logger will effectively ignore your root logger configuration.

    You can obtain such a logger and clear its handlers as in the following example:

    acme_logger = logging.getLogger("acme.utils")
    acme_logger.handlers.clear()
    
  • Changing logger verbosity: It may happen that some loggers are too verbose. Sometimes warnings inside a library are not severe enough to be included in the main application log. You can find a logger and override its severity level:
    acme_logger = logging.getLogger("acme.utils")
    acme_logger.setLevel(logging.CRITICAL)
    

Per-module loggers are also useful in non-library code. It is good practice to use them in larger applications consisting of multiple sub-packages or sub-modules, as this allows you to easily fine-tune verbosity and manage the handlers of multiple loggers. And the key to consistently managing Python logging configuration is understanding the parent-child relationship of loggers.

Whenever you create a logger with a name that contains periods, the logging module will actually build a hierarchy of loggers. If you, for instance, try to obtain a logger using logging.getLogger("acme.lib.utils"), the logging module will perform the following steps:

  1. It will first search for a logger registered under "acme.lib.utils" in a thread-safe manner. If it does not exist, it will create a fresh logger and register it under "acme.lib.utils".
  2. If a new logger has been created, it will iteratively remove the last segment from the logger name and search for the resulting name until it finds a registered logger. If a name doesn't have a logger registered under it, the logging module will register a special placeholder object. For "acme.lib.utils", it will search first for "acme.lib" and then for "acme". The first non-placeholder logger found this way will become the parent of "acme.lib.utils".
  3. If there are no non-placeholder loggers above the newly created logger, the root logger becomes the parent of this logger.

Moreover, the logging module ensures that existing placeholders are actively replaced with proper loggers the first time they are explicitly accessed with the logging.getLogger() function. In such a case, the parent-child relationships are retroactively updated. Thanks to this, concrete loggers can be configured in any order and regardless of their hierarchy.
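
The following minimal sketch demonstrates this fix-up behavior; the acme names are placeholders:

import logging

child = logging.getLogger("acme.lib.utils")  # registers placeholders for "acme.lib" and "acme"
parent = logging.getLogger("acme")           # replaces the "acme" placeholder with a real logger

print(child.parent is parent)                # True: "acme.lib" is still only a placeholder
print(parent.parent is logging.getLogger())  # True: the root logger is the parent of "acme"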

This parent-child relationship plays a role in the evaluation of logger handlers. Whenever you log a new message through a specific logger, handlers are invoked according to the following rules:

  1. If a logger has its own handlers, a message is passed to every handler:
    • If the logger has the propagate attribute set to True (default value), the message is propagated to the parent logger.
    • If the logger has the propagate attribute set to False, processing stops.
  2. If a logger does not have its own handlers, the message is passed to the parent logger.

Best practice is to define log handlers only at the top-level root logger. Otherwise, it will be hard to track all propagation rules and make sure that every message is logged only once. But sometimes specifying handlers on lower-level (per-module) loggers can be useful for applying special treatment to a very specific category of errors. For instance, your application may generally log using the default console handler, but if there is a business-critical module, you may want to attach SMTPHandler. That way you will be sure that all log messages coming from that module will be additionally delivered to you as SMTP emails.
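
A minimal sketch of such a setup is shown below; the mail server details and the acme.billing module name are hypothetical:

import logging
import logging.handlers

# general application logging goes to the console via the root logger
logging.basicConfig(level=logging.INFO)

# errors from the business-critical module are additionally sent by email
billing_logger = logging.getLogger("acme.billing")
smtp_handler = logging.handlers.SMTPHandler(
    mailhost="smtp.example.com",
    fromaddr="alerts@example.com",
    toaddrs=["oncall@example.com"],
    subject="acme.billing error",
)
smtp_handler.setLevel(logging.ERROR)
billing_logger.addHandler(smtp_handler)
# propagate stays True (the default), so these messages still reach the root handler too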

The logging hierarchy is also useful for controlling the verbosity of whole logger groups. For example, if the acme package contains multiple sub-loggers and none of them sets its own level, you can set the level of the parent "acme" logger to adjust the verbosity of every sub-logger at once.

Complex hierarchies can be intimidating, especially if you need to perform fine-tuning of loggers living in multiple modules. Creating individual loggers, handlers, and formatters using Python may sometimes require a substantial amount of boilerplate. That's why the logging.config module offers two functions that allow you to configure the whole Python logging system in a more declarative way:

  • fileConfig(): This takes the path of an INI-like configuration file. The syntax of that file is the same as the syntax of configuration files handled with the built-in configparser module.
  • dictConfig(): This takes a dictionary of configuration values.

    You can find more information about Python logging configuration options at https://docs.python.org/3/library/logging.config.html.

Both configuration methods assume similar configuration sections and options. The only difference is the format. The following is an example of the configuration file for logging with time-based file rotation:

[formatters]
keys=default
[loggers]
keys=root
[handlers]
keys=logfile
[logger_root]
handlers=logfile
level=INFO
[formatter_default]
format=%(asctime)s | %(levelname)s | %(name)s | %(filename)s:%(lineno)d | %(message)s
[handler_logfile]
class=logging.handlers.TimedRotatingFileHandler
formatter=default
kwargs={"filename": "application.log", "when": "D", "backupCount": 30}
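
Such a configuration file can then be loaded with a single call; a minimal sketch, assuming the file is saved as logging.ini:

import logging.config

# keep loggers created before this call (for example, module-level loggers) enabled
logging.config.fileConfig("logging.ini", disable_existing_loggers=False)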

And the following is the same configuration defined with use of the dictConfig() function:

logging.config.dictConfig({
    "version": 1,
    "formatters": {
        "default": {
            "format": (
                "%(asctime)s | %(levelname)s | "
                "%(name)s | %(filename)s:%(lineno)d | "
                "%(message)s"
            )
        },
    },
    "handlers": {
        "logfile": {
            "class": "logging.handlers.TimedRotatingFileHandler",
            "formatter": "default",
            "filename": "application.log",
            "when": "D",
            "backupCount": 30,
        }
    },
    "root": {
        "handlers": ["logfile"],
        "level": "INFO",
    }
})

With so many formatting options and a hierarchical logger structure, logging in Python can be quite complex. But it rarely has to be complex. You can reduce the complexity of logging by following some good practices explained in the next section.

Good logging practices

Python provides a flexible and powerful logging system. With so many configuration options, it is really easy to get tangled in unnecessary complexity. But in fact, logging should be as simple as possible. It is, in the end, the first and foremost tool to use to understand how your application works. You won't be able to understand how your code works if you're spending countless hours trying to understand what's wrong with your logging.

The key to good logging is to follow good practices. The following are common rules for efficient and simple logging in Python:

  • Use per-module loggers: By using logging.getLogger(__name__) to create new loggers, you make it simple to control the logging behavior of whole module trees. This practice is especially useful for libraries because it allows developers to tune library logging from within the main application logging configuration.
  • Use one event per line: If you use text-based outputs (stdout/stderr streams, files), a single log message should preferably fit a single line of text. This allows you to avoid most text buffering issues and the mixing of messages coming from multiple threads or processes logging to the same output. If you really need to fit multiline text into a message (for instance, for an error stack trace), you can still do that by leveraging structured log formats like JSON.
  • Offload work to system logging facilities or a distributed logging system: Log compaction, forwarding, or logfile rotation can all be done in Python code using custom log handlers, but that is rarely a wise choice. It is usually better to use a system logging facility like syslog or a dedicated log processing tool like logstash or fluentd. Dedicated tools will usually do these things better and more consistently.

    They will also remove a lot of log processing complexity from your own code and will reduce problems with logging in concurrent applications.

  • If possible, log directly to standard output or error streams: Writing to stdout or stderr is one of the most basic things every application can do. The same is true for writing to files, but not every environment will have a writable or persistent filesystem. If you need logs stored in files, you can simply use shell output redirection. If you need your logs delivered over the network, you can use dedicated log forwarders like logstash or fluentd.
  • Keep log handlers at the root logger level: Defining handlers and formatters for specific loggers other than the root logger makes a logging configuration unnecessarily complex. It is usually better to define a single console handler at the root level and leave the rest of the log processing logic to external logging utilities.
  • Avoid the urge to write a custom distributed logging system: Reliably delivering logs over a network in a large distributed system isn't a simple thing. It is usually better to leave this task to dedicated tools like syslog, fluentd, or logstash.
  • Use structured log messages: As the volume of logs grows, it becomes harder and harder to extract meaningful information from them. Dedicated log processing systems allow you to search through large volumes of textual information using various queries and even do sophisticated analytics of historical log events. Most of these systems can perform efficient parsing of plain textual log messages, but they always work better if they have access to structured log messages from the very beginning. Common structured message formats for logs are JSON and msgpack.

    Python's logging and logging.handlers modules lack a dedicated handler for structured log messages. python-json-logger is a popular package from PyPI that provides a formatter capable of emitting messages in JSON format. You can learn more about python-json-logger at https://github.com/madzak/python-json-logger.

    Another popular package is structlog. It extends the Python logging system with various utilities for capturing log context, processing messages, and outputting them in various structured formats. You can learn more about the structlog package at https://www.structlog.org/en/stable/index.html.

  • Configure logging in only one place: Logging configuration (understood as a collection of handlers and formatters) should be defined only in one part of your application. Preferably, that's the main application entrypoint script, like the __main__.py file or a WSGI/ASGI application module (for web-based applications).
  • Use basicConfig() whenever possible: If you follow most of the previous rules, the basicConfig() function is all that you need to perform a complete logging configuration. It works at the root logger level and by default defines the StreamHandler class attached to the stderr stream. It also allows for the convenient configuration of date and log message format strings and usually, you won't need anything more beyond that. If you want to use structured log messages you can easily use your own handler with a custom message formatter.
  • Prefer dictConfig() over fileConfig(): The syntax of logging configuration files supported by fileConfig() is a bit clunky and way less flexible than dictConfig(). It is, for instance, a common pattern to control the verbosity of application logging through command-line arguments or environment variables. Such functionality is definitely easier to implement through a dictionary-based logging configuration than through a file-based configuration.
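
As an illustration of the last point, the following minimal sketch reads the root logger level from a hypothetical LOG_LEVEL environment variable:

import logging.config
import os

logging.config.dictConfig({
    "version": 1,
    "handlers": {
        "console": {"class": "logging.StreamHandler"},
    },
    "root": {
        "handlers": ["console"],
        # verbosity is controlled from the outside and defaults to INFO
        "level": os.environ.get("LOG_LEVEL", "INFO"),
    },
})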

It may seem like a lot of rules, but generally, they are all about keeping things simple and not overengineering your logging system. If you keep the configuration simple, log to standard output, and use reasonable log formatting, you are usually set for success.

Good logging hygiene will prepare you for handling and processing arbitrarily large volumes of logs. Not every application will generate a lot of logs, but those that do generally need the support of dedicated log forwarding and processing systems. These systems are usually capable of handling distributed logs coming from hundreds or even thousands of independent hosts.

Distributed logging

With a single service or program running on a single host, you can easily go with a simple logging setup based on rotated logfiles. Almost every process supervision tool (like systemd or supervisord) allows for the redirection of stdout and stderr output to a specified logfile, so you don't have to open any files in your application. You also can easily offload the responsibility of compacting and rotating historical logfiles to a system utility like logrotate.

Such simple setups work well only for small applications. If you run multiple services on the same host, you will eventually want to handle the logs of various programs in a similar way. The first step toward organizing the logging chaos is to use a dedicated system logging facility like syslog. It will not only digest logs in a consistent format but will also provide command-line utilities to browse and filter past logs.

Things become more complex once you scale your application to run in a distributed manner across many hosts. The main challenges of distributed logging are the following:

  • Large volumes of logs: Distributed systems can produce enormous amounts of logs. The volume of logs grows with the number of processing nodes (hosts). If you want to store historical logs, you will need a dedicated infrastructure to keep all this data.
  • The need for centralized log access: It is really impractical to store logs only on the hosts that produce them. When you know that something is going wrong with your application, the last thing you want to do is to jump from host to host to find a logfile that has crucial information about the issue you're experiencing. What you need is a single place where you can access all logs available in your system.
  • Unreliable networks: Networks are unreliable by nature, and if you are sending data over one, you need to be prepared for temporary connection failures, network congestion, or unexpected communication delays, which may happen in any distributed system. With centralized log access, you will need to deliver logs from each individual host to a dedicated logging infrastructure. To avoid any data loss, you will need specialized software that can buffer log messages and retry delivery in the case of network failures.
  • Information discoverability and correlation: With large volumes of logs, it will be hard to find useful information in the sea of log messages of various severity. You will eventually need a way to perform custom queries against your log dataset that allows you to filter messages by their sources, logger names, severity levels, and textual content. Also, issues occurring in distributed systems are often very intricate. In order to understand what is really going on in your system, you will need tools that allow you to aggregate information from various log streams and perform statistical data analysis.

The complexity of the logging infrastructure depends mainly on the scale of your application and the observability needs you have. Depending on the capabilities of the logging infrastructure, we can define the following logging maturity model:

  • Level 0 (snowflake): Every application is a unique "snowflake." Each handles all extra logging activities, like storage, compaction, archiving, and retention, on its own. Level 0 offers almost no extra observability capabilities beyond the ability to open historical logs in a text editor or shell utility.
  • Level 1 (unified logging): Every service on the same host logs messages in a similar way to a system daemon, known destination, or disk. Basic log processing tasks like compaction or retention are delegated to system logging facilities or common utilities like logrotate. To view logs from a particular host, the developer needs to "shell into" the host or communicate with the dedicated service daemon running on that host.
  • Level 2 (centralized logging): Logs from every host are delivered to a dedicated host or a cluster of hosts to prevent data loss and for archival storage. Services may need to deliver logs to a specific destination (a logging daemon or a location on disk). It is possible to view a slice of all logs in a given time frame, sometimes with advanced filtering, aggregation, and analytics capabilities.
  • Level 3 (logging mesh): Services are completely unaware of the logging infrastructure and can log directly to their stdout or stderr streams. A logging mesh working on every host is capable of automatically discovering new log streams and/or logfiles from every running service. Upon the deployment of a new application, its output streams will be automatically included in the logging infrastructure. A centralized log access infrastructure is capable of performing sophisticated queries over arbitrarily large time spans. It is possible to run log aggregation and analytics on stored messages. Logging information is available in near real time.

Centralized logging and logging mesh infrastructures can offer similar capabilities with regard to log filtering, aggregation, and analytics. The key differentiator here is the ubiquity of logging in logging mesh infrastructures. In Level 3 logging architectures, everything that runs in your infrastructure and writes to its standard output or error streams automatically becomes a source of logging information and will be available in a centralized log access system.

There are many companies offering Level 2 or Level 3 logging infrastructures as a service. A popular example is AWS CloudWatch, which is well integrated with other AWS services. Other cloud providers also offer alternative solutions. If you have enough time and determination, you can also build a full-fledged Level 2 or Level 3 logging infrastructure from scratch using open-source tools.

A popular open-source choice for building full-fledged logging infrastructures is the Elastic Stack. This is a software ecosystem composed of multiple components: Logstash/Beats (log and metric ingestors/collectors), Elasticsearch (document store and search system), and Kibana (frontend application and dashboard for the Elastic Stack). You can learn more about the Elastic Stack at https://www.elastic.co/.

Capturing errors for later review

Logging is the most popular choice for storing and discovering information about errors and issues happening in the system. Mature logging infrastructures offer advanced correlation capabilities that allow you to have more insight into what is happening in the whole system at any given time. They can often be configured to alert on error message occurrence, which enables fast response times in situations of unexpected failure.

But the main problem of tracking errors using ordinary logging infrastructures is that they provide only a limited overview of the full context of the error occurrence. Simply put, you will only have access to the information that was included in a logging call.

In order to fully understand why a particular problem has occurred, you will need more information than was initially included in the logging call. With a traditional logging infrastructure, you would have to modify the code with additional logging, release the new version of the application, and wait for the error to occur again. If you missed something, you would have to repeat the process over and over again.

Another problem is the filtering of meaningful errors. It is not uncommon for large distributed systems to log tens of thousands of error messages or warnings daily. But not every error message has to be acted upon immediately. Teams doing maintenance often perform a bug triaging process to estimate the impact of an error. Usually, you concentrate on issues that occur very often or those that happen in a critical part of your application. For efficient triaging, you will need a system that is at least capable of deduplicating (merging) error events and assessing their frequency. Not every logging infrastructure will be able to do that efficiently.

That's why for tracking errors in your application, you should usually use a dedicated error tracking system that can work independently from your logging infrastructure.

One of the popular tools that gives a great error tracking experience is Sentry. It is a battle-tested service for tracking exceptions and collecting crash reports. It is available as open source, written in Python, and originated as a tool for backend Python developers. Now, it has outgrown its initial ambitions and has support for many more languages, including PHP, Ruby, JavaScript, Go, Java, C, C++, and Kotlin. It still remains one of the most popular error tracking tools for many Python web developers.

You can learn more about Sentry at https://sentry.io/.

Sentry is available in a paid software-as-a-service model, but it is also an open-source project, so it can be hosted for free on your own infrastructure. The library that provides integration with Sentry is sentry-sdk (available on PyPI). If you haven't worked with it yet and want to test it without having to run your own Sentry server, you can easily sign up for a free trial on the Sentry service site. Once you have access to a Sentry server and have created a new project, you will obtain a string called a Data Source Name (DSN). This DSN is the minimal configuration setting needed to integrate your application with Sentry. It contains the protocol, credentials, server location, and your organization/project identifier in the following format:

'{PROTOCOL}://{PUBLIC_KEY}:{SECRET_KEY}@{HOST}/{PATH}{PROJECT_ID}'

Once you have your DSN, the integration is pretty straightforward, as shown in the following code:

import sentry_sdk
sentry_sdk.init(
    dsn='https://<key>:<secret>@app.getsentry.com/<project>'
)

From now on, every unhandled exception in your code will be automatically delivered to the Sentry server using the Sentry API. The Sentry SDK uses the HTTP protocol and delivers errors as compressed JSON messages over a secure HTTP connection (HTTPS). By default, messages are sent asynchronously from a separate thread to limit the impact on application performance. Error messages can then be later reviewed in the Sentry portal.

Error capture happens automatically on every unhandled exception, but you can also explicitly capture exceptions as in the following example:

try:
    1 / 0
except Exception as e:
    sentry_sdk.capture_exception(e)
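
If the automatically captured context is not enough, the SDK also lets you attach extra tags and data to events through scopes. The following is a minimal sketch; the tag and extra values are arbitrary examples, and sentry_sdk.init() is assumed to have been called as shown earlier:

import sentry_sdk

with sentry_sdk.push_scope() as scope:
    # extra context that will be attached to any event captured in this scope
    scope.set_tag("payment_provider", "acme-pay")
    scope.set_extra("order_id", 1234)
    try:
        1 / 0
    except Exception as e:
        sentry_sdk.capture_exception(e)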

The Sentry SDK has numerous integrations with the most popular Python frameworks, such as Django, Flask, Celery, and Pyramid. These integrations will automatically provide additional context that is specific to the given framework. If your web framework of choice does not have dedicated support, the sentry-sdk package provides generic WSGI middleware that makes it compatible with any WSGI-based web server, as shown in the following code:

from sentry_sdk.integrations.wsgi import SentryWsgiMiddleware
sentry_sdk.init(
    dsn='https://<key>:<secret>@app.getsentry.com/<project>'
)
# ...
application = SentryWsgiMiddleware(application)

Exceptions in Python web applications

Commonly, web applications do not exit on unhandled exceptions because HTTP servers are obliged to return an error response with a status code from the 5XX group if any server error occurs. Most Python web frameworks behave this way by default. In such cases, the exception is, in fact, handled either at the internal web framework level or by the WSGI server middleware. Either way, this will usually still result in the exception stack trace being printed (typically on standard output). The Sentry SDK is aware of the WSGI conventions and will automatically capture such exceptions as well.

The other notable integration is the ability to track messages logged through Python's built-in logging module. Enabling such support requires only the following few additional lines of code:

import logging
import sentry_sdk
from sentry_sdk.integrations.logging import LoggingIntegration
sentry_logging = LoggingIntegration( 
    level=logging.INFO,
    event_level=logging.ERROR,
)
sentry_sdk.init(
    dsn='https://<key>:<secret>@app.getsentry.com/<project>',
    integrations=[sentry_logging],
)

The capturing of logging messages may have caveats, so make sure to read the official documentation on the topic if you are interested in such a feature. This should save you from unpleasant surprises.

If you decide to use Sentry or any other similar service, carefully consider the "build versus buy" decision. As you may have already heard, "there ain't no such thing as a free lunch." If you decide to run an error tracking service (or any utility system) on your own infrastructure, you will eventually pay additional infrastructure costs. Such an additional system will also be just another service to maintain and update. Maintenance = additional work = costs!

As your application grows, the number of exceptions grows, so you will be forced to scale Sentry (or any other system) as you scale your product. Fortunately, Sentry is a very robust project, but it will not give you any value if it's overwhelmed with too much load. Also, keeping Sentry prepared for a catastrophic failure scenario, where thousands of crash reports can be sent per second, is a real challenge. So, you must decide which option is really cheaper for you and whether you have enough resources to do all of this by yourself.

There is, of course, no "build versus buy" dilemma if security policies in your organization deny sending any data to third parties. If so, just host Sentry or similar software on your own infrastructure. There are costs, of course, but they are ones that are definitely worth paying.

Capturing errors allows for later review, eases debugging, and lets you respond quickly when errors suddenly start to accumulate. But it is a reactive approach: by the time you react, users have likely already been exposed to the buggy software.

Mindful developers want to observe their applications all the time and be able to react before actual failures occur. This type of application observability can be implemented by instrumenting your code with custom metrics. Let's learn about various types of custom metrics and systems used for metrics collection.

Instrumenting code with custom metrics

If we want to keep our application running smoothly, we need to be proactive. Observability isn't only about being able to do post-mortem analysis of logs and error reports. It is also about collecting various metrics that provide insights about service load, performance, and resource usage. If you monitor how your application behaves during normal operation, you will be able to spot anomalies and anticipate failures before they happen.

The key in monitoring software is defining metrics that will be useful in determining the general service health. Typical metrics can be divided into a few categories:

  • Resource usage metrics: Typical metrics are memory, disk, network, and CPU time usage. You should always monitor these metrics because every infrastructure has limited resources. That's true even for cloud services, which provide seemingly unlimited resource pools. If one service has abnormal resource usage, it can starve other services and even induce cascading failures in your infrastructure. This is especially important when running code on cloud infrastructure where you often pay for the resources that you use. A resource-hungry service that goes rogue can cost you a lot of money.
  • Load metrics: Typical metrics are the number of connections and the number of requests in a given time frame. Most services can accept a limited number of parallel connections and their performance degrades after reaching a specific threshold. The usual pattern for overloaded services is a gradual performance drop followed by sudden unavailability after reaching the critical load point. Monitoring load metrics allows you to decide whether you need to scale out or not. Periods of low load can also be an opportunity to scale your infrastructure in and reduce operating costs.
  • Performance metrics: Typical metrics are request or task processing times. Performance metrics are often correlated with load metrics but can also highlight performance hotspots in code that needs optimization. Good application performance improves user experience and allows you to save money on infrastructure costs because performant code will likely need fewer resources to run. The continuous monitoring of performance allows discovering performance regressions that can happen when introducing new changes in an application.
  • Business metrics: These are the key performance metrics of your business. Typical examples might be the number of signups or sold items in a given time frame. Anomalies in these metrics allow you to discover functional regressions that slipped through the testing process (a defective cart checkout process, for instance) or evaluate dubious changes in the application interface that may confuse frequent users.

Some of those metrics can sometimes be derived from application logs, and many mature logging infrastructures can perform rolling aggregations on log event streams. This practice works best for load metrics and performance metrics, which often can be reliably derived from the access logs of web servers, proxies, or load balancers. The same approach can be used also for business metrics, although that requires careful consideration of log formats for key business events and can be very brittle.

Log processing has the least utility in the case of resource usage metrics. That's because resource metrics are based on regular probing of resource usage, and log processing is concerned with streams of discrete events that often happen on a rather irregular basis. Also, centralized log infrastructures are not well suited for storing time series data.

Usually, to monitor application metrics, we use dedicated infrastructure that is independent of logging infrastructure. Advanced metric monitoring systems offer complex metric aggregation and correlation capabilities and are focused on dealing with time series data. Metric infrastructure systems also offer faster access to the most recent data because they are more concerned with observing live systems than logging infrastructures. With logging infrastructures, it is more common to observe noticeable lag in data propagation.

There are generally two architectures of metric monitoring systems:

  • Push architectures: In this architecture, the application is responsible for pushing data to the system. The push destination is usually a remote metrics server or a local metrics daemon (a metric forwarder) that in turn pushes metrics to a higher-level daemon on a different server (a layered architecture). The configuration is usually distributed: each service needs to know the location of the destination daemon or server.
  • Pull architectures: In this architecture, the application is responsible for exposing a metrics endpoint (usually an HTTP endpoint) and the metric daemon or server pulls information from known services. The configuration can be either centralized (where the main metric server knows the locations of monitored services) or semi-distributed (the main server knows the location of metric forwarders that pull metrics from lower layers).

Both types of architectures can have mesh-like capabilities by utilizing a service discovery mechanism. For instance, in push architectures, metric forwarders can advertise themselves as services capable of forwarding metrics to other services. In pull architectures, monitored services usually advertise themselves in the service discovery catalog as services that provide metrics to be collected.

One of the popular choices for monitoring applications is Prometheus.

Using Prometheus

Prometheus is a prime example of pull-based metrics infrastructure. It is a complete system capable of collecting metrics, defining forwarding topologies (through so-called metric exporters), data visualization, and alerting. It comes with an SDK for multiple programming languages, including Python.

The architecture of Prometheus (presented in Figure 12.2) consists of the following components:

  • Prometheus server: This is a standalone metrics server that is capable of storing time series data, responding to metrics queries, and retrieving metrics (pulling) from metrics exporters, monitored services (jobs), and other Prometheus servers. A Prometheus server provides an HTTP API for querying data and includes a simple interface that allows us to create various metrics visualizations.
  • Alert manager: This is an optional service that is able to store the definition of alerting rules and trigger notifications when specific alerting conditions are met.
  • Metric exporters: These are the processes that the Prometheus server can pull data from. An exporter can be any service that uses the Prometheus SDK to expose a metrics endpoint. There are also standalone exporters that can probe information directly from hosts (like general host usage information), expose metrics for services without Prometheus SDK integration (like databases), or act as a push gateway for short-lived processes (like cron jobs). Every Prometheus server can act as a metrics exporter for other Prometheus servers.

The above three components are the bare minimum for having a fully functional metrics infrastructure with data visualizations and reliable alerting. Moreover, many Prometheus deployments extend this architecture with the following extra components:

  • Service discovery catalog: A Prometheus server is capable of reading information from various service discovery solutions to locate available metrics exporters. The usage of service discovery mechanisms simplifies configuration and allows for a mesh-like developer experience. Popular choices for service discovery catalogs are Consul, ZooKeeper, and etcd. Container orchestration systems (like Kubernetes) often have built-in service discovery mechanisms.
  • Dashboarding solution: Prometheus servers have a simple web interface capable of performing simple data visualizations. These basic capabilities are often not enough for operations teams. Thanks to its open API, Prometheus can easily be extended with a custom dashboarding solution. The most popular choice is Grafana, which can also be easily integrated with other metrics systems and data sources.

Figure 12.2: Architecture of a typical Prometheus deployment

To see how applications can be easily extended with Prometheus monitoring, we will take one of the applications written in previous chapters and instrument it with the Prometheus SDK. We will also define a small Docker Compose setup that will allow us to evaluate the whole solution locally.

For evaluating metrics systems, it is essential to have an application that actually does something useful. One practical example that served us well in various parts of the book was the pixel tracking service from Chapter 5, Interfaces, Patterns, and Modularity. You should be familiar with how it works, so we will use it as the basis of our experiment.

The heart of our application is the tracking.py file. It includes module imports, definitions of views, HTTP route bindings, and an instance of a Flask application object. If we skip the view functions, it looks roughly as follows:

from flask import Flask, request, Response
from flask_injector import FlaskInjector
from interfaces import ViewsStorageBackend
import di
app = Flask(__name__)
@app.route("/track")
def track(storage: ViewsStorageBackend):
    ...
@app.route("/stats")
def stats(storage: ViewsStorageBackend):
    ...
@app.route("/test")
def test():
    ...
if __name__ == "__main__":
    FlaskInjector(app=app, modules=[di.RedisModule()])
    app.run(host="0.0.0.0", port=8000)

For the sake of brevity, we are omitting parts of the application code that won't deal directly with metrics. Full code examples of the pixel tracking service can be found in Chapter 5, Interfaces, Patterns, and Modularity. You can also find all the source code for this section (including the configuration, Dockerfile, and docker-compose.yml) in the Chapter 12/05 - Using Prometheus directory of the code repository for this book (see the Technical requirements section).

In order to make our application observable by the Prometheus server, we need to embed a Prometheus metrics exporter in it. We will do that using the official prometheus-client package available on PyPI. We start by defining some metric objects using classes from the prometheus_client module:

from prometheus_client import Summary, Gauge, Info
REQUEST_TIME = Summary(
    "request_processing_seconds",
    "Time spent processing requests"
)
AVERAGE_TOP_HITS = Gauge(
    "average_top_hits",
    "Average number of top-10 page counts "
)
TOP_PAGE = Info(
    "top_page",
    "Most popular referrer"
)

REQUEST_TIME is a Summary metric that can be used to track the size and number of events. We will use it to track the time and number of processed requests.

AVERAGE_TOP_HITS is a Gauge metric that can be used to track how a single value changes over time. We will use it to track the average number of hit counts of the 10 most popular pages.

TOP_PAGE is an Info metric that can be used to expose any textual information. We will use it to track the most popular pixel-tracked page at any given time.
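prometheus-client offers more metric types than the three we use here, most notably Counter and Histogram. They don't appear in the pixel-tracking example, but a minimal sketch of how such metrics could be defined and updated looks as follows (the metric names are hypothetical):

from prometheus_client import Counter, Histogram

# Counters only ever go up; this one is exported as track_errors_total.
TRACK_ERRORS = Counter(
    "track_errors",
    "Number of failed tracking requests"
)
# Histograms record observations in configurable buckets.
RESPONSE_SIZE = Histogram(
    "response_size_bytes",
    "Size of HTTP responses in bytes"
)

TRACK_ERRORS.inc()          # increment the counter by one
RESPONSE_SIZE.observe(512)  # record a single observation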

Individual metric values can be updated using various code patterns. One of the more popular ways is to use specific metric methods as decorators. This is the most convenient way to measure time spent in a function. We will use this method to track the REQUEST_TIME metric as in the following example:

@app.route("/track")
@REQUEST_TIME.time()
def track(storage: ViewsStorageBackend):
    ...
@app.route("/stats")
@REQUEST_TIME.time()
def stats(storage: ViewsStorageBackend):
    ...
@app.route("/test")
@REQUEST_TIME.time()
def test():
    ...

Another way is to use specific metric object methods as normal function calls. This is a common technique for counters, gauges, and info metrics. We will use this pattern to track the values of the AVERAGE_TOP_HITS and TOP_PAGE metrics. We can obtain quite good estimates of both inside the stats() view function:

@app.route("/stats")
@REQUEST_TIME.time()
def stats(storage: ViewsStorageBackend):
    counts: dict[str, int] = storage.most_common(10)
    AVERAGE_TOP_HITS.set(
        sum(counts.values()) / len(counts) if counts else 0
    )
    TOP_PAGE.info({
        "top": max(counts, default="n/a", key=lambda x: counts[x])
    })
    return counts

When the metrics are defined, we can finally embed the metric exporter. This can be done either by starting a separate metrics thread using the prometheus_client.start_http_server() function or by using a framework-specific integration handler. prometheus-client comes with nice Flask support through Werkzeug's DispatcherMiddleware class.
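Before looking at the Flask integration, here is a minimal sketch of the standalone-thread option (assuming port 8001 is free for the metrics endpoint):

from prometheus_client import start_http_server

# Serve the metrics endpoint from a background thread on a separate port,
# independently of the application's own HTTP server.
start_http_server(8001)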

Werkzeug is a toolkit for creating web applications based on the WSGI interface. Flask is built using Werkzeug, so it is compatible with Werkzeug middlewares. You can learn more about Werkzeug at https://palletsprojects.com/p/werkzeug/.

We will use the DispatcherMiddleware approach:

from prometheus_client import make_wsgi_app
from werkzeug.middleware.dispatcher import DispatcherMiddleware
app.wsgi_app = DispatcherMiddleware(app.wsgi_app, {
    '/metrics': make_wsgi_app()
})
if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8000)

prometheus-client uses thread-safe in-memory storage for metric objects. That's why it works best with a threaded concurrency model (see Chapter 6, Concurrency). Due to CPython threading implementation details (mainly the Global Interpreter Lock), multiprocessing is definitely a more popular concurrency model for web applications. It is possible to use prometheus-client in multiprocessing applications, although it requires slightly more setup. For details, see the official client documentation at https://github.com/prometheus/client_python.
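As a rough illustration (not something we need for our single-process example), the multiprocess mode boils down to pointing all worker processes at a shared metrics directory through an environment variable (see the client documentation for its exact name in your version) and building the exporter from a multiprocess-aware registry:

from prometheus_client import CollectorRegistry, make_wsgi_app, multiprocess

# Assumes every worker process has been started with the environment
# variable that points prometheus-client at a shared, writable directory.
registry = CollectorRegistry()
multiprocess.MultiProcessCollector(registry)
metrics_app = make_wsgi_app(registry)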

When we start our application and visit http://localhost:8000/test and http://localhost:8000/stats in a web browser, we will automatically populate metric values. If you then visit the metrics endpoint at http://localhost:8000/metrics, you will see output that may look as follows:

# HELP python_gc_objects_collected_total Objects collected during gc
# TYPE python_gc_objects_collected_total counter
python_gc_objects_collected_total{generation="0"} 595.0
python_gc_objects_collected_total{generation="1"} 0.0
python_gc_objects_collected_total{generation="2"} 0.0
# HELP python_info Python platform information
# TYPE python_info gauge
python_info{implementation="CPython",major="3",minor="9",patchlevel="0",version="3.9.0"} 1.0
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 1.88428288e+08
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.13999999999999999
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 7.0
# HELP request_processing_seconds Time spent processing requests
# TYPE request_processing_seconds summary
request_processing_seconds_count 1.0
request_processing_seconds_sum 0.0015633490111213177
# HELP request_processing_seconds_created Time spent processing requests
# TYPE request_processing_seconds_created gauge
request_processing_seconds_created 1.6180166638851087e+09
# HELP average_top_hits Average number of top-10 page counts
# TYPE average_top_hits gauge
average_top_hits 6.0
# HELP top_page_info Most popular referrer
# TYPE top_page_info gauge
top_page_info{top="http://localhost:8000/test"} 1.0

As you can see, the output is a human- and machine-readable representation of the current metric values. Besides our custom metrics, it also includes useful default metrics regarding garbage collection and resource usage. These are exactly the same metrics that the Prometheus server will pull from our service.

Now it's time to set up the Prometheus server. We will use Docker Compose so you won't have to install Prometheus manually on your own development host. Our docker-compose.yml file will include the definition of three services:

version: "3.7"
services:
  app:
    build:
      context: .
    ports:
      - 8000:8000
    volumes:
      - ".:/app/"
  redis:
    image: redis
  prometheus:
    image: prom/prometheus:v2.15.2
    volumes:
      - ./prometheus/:/etc/prometheus/
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--web.console.libraries=/usr/share/prometheus/console_libraries'
      - '--web.console.templates=/usr/share/prometheus/consoles'
    ports:
      - 9090:9090
    restart: always

The app service is our main application container. We will build it from a local Dockerfile containing the following code:

FROM python:3.9-slim
WORKDIR app
RUN pip install \
 Flask==1.1.2 \
 redis==3.5.3 \
 Flask_Injector==0.12.3 \
 prometheus-client==0.10.1
ADD *.py ./
CMD python3 tracking.py --reload

The redis service is a container running the Redis data store. It is used by our pixel-tracking application to store information about visited page counts.

Last but not least is the prometheus service with our Prometheus server container. We override the default prometheus image command to provide custom configuration locations. We need a custom configuration file mounted as a Docker volume to tell Prometheus about our metrics exporter location. Without configuration, the Prometheus server won't know where to pull metrics from. We don't have a service discovery catalog, so we will use a simple static configuration:

global:
  scrape_interval:     15s
  evaluation_interval: 15s
  external_labels:
      monitor: 'compose'
scrape_configs:
  - job_name: 'prometheus'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'app'
    scrape_interval: 5s
    static_configs:
      - targets: ["app:8000"]

We can start the whole solution using the docker-compose command as follows:

$ docker-compose up

The initial docker-compose run may take a little longer because Docker will have to download or build images that do not exist on your filesystem yet.

When all services are running, you will notice in the docker-compose output that the Prometheus server is asking the app service for metrics every few seconds:

app_1 | 172.21.0.3 - - [10/Apr/2021 01:49:09] "GET /metrics HTTP/1.1" 200 -
app_1 | 172.21.0.3 - - [10/Apr/2021 01:49:14] "GET /metrics HTTP/1.1" 200 -
app_1 | 172.21.0.3 - - [10/Apr/2021 01:49:19] "GET /metrics HTTP/1.1" 200 -
app_1 | 172.21.0.3 - - [10/Apr/2021 01:49:24] "GET /metrics HTTP/1.1" 200 -

You can access the Prometheus server web interface at http://localhost:9090. It allows you to browse registered services and perform simple metrics visualization (as in Figure 12.3). The following is an example query that will give you an average response time per request over a 5-minute window:

rate(request_processing_seconds_sum[5m])
/ rate(request_processing_seconds_count[5m])

The results, displayed in Prometheus, will look similar to the following:

Figure 12.3: Example of visualizations in the Prometheus server web interface

As you can see, the Prometheus UI can present plots of your custom metrics. It gives an overview of how your application's behavior changes over time and can be a useful tool for spotting anomalies or ensuring that your application meets the desired performance levels. Give it a try and test a few queries: exercise the application endpoints and see how the data is reflected in your queries.
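If you need a starting point, the following query should show the raw request throughput (handled requests per second, averaged over a 5-minute window):

rate(request_processing_seconds_count[5m])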

Using metrics allows you to have better knowledge of how your application works in production environments. It can also provide you with insight into how well it fulfills its purpose by tracking business metrics. Advanced queries allow you to correlate metrics of various types and from various services. This is especially useful in distributed systems, which can have many moving parts.

But using a metrics infrastructure isn't the only way to get better visibility into a distributed system. A completely different approach is to measure each individual transaction and the relationships between related transactions happening in various system components. This methodology is called distributed tracing, and it combines elements of logging and traditional metrics infrastructures.

Distributed application tracing

Distributed and networked systems are inevitable in any large-scale infrastructure. That's because you can put only a limited amount of hardware in a single server rack. If you have to serve a large number of users concurrently, you will have to scale out eventually. Also, having all of your software running on a single host is risky from a reliability and availability standpoint. If one machine fails, the whole system goes down. And it isn't uncommon to see whole datacenters brought down by natural disasters or other unpredictable events. This means that highly available systems are often forced to be spread over multiple datacenters located in different regions, or even managed by different providers, just to ensure enough redundancy.

Another reason for distributed systems is splitting infrastructures into independent domain services in order to break up large codebases and enable multiple teams to work efficiently without stepping on each other's toes. That allows you to reduce elaborate change management processes, simplify deployments of large systems, and limit the amount of orchestration between independent development teams.

But distributed systems are hard. Networks are unreliable and introduce delays. Unnecessary communication roundtrips add up with every service-to-service hop and can have a noticeable impact on application performance. Also, every new host increases operational costs and the failure surface. Organizations with large and distributed architectures usually have dedicated teams responsible for keeping them healthy.

This is also true for clients of cloud services like AWS, Azure, or Google Cloud Platform, as the complexity of these cloud environments requires skilled and trained professionals that know how to harness the configuration in an auditable and maintainable way.

But the real nightmare of distributed systems (from a developer's perspective) is observability and debugging. If you have many applications and services running on multiple hosts, it is really hard to get a good overview of what's actually happening inside of your system. A single web transaction can, for instance, go through several layers of a distributed architecture (web balancers, caching proxies, backend APIs, queues, databases) and involve a dozen independent services in the process (see Figure 12.4):

Figure 12.4: Example of an elaborate distributed application architecture with many inter-service connections

With architectures that complex, it is hard to know what the real source of a user's problems is. Logging and error tracking via dedicated systems can help discover unhandled exceptions but don't help much if erroneous behavior does not produce any observable notifications. Also, errors in a single service can cascade into errors in dependent services. Classic error tracking and logging solutions may produce a lot of noise when such a cascade happens, and it will then be harder to track down the real root cause of a problem.

When it comes to understanding performance issues, classic metrics systems can highlight obvious performance hotspots but usually do so on a single service level. Common reasons for performance problems in distributed systems are inefficient service usage patterns and not the isolated performance characteristics of a single service. Consider for instance the following example of a distributed transaction from an imaginary eCommerce website:

  1. The user requests data about an order summary from Checkout Service.
  2. Checkout Service retrieves the user session identifier from the request headers and asks Authentication Service to verify if it is valid:
    • Authentication Service verifies the session identifier against a list of active sessions in the relational database acting as a session store.
    • If the session is valid, Authentication Service returns a successful response.
  3. If the session is valid, Checkout Service retrieves the list of items in the user's cart that is stored in another relational database.
  4. For every entry in the cart, Checkout Service asks Inventory Service if there are enough items in the warehouse to fulfill the order. To ensure authorized access, it forwards the user's session identifier in the request header:
    • For every request, Inventory Service asks Authentication Service to validate the session identifier.
    • If the session is valid, Inventory Service checks the warehouse inventory state for the requested item type in its own database and returns the result.
  5. If the order can be fulfilled, Checkout Service makes a series of requests to Pricing Service to obtain the current item prices and tax charges for every entry in the cart:
    • For every request, Pricing Service asks Authentication Service to validate the session identifier.
    • If the session is valid, Pricing Service looks up the current price and tax charges for the requested item type in its own database and returns the result.
  6. Checkout Service then returns a complete order summary for the user's cart contents.

Do you see the pattern there? There is a lot of redundancy, network hopping, and potential sources of problems. And there is definitely a lot of room for improvement, like reducing unnecessary requests to Authentication Service or batching queries to dependent services.

It is debatable whether distributed applications should be designed in such a granular way. Distributed architecture has its merits, like scalability on an individual service level, distributed code ownership, and (usually) faster release processes, but it introduces a whole new class of problems to deal with. Unfortunately, with the advent of microservice architectures, inter-service communication usually gets out of hand quickly. Moreover, with independent teams working on separate services, it is not uncommon to see sudden performance regressions caused by additional communication links.

Traditional observability techniques like collecting logs and per-service metrics may not be good enough to provide complete insight into a distributed system's operation. What we usually need instead is the ability to track whole user interactions as atomic distributed transactions that span across many services and many hosts within the whole system. This technique is called distributed tracing and often combines the features of traditional logging and metrics solutions.

As with logging and metrics collection, there are good paid SaaS solutions that provide whole-stack distributed tracing capabilities for architectures of virtually any size. These may be pricey, but there are also good open-source solutions for building a distributed tracing infrastructure on your own. One of the more popular open-source implementations of distributed tracing is Jaeger.

As with every other observability solution, you need to carefully evaluate the "build versus buy" factors when deciding to use such a system. There are high-quality, open-source distributed tracing solutions available, but they all require some expertise and the maintenance of your own infrastructure. Hiring skilled professionals costs money, and hardware doesn't come for free either. Depending on your scale and needs, it may actually be cheaper to pay another company to run and maintain the whole distributed tracing infrastructure for you.

Distributed tracing with Jaeger

Many distributed tracing solutions are based on the OpenTracing standard. OpenTracing is an open protocol and a set of libraries for various programming languages that can be used to instrument code with the ability to send transaction traces to an OpenTracing-compatible server. Jaeger is one of the most popular implementations of such servers.

OpenTelemetry is a newer standard that is supposed to supersede both OpenTracing (an open protocol for trace collection) and OpenCensus (an open protocol for metrics and trace collection). OpenTelemetry is backward-compatible with its predecessors, and future versions of Jaeger are expected to support the OpenTelemetry standard as well.

The key concept of the OpenTracing standard is the span. A span is a building block of a distributed transaction trace that represents a logical operation within the transaction. Every trace consists of one or more spans, and spans can reference other spans in a nested fashion. Every span contains the following information:

  • The name of the operation that the span represents
  • A pair of timestamps that define the start and end of a span
  • A set of span tags that allow you to query, filter, and analyze traces
  • A set of span logs that are specific to the operation within the span
  • A span context that is a container for cross-process information carried over between spans

In distributed systems, spans usually represent complete request-response cycles within a single service. OpenTracing libraries allow you also to easily define smaller spans that can be used to track smaller logical blocks of processing like database queries, file access requests, or individual function calls.
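To make the span anatomy more concrete, the following is a minimal sketch of attaching tags and logs to a custom span. It assumes that tracer is an already initialized OpenTracing-compatible tracer (we will create one with jaeger-client in a moment), and the operation, tag, and log names are purely illustrative:

# Assumes `tracer` is an initialized OpenTracing-compatible tracer object.
with tracer.start_span("cache-lookup") as span:
    # Tags describe the span and can be used to filter and analyze traces.
    span.set_tag("cache.backend", "redis")
    # Logs are timestamped key-value events scoped to this particular span.
    span.log_kv({"event": "cache_miss", "key": "user:42"})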

OpenTracing provides a reference opentracing package on PyPI, but we can't use it alone to interact with the Jaeger server; it is just an empty shell to be extended by actual OpenTracing implementations. As Jaeger users, we will use the official jaeger-client package instead. Instrumenting code with jaeger-client is really simple.

Despite the need to use implementation-specific libraries like jaeger-client, the OpenTracing architecture is designed in a way that makes it very easy to migrate between various implementations.

It would be great if we had some distributed service to evaluate distributed tracing with Jaeger, but our simple pixel-tracking application will do the job too. It maintains connections to the Redis server and is complex enough to give us the opportunity to create our own custom spans.

The integration with jaeger-client starts with initializing the tracer object:

from jaeger_client import Config
tracer = Config(
    config={
        'sampler': {
            'type': 'const',
            'param': 1,
        },
    },
    service_name="pixel-tracking",
).initialize_tracer()

The 'sampler' section in the tracer config defines the event sampling strategy. Our setting uses a constant sampling strategy with a value of 1. This means that all transactions will be reported to the Jaeger server.

Jaeger provides multiple event sampling strategies. The correct sampling strategy depends on the expected service load and the scale of the Jaeger deployment. You can read more about Jaeger sampling at https://www.jaegertracing.io/docs/1.22/sampling/.
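For instance, a probabilistic sampler that reports roughly 10% of transactions could be configured as follows (a sketch only; the right ratio depends entirely on your traffic):

from jaeger_client import Config

tracer = Config(
    config={
        'sampler': {
            # Report roughly 10% of transactions instead of all of them.
            'type': 'probabilistic',
            'param': 0.1,
        },
    },
    service_name="pixel-tracking",
).initialize_tracer()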

Every configuration should be created with the service_name argument. It allows Jaeger to tag and identify spans coming from the same service, which makes traces easier to read. In our case, we've set service_name to "pixel-tracking".

Once we have an instance of tracer, we can start defining spans. The most convenient way to do this is through the context manager syntax as in the following example:

@app.route("/stats")
def stats(storage: ViewsStorageBackend):
    with tracer.start_span("storage-query"):
        return storage.most_common(10)

Applications created with the use of web frameworks usually have multiple request handlers, and you usually want to track all of them. Instrumenting every request handler manually would be unproductive and error-prone. That's why it is generally better to use a framework-specific OpenTracing integration that will automate this process. For Flask, we can use the Flask-OpenTracing package from PyPI. You can enable this integration by simply creating a FlaskTracing class instance in your main application module:

from flask import Flask
from flask_opentracing import FlaskTracing
from jaeger_client import Config
app = Flask(__name__)
tracer = Config(
    config={'sampler': {'type': 'const', 'param': 1}},
    service_name="pixel-tracking",
).initialize_tracer()
FlaskTracing(tracer, app=app)

Another useful technique is to enable automatic integration for libraries that we use to communicate with external services, databases, and storage engines. That way, we will be able to track outgoing transactions and OpenTracing will build a relationship between spans coming from various services.

In our example, we have only one service, so there is no opportunity to correlate distributed spans. But we use Redis as a datastore, so we could at least instrument queries made to Redis. There's a dedicated package on PyPI for that purpose named redis_opentracing. To enable this integration, you need to perform only a single function call:

import redis_opentracing
redis_opentracing.init_tracing(tracer)

You can find all the source code for this section (including the configuration, Dockerfile, and docker-compose.yml) in the Chapter 12/ Distributed tracing with Jaeger directory of the code repository for this book (see the Technical requirements section).

redis_opentracing (together with the other tracing packages) is a new dependency of our application, so we will have to modify our Dockerfile as follows:

FROM python:3.9-slim
WORKDIR app
RUN pip install \
 Flask==1.1.2 \
 redis==3.5.3 \
 Flask_Injector==0.12.3 \
 prometheus-client==0.10.1 \
 jaeger-client==4.4.0 \
 opentracing==2.4.0 \
 Flask-OpenTracing==1.1.0
RUN pip install --no-deps redis_opentracing==1.0.0
ADD *.py ./
CMD python3 tracking.py --reload

Note that we've installed redis_opentracing with pip install --no-deps. This tells pip to ignore package dependencies. Unfortunately, at the time of writing, the install_requires metadata of redis_opentracing does not list opentracing 2.4.0 as a supported version, although it works with it pretty well. Our approach is a trick to ignore this dependency conflict. Hopefully, a future release of redis_opentracing will resolve this issue.

Last but not least, we need to start the Jaeger server. We can do that locally using Docker Compose. We will use the following docker-compose.yml file to start our pixel-tracking application, Redis server, and Jaeger server:

version: "3.7"
services:
  app:
    build:
      context: .
    ports:
      - 8000:8000
    environment:
      - JAEGER_AGENT_HOST=jaeger
    volumes:
      - ".:/app/"
  redis:
    image: redis
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "6831:6831/udp"
      - "16686:16686"

The Jaeger server (similarly to the Prometheus server) consists of a few components. The jaegertracing/all-in-one:latest Docker image is a convenient packaging of all those components for simple deployments or for local experimentation. Note that we used the JAEGER_AGENT_HOST environment variable to tell the Jaeger client about the location of our Jaeger server. This is a pretty common pattern among observability SDKs that allows for the easy swapping of solutions without affecting the configuration of the monitored application.
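If you prefer explicit configuration over environment variables, jaeger-client can also take the agent location directly in its Config object. A minimal sketch (assuming the same jaeger hostname from our Docker Compose setup) could look like this:

from jaeger_client import Config

tracer = Config(
    config={
        'sampler': {'type': 'const', 'param': 1},
        # Points the reporter at the Jaeger agent instead of relying on
        # the JAEGER_AGENT_HOST environment variable.
        'local_agent': {
            'reporting_host': 'jaeger',
            'reporting_port': 6831,
        },
    },
    service_name="pixel-tracking",
).initialize_tracer()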

Once we have everything set up, we can start the whole solution using Docker Compose:

$ docker-compose up

The initial docker-compose run may take a little longer because Docker will have to download or build images that do not exist on your filesystem yet.

When all services are running, you can visit http://localhost:8000/test and http://localhost:8000/stats in your browser to generate some traces. You can visit http://localhost:16686/ to access the web interface of the Jaeger server and browse the collected traces.

The example trace, displayed in Jaeger, may look similar to the following:

Figure 12.5: Example visualization of a distributed trace in the Jaeger web interface with multiple nested spans visible

As you can see, the pixel-tracking: stats transaction consists of three spans:

  • stats: This is the topmost span of the stats() endpoint, added automatically by the FlaskTracing middleware.
  • storage-query: This is the middle span, added manually using the tracer.start_span("storage-query") context manager.
  • ZREVRANGE: This is the span added automatically by the redis_opentracing integration.

Every trace comes with a precise time measurement and additional tags like IP address, library version, or span-specific information (database statement, HTTP response code, and so on). These elements are indispensable in pinpointing performance issues and understanding communication patterns between components of distributed systems.

Summary

We've discussed the three main observability techniques of modern applications: logging, metrics collection, and distributed tracing. All of them have their advantages, but logging is definitely the most important way of collecting information from your application. That's because it is simple, does not require any special infrastructure (although it is good to have one), and is least likely to fail you.

But logging has some limitations. Unstructured logging messages make it harder to extract insights from logs. Logging is also not suitable for the periodic probing of information about resource usage and performance. It is good for auditing purposes and post-mortem analysis after major failures but is rarely helpful in tracking current information and reacting to sudden events.

That's why metrics collection systems are a natural and valuable extension of logging infrastructures. They allow you to collect information in real time, create custom metrics, and create powerful operational dashboards that can be actively monitored by operations teams. They can also be used to some extent to provide an overview of business metrics, although larger organizations usually prefer dedicated analytic software. Metrics systems are indispensable in monitoring application performance.

Last but not least, there are distributed tracing infrastructures, which are great in complex systems with many moving pieces. These are indispensable in situations where issues arise at the meeting points between multiple communicating services. They give a great overview of the information flow within the system and make it easy to spot communication anomalies.

Each type of observability solution provides a useful view of what's going on within your system but no single one solves all of your problems. It would be best if we could have all of them, but unfortunately, not every organization can afford to build or buy all of these solutions. If you have constrained resources and need to pick only one, you should definitely consider logging first, because it will usually have the greatest return on your investment.

When you have an efficient observability solution in place, you will start discovering intricate problems and performance issues. Some of them will really surprise you. In the end, the best way to evaluate software is to run it in the desired environment and under real-life load. You will usually notice that most of your performance issues are concentrated in isolated hotspots. Once you have identified them, it's time for detailed performance analysis and optimization—the topic of our next chapter.
