13

Application Integration: Configuration

Python's concept of an extensible library gives us rich access to numerous computing resources. The language provides avenues to make even more resources available. This makes Python programs particularly strong at integrating components to create sophisticated composite processing. In this chapter, we'll address the fundamentals of creating complex applications: managing configuration files, logging, and a design pattern for scripts that permits automated testing.

These new recipes are based on recipes shown earlier. Specifically, in the Using argparse to get command-line input, Using cmd for creating command-line applications, and Using the OS environment settings recipes in Chapter 6, User Inputs and Outputs, some specific techniques for creating top-level (main) application scripts were shown. In Chapter 10, Input/Output, Physical Format, and Logical Layout, we looked at filesystem input and output. In Chapter 12, Web Services, we looked at creating servers, which are the main applications that receive requests from clients.

All of these examples show some aspects of application programming in Python. There are some additional techniques that are helpful, such as processing configuration from files. In the Using argparse to get command-line input recipe in Chapter 6, User Inputs and Outputs, we showed techniques for parsing command-line arguments. In the Using the OS environment settings recipe, we touched on other kinds of configuration details. In this chapter, we'll look at a number of ways to handle configuration files. There are many file formats that can be used to store long-term configuration information:

  • The INI file format as processed by the configparser module.
  • The YAML file format is very easy to work with but requires an add-on module that's not currently part of the Python distribution. We'll look at this in the Using YAML for configuration files recipe.
  • The Properties file format is typical of Java programming and can be handled in Python without writing too much code. The syntax overlaps with Python scripts.
  • For Python scripts, a file with assignment statements looks a lot like a Properties file, and is very easy to process using the compile() and exec() functions. We'll look at this in the Using Python for configuration files recipe.
  • A Python module with class definitions is a variation that uses Python syntax but isolates the settings into separate classes. This can be processed with the import statement. We'll look at this in the Using class-as-namespace for configuration recipe.

This chapter will extend some of the concepts from Chapter 7, Basics of Classes and Objects, and Chapter 8, More Advanced Class Design, and apply the idea of the command design pattern to Python programs.

In this chapter, we'll look at the following recipes:

  • Finding configuration files
  • Using YAML for configuration files
  • Using Python for configuration files
  • Using class-as-namespace for configuration values
  • Designing scripts for composition
  • Using logging for control and audit output

We'll start with a recipe for handling multiple configuration files that must be combined. This gives users some helpful flexibility. From there, we can dive into the specifics of various common configuration file formats.

Finding configuration files

Many applications will have a hierarchy of configuration options. The foundation of the hierarchy is often the default values built into a particular release. These might be supplemented by server-wide (or cluster-wide) values from centralized configuration files. There might be user-specific files, or perhaps even configuration files provided when starting a program.

In many cases, configuration parameters are written in text files, so they are persistent and easy to change. The common tradition in Linux is to put system-wide configuration in the /etc directory. A user's personal changes would be in their home directory, often named ~username or $HOME.

In this recipe, we'll see how an application can support a rich hierarchy of locations for configuration files.

Getting ready

The example we'll use is a web service that provides hands of cards to users. The service is shown in several recipes throughout Chapter 12, Web Services. We'll gloss over some of the details of the service so we can focus on fetching configuration parameters from a variety of filesystem locations.

We'll follow the design pattern of the Bash shell, which looks for configuration files in the following places:

  1. It starts with the /etc/profile file.
  2. After reading that file, it looks for one of these files, in this order:
    1. ~/.bash_profile
    2. ~/.bash_login
    3. ~/.profile

In a POSIX-compliant operating system, the shell expands the ~ to be the home directory for the logged-in user. This is defined as the value of the HOME environment variable. In general, the Python pathlib module can handle this automatically via the Path.home() method. This technique applies to Windows and Linux derivatives, as well as macOS.

The design pattern from the Bash shell can use a number of separate files. When we include defaults that are part of the release, application-wide settings as part of an installation, and personal settings, we can consider three levels of configuration. This can be handled elegantly with a mapping and the ChainMap class from the collections module.

In later recipes, we'll look at ways to parse and process specific formats of configuration files. For the purposes of this recipe, we won't pick a specific format. Instead, we'll assume that a function, load_config_file(), has been defined that will load a specific configuration mapping from the contents of the file. The function looks like this:

def load_config_file(config_path: Path) -> Dict[str, Any]:
    """Loads a configuration mapping object with the contents
    of a given file.
    :param config_path: Path to be read.
    :returns: mapping with configuration parameter values
    """
    # Details omitted.

We'll look at a number of different ways to implement this function.

Why so many choices?

There's a side topic that sometimes arises when discussing this kind of design—Why have so many choices? Why not specify exactly two places?

The answer depends on the context for the application. When creating entirely new software, it may be possible to limit the choices to exactly two locations. However, when replacing legacy applications, it's common to have a new location that's better in some ways than the legacy location. This often means the legacy location still needs to be supported. After several such evolutionary changes, it's common to see a number of alternative locations for files.

Also, because of variations among Linux distributions, it's common to see variations that are typical for one distribution, but atypical for another. And, of course, when dealing with Windows, there will be variant file paths that are unique to that platform.

How to do it...

We'll make use of the pathlib module to provide a handy way to work with files in various locations. We'll also use the collections module to provide the very useful ChainMap class:

  1. Import the Path class and the collections module. There are several type hints that are also required:
    from pathlib import Path
    import collections
    from typing import TextIO, Dict, Any, ChainMap
    
  2. Define an overall function to get the configuration files:
    def get_config() -> ChainMap[str, Any]:
    
  3. Create paths for the various locations of the configuration files. These are called pure paths because there's no relationship with the filesystem. They start as names of potential files:
        system_path = Path("/etc") / "profile"
        local_paths = [
            Path.home() / ".bash_profile",
            Path.home() / ".bash_login",
            Path.home() / ".profile",
        ]
    
  4. Define the application's built-in defaults:
        configuration_items = [
            dict(
                some_setting="Default Value",
                another_setting="Another Default",
                some_option="Built-In Choice",
            )
        ]
    
  5. Each individual configuration file is a mapping from keys to values. Each of these mapping objects is combined to form a list; this becomes the final ChainMap configuration mapping. We'll assemble the list of maps by appending items, and then reverse the order after the files are loaded so that the last loaded file becomes the first in the map.
  6. If a system-wide configuration file exists, load this file:
        if system_path.exists():
            configuration_items.append(
                load_config_file(system_path))
    
  7. Iterate through other locations looking for a file to load. This loads the first file that it finds and uses a break statement to stop after the first file is found:
        for config_path in local_paths:
            if config_path.exists():
                configuration_items.append(
                    load_config_file(config_path))
                break
    
  8. Reverse the list and create the final ChainMap. The list needs to be reversed so that the local file is searched first, then the system settings, and finally the application default settings:
        configuration = collections.ChainMap(
            *reversed(configuration_items))
        return configuration
    

Once we've built the configuration object, we can use the final configuration like a simple mapping. This object supports all of the expected dictionary operations.

How it works...

One of the most elegant features of any object-oriented language is being able to create collections of objects. In this case, one of these collections of objects includes filesystem Path objects.

As noted in the Using pathlib to work with file names recipe in Chapter 10, Input/Output, Physical Format, and Logical Layout, the Path object has a resolve() method that can return a concrete Path built from a pure Path. In this recipe, we used the exists() method to determine if a concrete path could be built. The open() method, when used to read a file, will resolve the pure Path and open the associated file.
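As a small illustration of that distinction (a sketch, not part of the recipe's code), a pure path can be built and combined before the filesystem is ever touched:

from pathlib import Path

config_path = Path.home() / ".profile"  # a pure path; no filesystem access yet
if config_path.exists():  # checks whether a concrete path can be built
    content = config_path.read_text()  # reading resolves the path and opens the file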

In the Creating dictionaries – inserting and updating recipe in Chapter 4, Built-In Data Structures Part 1: Lists and Sets, we looked at the basics of using a dictionary. Here, we've combined several dictionaries into a chain. When a key is not located in the first dictionary of the chain, then later dictionaries in the chain are checked. This is a handy way to provide default values for each key in the mapping.

Here's an example of creating a ChainMap manually:

>>> import collections
>>> config = collections.ChainMap(
...     {'another_setting': 2},
...     {'some_setting': 1},
...     {'some_setting': 'Default Value',
...      'another_setting': 'Another Default',
...      'some_option': 'Built-In Choice'})

The config object is built from three separate mappings. The first might be details from a local file such as ~/.bash_login. The second might be system-wide settings from the /etc/profile file. The third contains application-wide defaults.

Here's what we see when we query this object's values:

>>> config['another_setting'] 
2 
>>> config['some_setting'] 
1 
>>> config['some_option'] 
'Built-In Choice' 

The value for any given key is taken from the first instance of that key in the chain of maps. This is a very simple way to have local values that override system-wide values that override the built-in defaults.

There's more...

In the Mocking external resources recipe in Chapter 11, Testing, we looked at ways to mock external resources so that we could write a unit test that wouldn't accidentally delete files. A test for the code in this recipe needs to mock the filesystem resources by mocking the Path class.

To work with pytest test cases, it helps to consolidate the Path operations into a fixture that can be used to test the get_config() function:

from pathlib import Path
from pytest import *  # type: ignore
from unittest.mock import Mock, patch, mock_open, MagicMock, call
import Chapter_13.ch13_r01
@fixture  # type: ignore
def mock_path(monkeypatch, tmpdir):
    mocked_class = Mock(
        wraps=Path,
        return_value=Path(tmpdir / "etc"),
        home=Mock(return_value=Path(tmpdir / "home")),
    )
    monkeypatch.setattr(
        Chapter_13.ch13_r01, "Path", mocked_class)
    (tmpdir / "etc").mkdir()
    (tmpdir / "etc" / "profile").write_text(
        "exists", encoding="utf-8")
    (tmpdir / "home").mkdir()
    (tmpdir / "home" / ".profile").write_text(
        "exists", encoding="utf-8")
    return mocked_class

This mock_path fixture creates a module-like Mock object that can be used instead of the Path class. When the code under test calls Path(), it will always get the etc directory created in the tmpdir location, which is where the system-wide profile file is placed. The home attribute of this Mock object makes sure that Path.home() will provide a name that's part of the temporary directory created by tmpdir. By pointing the Path references to the temporary directory that's unique to the test, we can then load up this directory with any combination of files.

This fixture creates two directories, and a file in each directory. One file is tmpdir/etc/profile. The other is tmpdir/home/.profile. This allows us to check the algorithm for finding the system-wide profile as well as a user's local profile.

In addition to a fixture that sets up the files, we'll need one more fixture to mock the details of the load_config_file() function, which loads one of the configuration files. This allows us to define multiple implementations, confident that the overall get_config() function will work with any implementation that fills the contract of load_config_file().

The fixture looks like this:

@fixture  # type: ignore
def mock_load_config(monkeypatch):
    mocked_load_config_file = Mock(return_value={})
    monkeypatch.setattr(
        Chapter_13.ch13_r01,
        "load_config_file", 
        mocked_load_config_file
    )
    return mocked_load_config_file

Here are some of the tests that will confirm that the path search works as advertised. Each test starts by applying two patches to create a modified context for testing the get_config() function:

def test_get_config(mock_load_config, mock_path):
    config = Chapter_13.ch13_r01.get_config()
    assert mock_path.mock_calls == [
        call("/etc"),
        call.home(),
        call.home(),
        call.home(),
    ]
    assert mock_load_config.mock_calls == [
        call(mock_path.return_value / "profile"),
        call(mock_path.home.return_value / ".profile"),
    ]

The two fixtures mock the Path class and also mock the load_config_file() function that the get_config() function relies on. The assertion shows that several path requests were made, and two individual files were eventually loaded. This is the purpose behind this particular get_config() function; it loads two of the files it finds. To be complete, of course, the test suite needs to have two more fixtures and two more tests to examine the other two locations for user-specific configuration files.
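Here's a sketch of what one of those additional fixtures might look like. It follows the same pattern as mock_path, but creates only a ~/.bash_profile file, so a different branch of the search can be exercised; the fixture name is illustrative:

@fixture  # type: ignore
def mock_path_bash_profile(monkeypatch, tmpdir):
    # Same wrapping as mock_path, but only ~/.bash_profile exists.
    mocked_class = Mock(
        wraps=Path,
        return_value=Path(tmpdir / "etc"),
        home=Mock(return_value=Path(tmpdir / "home")),
    )
    monkeypatch.setattr(
        Chapter_13.ch13_r01, "Path", mocked_class)
    (tmpdir / "home").mkdir()
    (tmpdir / "home" / ".bash_profile").write_text(
        "exists", encoding="utf-8")
    return mocked_class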

See also

  • In the Using YAML for configuration files and Using Python for configuration files recipes in this chapter, we'll look at ways to implement the load_config_file() function.
  • In the Mocking external resources recipe in Chapter 11, Testing, we looked at ways to test functions such as this, which interact with external resources.
  • Variations on the implementation are covered in the Using YAML for configuration files and Using Python for configuration files recipes of this chapter.
  • The pathlib module can help with this processing. This module provides the Path class definition, which provides a great deal of sophisticated information about the OS's files. For more information, see the Using pathlib to work with filenames recipe in Chapter 10, Input/Output, Physical Format, and Logical Layout.

Using YAML for configuration files

Python offers a variety of ways to package application inputs and configuration files. We'll look at writing files in YAML notation because this format is elegant and simple.

It can be helpful to represent configuration details in YAML notation.

Getting ready

Python doesn't have a YAML parser built in. We'll need to add the pyyaml project to our library using the pip package management system. Here's what the installation looks like:

(cookbook) slott@MacBookPro-SLott Modern-Python-Cookbook-Second-Edition % python -m pip install pyyaml
Collecting pyyaml
  Downloading https://files.pythonhosted.org/packages/64/c2/b80047c7ac2478f9501676c988a5411ed5572f35d1beff9cae07d321512c/PyYAML-5.3.1.tar.gz (269kB)
     |████████████████████████████████| 276kB 784kB/s 
Building wheels for collected packages: pyyaml
  Building wheel for pyyaml (setup.py) ... done
  Created wheel for pyyaml: filename=PyYAML-5.3.1-cp38-cp38-macosx_10_9_x86_64.whl size=44624 sha256=7450b3cc947c2afd5d8191ebe35cb1c8cdd5e212e0478121cd49ce52c835ddaa
  Stored in directory: /Users/slott/Library/Caches/pip/wheels/a7/c1/ea/cf5bd31012e735dc1dfea3131a2d5eae7978b251083d6247bd
Successfully built pyyaml
Installing collected packages: pyyaml
Successfully installed pyyaml-5.3.1

The elegance of the YAML syntax is that simple indentation is used to show the structure of the document. Here's an example of some settings that we might encode in YAML:

query: 
  mz: 
    - ANZ532 
    - AMZ117 
    - AMZ080 
url: 
  scheme: http 
  netloc: forecast.weather.gov 
  path: /shmrn.php 
description: > 
  Weather forecast for Offshore including the Bahamas 

This document can be seen as a specification for a number of related URLs that are all similar to http://forecast.weather.gov/shmrn.php?mz=ANZ532. The document contains information about building the URL from a scheme, net location, base path, and several query strings. The yaml.load() function can load this YAML document; it will create the following Python structure:

{'description': 'Weather forecast for Offshore including the Bahamas\n',
 'query': {'mz': ['ANZ532', 'AMZ117', 'AMZ080']},
 'url': {'netloc': 'forecast.weather.gov',
         'path': '/shmrn.php',
         'scheme': 'http'}}

This dict-of-dict structure can be used by an application to tailor its operations. In this case, it specifies a sequence of URLs to be queried to assemble a larger weather briefing.
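As a sketch of how an application might consume this structure (the function name here is illustrative, not part of the recipe), the standard library's urllib.parse module can assemble the individual URLs:

from typing import Any, Dict, List
from urllib.parse import urlencode, urlunsplit

def zone_urls(config: Dict[str, Any]) -> List[str]:
    # Build one URL for each forecast zone named in the query section.
    url = config["url"]
    return [
        urlunsplit(
            (url["scheme"], url["netloc"], url["path"],
             urlencode({"mz": zone}), "")
        )
        for zone in config["query"]["mz"]
    ]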

We'll often use the Finding configuration files recipe, shown earlier in this chapter, to check a variety of locations for a given configuration file. This flexibility is often essential for creating an application that's easy to use on a variety of platforms.

In this recipe, we'll build the missing part of the previous example, the load_config_file() function. Here's the template that needs to be filled in:

def load_config_file(config_path: Path) -> Dict[str, Any]:
    """Loads a configuration mapping object with contents
    of a given file.
    :param config_path: Path to be read.
    :returns: mapping with configuration parameter values
    """
    # Details omitted.

In this recipe, we'll fill in the space held by the Details omitted line to load configuration files in YAML format.

How to do it...

This recipe will make use of the yaml module to parse a YAML-format file. This will create a dictionary from the YAML-format source. This can be part of building a ChainMap of configurations:

  1. Import the yaml module along with the Path definition and the type hints required by the load_config_file() function definition:
    from pathlib import Path
    from typing import Dict, Any
    import yaml
    
  2. Use the yaml.load() function to load the YAML-syntax document:
    def load_config_file(config_path: Path) -> Dict[str, Any]:
        """Loads a configuration mapping object with contents
        of a given file.
        :param config_path: Path to be read.
        :returns: mapping with configuration parameter values
        """
        with config_path.open() as config_file:
            document = yaml.load(
                config_file, Loader=yaml.SafeLoader)
        return document
    

This function can be fit into the design from the Finding configuration files recipe to load a configuration file using YAML notation.
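Here's a small, self-contained usage sketch; the file name and contents are illustrative:

from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "forecast.yaml"
    sample.write_text("query:\n  mz:\n    - ANZ532\n", encoding="utf-8")
    settings = load_config_file(sample)
    # settings == {'query': {'mz': ['ANZ532']}}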

How it works...

The YAML syntax rules are defined at http://yaml.org. The idea of YAML is to write JSON-like data structures in a more flexible, human-friendly syntax. JSON is a special case of the more general YAML syntax.

The trade-off here is that spaces and line breaks in JSON generally don't matter, because visible punctuation shows the structure of the document. In YAML's block styles, line breaks and indentation determine the structure of the document; this use of white-space means that line breaks matter in YAML documents.

The essential data structures available in JSON syntax are as follows:

  • Sequence: [item, item, ...]
  • Mapping: {key: value, key: value, ...}
  • Scalar:
    • String: "value"
    • Number: 3.1415926
    • Literal: true, false, and null

JSON syntax is one style of YAML; it's called a flow style. In this style, the document structure is marked by explicit indicators. The syntax requires {…} and […] to show the structure.

The alternative that YAML offers is block style. The document structure is defined by line breaks and indentation. Furthermore, string scalar values can use plain, quoted, and folded styles of syntax. Here is how the alternative YAML syntax works:

  • Block sequence: We preface each line of a sequence with a -. This looks like a bullet list and is easy to read. When loaded, it will create a dictionary with a list of strings in Python: {'zoneid': ['ANZ532', 'AMZ117', 'AMZ080']}. Here's an example:
          zoneid: 
            - ANZ532 
            - AMZ117 
            - AMZ080 
    
  • Block mapping: We can use simple key: value syntax to associate a key with a simple scalar. We can use key: on a line by itself; the value is indented on the following lines. This creates a nested dictionary that looks like this in Python: {'url': {'scheme': 'http', 'netloc': 'marine.weather.gov'}}. Here's an example:
          url: 
            scheme: http 
            netloc: marine.weather.gov 
    

Some more advanced features of YAML will make use of this explicit separation between key and value:

  • For short string scalar values, we can leave them plain, and the YAML rules will simply use all the characters with leading and trailing white-space stripped away. The examples all use this kind of assumption for string values.
  • Quotes can be used for strings, exactly as they are in JSON, when necessary.
  • For longer strings, YAML introduces the | prefix; the lines after this are preserved with all of the spacing and newlines intact. It also introduces the > prefix, which preserves the words as a long string of text—any newlines are treated as single white-space characters. This is common for running text.
  • In some cases, the value may be ambiguous. For example, a US ZIP code is all digits—22102. This should be understood as a string, even though the YAML rules will interpret it as a number. Quotes, of course, can be helpful. To be even more explicit, a local tag of !!str in front of the value will force a specific data type. !!str 22102, for example, assures that the digits will be treated as a string object.
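Here's a small sketch showing how these scalar styles load; it uses the SafeLoader from earlier in this recipe, and the key names are illustrative:

import yaml

text = """
title: >
  Weather forecast for Offshore
  including the Bahamas
notice: |
  Line one.
  Line two.
zip: !!str 22102
"""
document = yaml.load(text, Loader=yaml.SafeLoader)
# document['title'] == 'Weather forecast for Offshore including the Bahamas\n'
# document['notice'] == 'Line one.\nLine two.\n'
# document['zip'] == '22102' -- a string, not the number 22102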

There's more...

There are a number of additional features in YAML that are not present in JSON:

  • The comments, which begin with # and continue to the end of the line. They can go almost anywhere. JSON doesn't tolerate comments.
  • The document start, which is indicated by the --- line at the start of a new document. This allows a YAML file to contain a stream of separate documents.
  • Document end. An optional ... line is the end of a document in a stream of documents.
  • Complex keys for mappings. JSON mapping keys are limited to strings. YAML allows mapping keys to be considerably more complex. We have to honor Python's restriction that keys must be immutable.

Here are some examples of these features. In this first example, we'll look at a stream that contains two documents.

Here is a YAML file with two separate documents, something that JSON does not handle well:

>>> import yaml 
>>> yaml_text = ''' 
... --- 
... id: 1 
... text: "Some Words." 
... --- 
... id: 2 
... text: "Different Words." 
... ''' 
>>> document_iterator = yaml.load_all(yaml_text, Loader=yaml.SafeLoader)
>>> document_1 = next(document_iterator) 
>>> document_1['id'] 
1 
>>> document_2 = next(document_iterator) 
>>> document_2['text'] 
'Different Words.' 

The yaml_text string is a stream with two YAML documents, each of which starts with ---. The load_all() function returns an iterator that loads the documents one at a time. An application must iterate over the results to process each of the documents in the stream.

YAML provides a way to create complex objects for mapping keys. What's important is that Python requires a hashable, immutable object for a mapping key. This means that a complex key must be transformed into an immutable Python object, often a tuple. In order to create a Python-specific object, we need to use a more complex local tag. Here's an example:

>>> mapping_text = '''
... ? !!python/tuple ["a", "b"]
... : "value"
... '''
>>> yaml.load(mapping_text, Loader=yaml.UnsafeLoader)
{('a', 'b'): 'value'}

This example uses ? and : to mark the key and value of a mapping. We've done this because the key is a complex object. The key value uses a local tag, !!python/tuple, to create a tuple instead of the default, which would have been a list. The text of the key uses a flow-type YAML value, ["a", "b"].

Because this steps outside the default type mappings, we also have to use the special UnsafeLoader. This is a way of acknowledging that a wide variety of Python objects can be created this way.

JSON has no provision for a set collection. YAML allows us to use the !!set tag to create a set instead of a simple sequence. The items in the set must be identified by a ? prefix because they are considered keys of a mapping for which there are no values.

Note that the !!set tag is at the same level of indentation as the values within the set collection. It's indented inside the dictionary key of data_values:

>>> import yaml
>>> set_text = '''
... document:
...     id: 3
...     data_values:
...       !!set
...       ? some
...       ? more
...       ? words
... '''
>>> some_document = yaml.load(set_text, Loader=yaml.SafeLoader)
>>> some_document['document']['id']
3
>>> some_document['document']['data_values'] == {
...     'some', 'more', 'words'}
True

The !!set local tag modifies the following sequence to become a set object instead of the default list object. The resulting set is equal to the expected Python set object, {'some', 'more', 'words'}.

Items in a set must be immutable objects. While the YAML syntax allows creating a set of mutable list objects, it's impossible to build the document in Python. A run-time error will reveal the problem when we try to collect mutable objects into a set.

Python objects of almost any class can be described using YAML local tags. Any class with a simple __init__() method can be built from a YAML serialization.

Here's a small class definition:

class Card:
    def __init__(self, rank: int, suit: str) -> None:
        self.rank = rank
        self.suit = suit
    def __repr__(self) -> str:
        return f"{self.rank} {self.suit}"

We've defined a class with two positional attributes. Here's the YAML serialization of an instance of this class:

!!python/object/apply:Chapter_13.ch13_r02.Card
kwds:
    rank: 7
    suit: ♠

We've used the kwds key to provide two keyword-based argument values to the Card constructor function. The Unicode character works well because YAML files are text written using UTF-8 encoding.
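A sketch of loading this document might look like the following. It needs the UnsafeLoader because an arbitrary Python object is being constructed, and the Chapter_13.ch13_r02 module named in the tag must be importable:

import yaml

card_text = """
!!python/object/apply:Chapter_13.ch13_r02.Card
kwds:
    rank: 7
    suit: ♠
"""
card = yaml.load(card_text, Loader=yaml.UnsafeLoader)
# repr(card) == '7 ♠'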

See also

  • See the Finding configuration files recipe earlier in this chapter to see how to search multiple filesystem locations for a configuration file. We can easily have application defaults, system-wide settings, and personal settings built into separate files and combined by an application.

Using Python for configuration files

Python offers a variety of ways to package application inputs and configuration files. We'll look at writing files in Python notation because it's elegant and simple.

A number of packages use assignment statements in a separate module to provide configuration parameters. The Flask project, in particular, supports this. We looked at Flask in the Using the Flask framework for RESTful APIs recipe and a number of related recipes in Chapter 12, Web Services.

In this recipe, we'll look at how we can represent configuration details in Python notation.

Getting ready

Python assignment statements are particularly elegant. The syntax can be simple, easy to read, and extremely flexible. If we use assignment statements, we can import an application's configuration details from a separate module. This could have a name like settings.py to show that it's focused on configuration parameters.

Because Python treats each imported module as a global Singleton object, we can have several parts of an application all use the import settings statement to get a consistent view of the current, global application configuration parameters.

For some applications, we might want to choose one of several alternative settings files. In this case, we want to load a file using a technique that's more flexible than the fixed import statement.

We'd like to be able to provide definitions in a text file that look like this:

"""Weather forecast for Offshore including the Bahamas
"""
query = {'mz': ['ANZ532', 'AMZ117', 'AMZ080']}
url = {
  'scheme': 'http',
  'netloc': 'forecast.weather.gov',
  'path': '/shmrn.php'
}

This is Python syntax. The parameters include two variables, query and url. The value of the query variable is a dictionary with a single key, mz, and a sequence of values.

This can be seen as a specification for a number of related URLs that are all similar to http://forecast.weather.gov/shmrn.php?mz=ANZ532.

We'll often use the Finding configuration files recipe to check a variety of locations for a given configuration file. This flexibility is often essential for creating an application that's easily used on a variety of platforms.

In this recipe, we'll build the missing part of the first recipe, the load_config_file() function. Here's the template that needs to be filled in:

def load_config_file(config_path: Path) -> Dict[str, Any]:
    """Loads a configuration mapping object with contents
    of a given file.
    :param config_path: Path to be read.
    :returns: mapping with configuration parameter values
    """
    # Details omitted.

In this recipe, we'll fill in the space held by the Details omitted line to load configuration files in Python format.

How to do it...

We can make use of the pathlib module to locate the files. We'll leverage the built-in compile() and exec() functions to process the code in the configuration file:

  1. Import the Path definition and the type hints required by the load_config_file() function definition:
    from pathlib import Path
    from typing import Dict, Any
    
  2. Use the built-in compile() function to compile the Python module into an executable form. This function requires the source text as well as the filename from which the text was read. The filename is essential for creating trace-back messages that are useful and correct:
    def load_config_file(config_path: Path) -> Dict[str, Any]:
        code = compile(
            config_path.read_text(),
            config_path.name,
            "exec")
    

    In rare cases where the code doesn't come from a file, the general practice is to provide a name such as <string> for the filename.

  3. Execute the code object created by the compile() function. This requires two contexts. The global context provides any previously imported modules, plus the __builtins__ module. The local context is the locals dictionary; this is where new variables will be created:
        locals: Dict[str, Any] = {}
        exec(code, {"__builtins__": __builtins__}, locals)
        return locals
    

How it works...

The details of the Python language–the syntax and semantics–are embodied in the built-in compile() and exec() functions. When we launch a Python application or script, the process is essentially this:

  1. Read the text. Compile it with the compile() function to create a code object.
  2. Use the exec() function to execute the code object.

The __pycache__ directory holds code objects, and saves the work of recompiling text files that haven't changed.

The exec() function reflects the way Python handles global and local variables. There are two namespaces (mappings) provided to this function. These are visible to a script that's running via the globals() and locals() functions.

When code is executed at the very top level of a script file—often inside the if __name__ == "__main__" condition—it executes in the global context; the globals and locals variable collections are the same. When code is executed inside a function, method, or class definition, the local variables for that context are separate from the global variables.

Here, we've created a separate locals object. This makes sure the imported statements don't make unexpected changes to any other global variables.

We provided two distinct dictionaries:

  • A dictionary of global objects. The most common use is to provide access to the imported modules, which are always global. The __builtins__ module is often provided in this dictionary. In some cases, other modules like pathlib should be added.
  • The dictionary provided for the locals is updated by each assignment statement. This local dictionary allows us to capture the variables created within the settings module.

The locals dictionary will be updated by the exec() function. We don't expect the globals to be updated and will ignore any changes that happen to this collection.
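Here's a usage sketch, assuming a settings.py file containing the assignments shown in the Getting ready section of this recipe:

from pathlib import Path

settings = load_config_file(Path("settings.py"))
# settings['url']['netloc'] == 'forecast.weather.gov'
# settings['query']['mz'] == ['ANZ532', 'AMZ117', 'AMZ080']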

There's more...

This recipe suggests a configuration file is entirely a sequence of name = value assignments. The assignment statement is in Python syntax, as are the variable names and the literal syntax. This permits Python's large collection of built-in types.

Additionally, the full spectrum of Python statements is available. This leads to some engineering trade-offs.

Because any statement can be used in the configuration file, it can lead to complexity. If the processing in the configuration file becomes too complex, the file ceases to be configuration and becomes a first-class part of the application. Very complex features should be implemented by modifying the application programming, not by hacking around with the configuration settings. Since Python applications include the full source, it is generally easier to fix the source than to create hyper-complex configuration files. The goal is for a configuration file to provide values that tailor operations, not to provide plug-in functionality.

We might want to include the OS environment variables as part of the global variables used for configuration. This ensures that the configuration values match the current environment settings. This can be done with the os.environ mapping.

It can also be sensible to do some processing simply to make a number of related settings easier to organize. For example, it can be helpful to write a configuration file with a number of related paths like this:

"""Config with related paths"""
if environ.get("APP_ENV", "production") == "production":
    base = Path('/var/app/')
else:
    base = Path.cwd() / "var"
log = base/'log'
out = base/'out'

The values of log and out are used by the application. The value of base is only used to ensure that the other two paths share a common parent directory.

This leads to the following variation on the load_config_file() function shown earlier. This version includes some additional modules and global classes:

from pathlib import Path
import platform
import os
from typing import Any, Dict, cast
def load_config_file_xtra(config_path: Path) -> Dict[str, Any]:
    def not_allowed(*arg, **kw) -> None:
        raise RuntimeError("Operation not allowed")
    code = compile(
        config_path.read_text(),
        config_path.name,
        "exec")
    safe_builtins = cast(Dict[str, Any], __builtins__).copy()
    for name in ("eval", "exec", "compile", "__import__"):
        safe_builtins[name] = not_allowed
    globals = {
        "__builtins__": __builtins__,
        "Path": Path,
        "platform": platform,
        "environ": os.environ.copy()
    }
    locals: Dict[str, Any] = {}
    exec(code, globals, locals)
    return locals

Including Path, platform, and a copy of os.environ in the globals means that a configuration file can be written without the overhead of import statements. This can make the settings simpler to prepare and maintain.

We've also removed four built-in functions: eval(), exec(), compile(), and __import__(). This will reduce the number of things a Python-language configuration file is capable of doing. This involves some fooling around inside the __builtins__ collection. This module behaves like a dictionary, but the type is not simply Dict[str, Any]. We've used the cast() function to tell mypy that the __builtins__.copy() method will work even though it's not obviously part of the module's type.
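Here's a sketch of the effect, using the load_config_file_xtra() function above; a configuration file that calls one of the removed built-ins raises the RuntimeError from not_allowed() (the file name and contents are illustrative):

from pathlib import Path
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    risky = Path(tmp) / "settings.py"
    risky.write_text("secret = eval('6 * 7')\n", encoding="utf-8")
    try:
        load_config_file_xtra(risky)
    except RuntimeError as error:
        print(error)  # Operation not allowed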

See also

  • See the Finding configuration files recipe earlier in this chapter to learn how to search multiple filesystem locations for a configuration file.

Using class-as-namespace for configuration

Python offers a variety of ways to package application inputs and configuration files. We'll continue to look at writing files in Python notation because it's elegant and the familiar syntax can lead to easy-to-read configuration files.

A number of projects allow us to use a class definition to provide configuration parameters. The use of a class hierarchy means that inheritance techniques can be used to simplify the organization of parameters. The Flask package, in particular, can do this. We looked at Flask in the Using the Flask framework for RESTful APIs recipe, and a number of related recipes.

In this recipe, we'll look at how we can represent configuration details in Python class notation.

Getting ready

Python notation for defining the attributes of a class can be simple, easy to read, and reasonably flexible. We can, with a little work, define a sophisticated configuration language that allows someone to change configuration parameters for a Python application quickly and reliably.

We can base this language on class definitions. This allows us to package a number of configuration alternatives in a single module. An application can load the module and pick the relevant class definition from the module.

We'd like to be able to provide definitions that look like this:

class Configuration:
    """
    Generic Configuration
    """
    url = {
        "scheme": "http", 
        "netloc": "forecast.weather.gov", 
        "path": "/shmrn.php"}
    query = {"mz": ["ANZ532"]}

We can create this class definition in a settings.py file to create a settings module. To use the configuration, the main application could do this:

from settings import Configuration

The application will gather the settings using the fixed module name of settings with a fixed class name of Configuration. We have two ways to add flexibility to using a module as a configuration file:

  • We can use the PYTHONPATH environment variable to list a number of locations for configuration modules
  • We can use multiple inheritance and mix in class definitions to combine defaults, system-wide settings, and localized settings into a configuration class definition

These techniques can be helpful because the configuration file locations follow Python's rules for finding modules. Rather than implementing our own search for the configuration, we can leverage Python's search of sys.path.

In this recipe, we'll build the missing part of the previous example, the load_config_file() function. Here's the template that needs to be filled in:

def load_config_file(
        config_path: Path, classname: str = "Configuration"
    ) -> Dict[str, Any]:
    """Loads a configuration mapping object with contents
    of a given file.
    :param config_path: Path to be read.
    :returns: mapping with configuration parameter values
    """
    # Details omitted.

We've used a similar template in a number of recipes in this chapter. For this recipe, we've added a parameter to this definition. The classname parameter is not present in previous recipes, but it is used here to select one of the many classes from a module at the location in the filesystem named by the config_path parameter.

How to do it...

We can make use of the pathlib module to locate the files. We'll leverage the built-in compile() and exec() functions to process the code in the configuration file. The result is not a dictionary, and isn't compatible with previous ChainMap-based configurations:

  1. Import the Path definition and the type hints required by the load_config_file() function definition:
    from pathlib import Path
    import platform
    from typing import Dict, Any, Type
    
  2. Since the point of this configuration is to return a class, we'll provide a type hint for any class definition:
    ConfigClass = Type[object]
    
  3. Use the built-in compile() function to compile the Python module into an executable form. This function requires the source text as well as a filename from which the text was read. The filename is essential for creating trace-back messages that are useful and correct:
    def load_config_file(
            config_path: Path, classname: str = "Configuration"
        ) -> ConfigClass:
        code = compile(
            config_path.read_text(), 
            config_path.name, 
            "exec")
    
  4. Execute the code object created by the compile() method. We need to provide two contexts. The global context can provide the __builtins__ module, plus the Path class and the platform module. The local context is where new variables will be created:
        globals = {
            "__builtins__": __builtins__, 
            "Path": Path, 
            "platform": platform}
        locals: Dict[str, ConfigClass] = {}
        exec(code, globals, locals)
        return locals[classname]
    

This locates the named class in the locals mapping. This mapping will have all the local variables set when the module was executed; these local variables will include all class and function definitions in addition to assigned variables. The value of locals[classname] will be the named class in the definitions created by the module that was executed.
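Here's a usage sketch, assuming a settings.py file containing the Configuration class shown in the Getting ready section:

from pathlib import Path

Configuration = load_config_file(Path("settings.py"), "Configuration")
# Configuration.url['netloc'] == 'forecast.weather.gov'
# Configuration.query['mz'] == ['ANZ532']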

How it works...

The details of the Python language—syntax and semantics—are embodied in the compile() and exec() functions. The exec() function reflects the way Python handles global and local variables. There are two namespaces provided to this function. The global namespace instance includes __builtins__ plus a class and module that might be used in the file.

The local variable namespace will have the new class created in it. Because this namespace is an ordinary dictionary, we can extract the class by name using locals[classname]. The function returns the class object for use throughout the application.

We can put any kind of object into the attributes of a class. Our example showed mapping objects. There's no limitation on what can be done when creating attributes at the class level.

We can have complex calculations within the class statement. We can use this to create attributes that are derived from other attributes. We can execute any kind of statement, including if statements and for statements, to create attribute values.

We will not, however, ever create an instance of the class. Ordinary methods of the class will not be used. If a function-like definition is helpful, it would have to be decorated with @classmethod to be useful.
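Here's a small sketch of both ideas; the attribute and method names are illustrative, not part of the recipe's settings module:

from pathlib import Path

class PathConfiguration:
    base = Path("/var/app")
    log = base / "log"  # derived from another class-level attribute
    out = base / "out"

    @classmethod
    def url_for(cls, zone: str) -> str:
        return f"http://forecast.weather.gov/shmrn.php?mz={zone}"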

There's more...

Using a class definition means that we can leverage inheritance to organize the configuration values. We can easily create multiple subclasses of Configuration, one of which will be selected for use in the application. The configuration might look like this:

class Configuration:
    """
    Generic Configuration
    """
    url = {
        "scheme": "http", 
        "netloc": "forecast.weather.gov", 
        "path": "/shmrn.php"}
class Bahamas(Configuration):
    """
    Weather forecast for Offshore including the Bahamas
    """
    query = {"mz": ["AMZ117", "AMZ080"]}
class Chesapeake(Configuration):
    """
    Weather for Chesapeake Bay
    """
    query = {"mz": ["ANZ532"]}

This means that our application must choose an appropriate class from the available classes in the settings module. We might use an OS environment variable or a command-line option to specify the class name to use. The idea is that our program can be executed like this:

python3 some_app.py -c settings.Chesapeake

This would locate the Chesapeake class in the settings module. Processing would then be based on the details in that particular configuration class. This idea leads to an extension of the configuration-loading design: a load_config_module() function.

In order to pick one of the available classes, this function accepts a single dotted name that combines the module name and the class name:

import importlib 
def load_config_module(name: str) -> ConfigClass:
    module_name, _, class_name = name.rpartition(".")
    settings_module = importlib.import_module(module_name)
    result: ConfigClass = vars(settings_module)[class_name]
    return result

Rather than manually compiling and executing the module, we've used the higher-level importlib module. This module implements the import statement semantics. The requested module is imported; compiled and executed; and the resulting module object is assigned to the variable named settings_module.

We can then look inside the module's variables and pick out the class that was requested. The vars() built-in function will extract the internal dictionary from a module, a class, or even the local variables.

Now we can use this function as follows:

>>> configuration = Chapter_13.ch13_r04.load_config_module(
...    'Chapter_13.settings.Chesapeake')
>>> configuration.__doc__.strip() 
'Weather for Chesapeake Bay' 
>>> configuration.query 
{'mz': ['ANZ532']} 
>>> configuration.url['netloc'] 
'forecast.weather.gov' 

We've located the Chesapeake configuration class in the settings module and extracted the various settings the application needs from this class.

Configuration representation

One consequence of using a class like this is that the default display isn't very informative. When we try to print the configuration, it looks like this:

>>> print(configuration) 
<class 'settings.Chesapeake'> 

This isn't very helpful. It provides one nugget of information, but that's not nearly enough for debugging.

We can use the vars() function to see more details. However, this shows local variables, not inherited variables:

>>> pprint(vars(configuration))
mappingproxy({'__doc__': '\n    Weather for Chesapeake Bay\n    ',
              '__module__': 'Chapter_13.settings',
              'query': {'mz': ['ANZ532']}})

This is a little better, but it remains incomplete.

In order to see all of the settings, we need something a little more sophisticated. Interestingly, we can't simply define __repr__() for a class. A method defined in a class is used by the instances of this class, not the class itself.

Each class object we create is an instance of the built-in type class. We can, using a meta-class, tweak the way the type class behaves, and implement a slightly nicer __repr__() method, which looks through all parent classes for attributes.

We'll extend the built-in type with a __repr__ that does a somewhat better job at displaying the working configuration:

class ConfigMetaclass(type):
    """Displays a subclass with superclass values injected"""
    def __repr__(self) -> str:
        name = (
            super().__name__
            + "("
            + ", ".join(b.__name__ for b in super().__bases__)
            + ")"
        )
        base_values = {
            n: v
            for base in reversed(super().__mro__)
            for n, v in vars(base).items()
            if not n.startswith("_")
        }
        values_text = [f"class {name}:"] + [
            f"    {name} = {value!r}"
            for name, value in base_values.items()
        ]
        return "
".join(values_text)

The class name is available from the superclass, type, as the __name__ attribute. The names of the base classes are included as well, to show the inheritance hierarchy for this configuration class.

The base_values are built from the attributes of all of the base classes. Each class is examined in reverse Method Resolution Order (MRO). Loading all of the attribute values in reverse MRO means that all of the defaults are loaded first. These values are then overridden with subclass values.

Names with the _ prefix are quietly ignored. This emphasizes the conventional practice of treating these as implementation details that aren't part of a public interface. This kind of name shouldn't really be used for a configuration file.

The resulting values are used to create a text representation that resembles a class definition. This does not recreate the original class source code; it's the net effect of the original class definition and all the superclass definitions.

Here's a Configuration class hierarchy that uses this metaclass. The base class, Configuration, incorporates the metaclass, and provides default definitions. The subclass extends those definitions with values that are unique to a particular environment or context:

class Configuration(metaclass=ConfigMetaclass):
    unchanged = "default"
    override = "default"
    feature_x_override = "default"
    feature_x = "disabled"
class Customized(Configuration):
    override = "customized"
    feature_x_override = "x-customized"

This is the kind of output our meta-class provides:

>>> print(Customized)
class Customized(Configuration):
    unchanged = 'default'
    override = 'customized'
    feature_x_override = 'x-customized'
    feature_x = 'disabled'

The output here can make it a little easier to see how the subclass attributes override the superclass defaults. This can help to clarify the resulting configuration used by an application.

We can leverage all of the power of Python's multiple inheritance to build Configuration class definitions. This can provide the ability to combine details on separate features into a single configuration object.
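For example, a sketch of feature-specific mixin classes (the class names are illustrative) might look like this:

class FeatureXEnabled(Configuration):
    feature_x = "enabled"

class Production(Customized, FeatureXEnabled):
    """Customized values plus feature X turned on."""

# print(Production) will show feature_x = 'enabled' and override = 'customized'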

See also

  • We'll look at class definitions in Chapter 7, Basics of Classes and Objects, and Chapter 8, More Advanced Class Design.

Designing scripts for composition

Many large applications are amalgamations of multiple smaller applications. In enterprise terminology, they are often called application systems comprising individual command-line application programs.

Some large, complex applications include a number of commands. For example, the Git application has numerous individual commands, such as git pull, git commit, and git push. These can also be seen as separate applications that are part of the overall Git system of applications.

An application might start as a collection of separate Python script files. At some point during its evolution, it can become necessary to refactor the scripts to combine features and create new, composite scripts from older disjoint scripts. The other path is also possible: a large application might be decomposed and refactored into a new organization of smaller components.

In this recipe, we'll look at ways to design a script so that future combinations and refactoring are made as simple as possible.

Getting ready

We need to distinguish between several aspects of a Python script.

We've seen several aspects of gathering input:

  • Getting highly dynamic input from a command-line interface and environment variables. See the Using argparse to get command-line input recipe in Chapter 6, User Inputs and Outputs.
  • Getting slower-changing configuration options from files. See the Finding configuration files, Using YAML for configuration files, and Using Python for configuration files recipes.
  • For reading any input file, see the Reading delimited files with the CSV module, Reading complex formats using regular expressions, Reading JSON documents, Reading XML documents, and Reading HTML documents recipes in Chapter 10, Input/Output, Physical Format, and Logical Layout.

There are several aspects to producing output:

  • Creating logs and offering other features that support audit, control, and monitoring. We'll look at some of this in the Using logging for control and audit output recipe.
  • Creating the main output of the application. This might be printed or written to an output file using some of the same library modules used to parse inputs.

And finally, there's the real work of the application. This is made up of the essential features disentangled from the various input parsing and output formatting considerations. The real work is an algorithm working exclusively with Python data structures.

This separation of concerns suggests that an application, no matter how simple, should be designed as several separate functions. These should then be combined into the complete script. This lets us separate the input and output from the core processing. The processing is the part we'll often want to reuse. The input and output formats should be easy to change.

As a concrete example, we'll look at an application that creates sequences of dice rolls. Each sequence will follow the rules of the game of Craps. Here are the rules:

  1. The first roll of two dice is the come out roll:
    • A roll of 2, 3, or 12 is an immediate loss. The sequence has a single value, for example, [(1, 1)].
    • A roll of 7 or 11 is an immediate win. This sequence also has a single value, for example, [(3, 4)].
  2. Any other number establishes a point. The sequence starts with the come out roll that established the point and continues until either a 7 or the point value is rolled:
    • A final 7 is a loss, for example, [(3, 1), (3, 2), (1, 1), (5, 6), (4, 3)].
    • A final match of the original point value is a win. There will be a minimum of two rolls. There's no upper bound on the length of a game, for example, [(3, 1), (3, 2), (1, 1), (5, 6), (1, 3)].

The output is a sequence of items. Each item has a different structure. Some will be short lists. Some will be long lists. This is an ideal place for using YAML format files.

This output can be controlled by two inputs—how many sample sequences to create, and whether or not to seed the random number generator. For testing purposes, it can help to have a fixed seed.

How to do it...

This recipe will involve a fair number of design decisions. We'll start by considering the different kinds of output. Then we'll refactor the application around the kinds of output and the different purposes for the output:

  1. Separate the output display into two broad areas:
    • Functions (or classes) that do no processing but display result objects. In this example, this is the sequence of throws for each individual game.
    • Logging used for monitoring and control, as well as audit or debugging. This is a cross-cutting concern that will be embedded throughout an application.

      The sequence of rolls needs to be written to a file. This suggests that the write_rolls() function is given an iterator as a parameter. Here's a function that iterates and dumps values to a file in YAML notation:

      def write_rolls(
              output_path: Path,
              game_iterator: Iterable[Game_Summary]
          ) -> Counter[int]:
          face_count: Counter[int] = collections.Counter()
          with output_path.open("w") as output_file:
              for game_outcome in game_iterator:
                  output_file.write(
                      yaml.dump(
                          game_outcome, 
                          default_flow_style=True, 
                          explicit_start=True
                      )
                  )
                  for roll in game_outcome:
                      face_count[sum(roll)] += 1
          return face_count
      
  2. The monitoring and control output should display the input parameters used to control the processing. It should also provide the counts that show that the dice were fair. As a general practice, this kind of extra control information, separate from the primary output, is often written to standard error:
    def summarize(
            configuration: argparse.Namespace, 
            counts: Counter[int]
        ) -> None:
        print(configuration, file=sys.stderr)
        print(counts, file=sys.stderr)
    
  3. Design (or refactor) the essential processing of the application to look like a single function:
    • All inputs are parameters.
    • All outputs are produced by return or yield. Use return to create a single result. Use yield to generate each item of an iterator that will produce multiple results.

      In this example, we can easily make the core feature a generator function that iterates over the interesting values. This generator function relies on a craps_game() function to create each sample; each sample is a full game, showing all of the dice rolls. The write_rolls() function shown earlier accumulates totals from these games in its face_count counter to confirm that everything worked properly.

      def roll_iter(
              total_games: int,
              seed: Optional[int] = None
          ) -> Iterator[Game_Summary]:
          random.seed(seed)
          for i in range(total_games):
              sequence = craps_game()
              yield sequence
      
  4. The craps_game() function implements the Craps game rules to emit a single sequence of one or more rolls. This comprises all the rolls in a single game. We'll look at this craps_game() function later.
  5. Refactor all of the input gathering into a function (or class) that gathers the various input sources. This can include environment variables, command-line arguments, and configuration files. It may also include the names of multiple input files. This function gathers command-line arguments. It also checks the os.environ collection of environment variables:
    def get_options(
            argv: List[str] = sys.argv[1:]
        ) -> argparse.Namespace:
    
  6. The argument parser will handle the details of parsing the --samples and --output options. We can leverage additional features of argparse to better validate the argument values; a sketch of one such validation appears after these steps:
        parser = argparse.ArgumentParser()
        parser.add_argument("-s", "--samples", type=int)
        parser.add_argument("-o", "--output")
        options = parser.parse_args(argv)
        if options.output is None:
            parser.error("No output file specified")
    
  7. The value of output_path is created from the value of the --output option. Similarly, the value of the RANDOMSEED environment variable is validated and placed into the options namespace. This use of the options object keeps all of the various arguments in one place:
        options.output_path = Path(options.output)
        if "RANDOMSEED" in os.environ:
            seed_text = os.environ["RANDOMSEED"]
            try:
                options.seed = int(seed_text)
            except ValueError:
                parser.error(
                    f"RANDOMSEED={seed_text!r} invalid seed")
        else:
            options.seed = None
        return options
    
  8. Write the overall main() function, which incorporates the three previous elements, to create the final, overall script:
    def main() -> None:
        options = get_options(sys.argv[1:])
        face_count = write_rolls(
            options.output_path, 
            roll_iter(
                options.samples, options.seed
            )
        )
        summarize(options, face_count)
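
The script is meant to be run from the command line, but automated testing is easier if importing the module doesn't start any processing. A customary final touch, shown here as a minimal sketch, is the standard __main__ guard:

if __name__ == "__main__":
    main()

With this guard in place, a test module can import get_options(), roll_iter(), and write_rolls() and exercise them directly, without running the whole script.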
    

The main() function brings the various aspects of the application together. It parses the command-line and environment options.

The roll_iter() function is the core processing. It takes the various options, and it emits a sequence of rolls.

The primary output from the roll_iter() method is collected by write_rolls() and written to the given output path. Additional control output is written by a separate function, summarize(), so that we can change the summary without an impact on the primary output.
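
Step 6 noted that argparse offers additional validation features. One possible approach, sketched here with a hypothetical positive_int() helper, is to pass a validating function as the type= argument so that unusable values are rejected before any processing begins:

import argparse

def positive_int(text: str) -> int:
    """Validate that a --samples value is a positive integer."""
    value = int(text)  # raises ValueError for non-numeric text
    if value < 1:
        raise argparse.ArgumentTypeError(f"{text!r} is not a positive count")
    return value

parser = argparse.ArgumentParser()
parser.add_argument("-s", "--samples", type=positive_int, default=1_000)

argparse reports a clean usage error for either a ValueError or an ArgumentTypeError raised by the type= function.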

How it works...

The central premise here is the separation of concerns. There are three distinct aspects to the processing:

  • Inputs: The parameters from the command-line and environment variables are gathered by a single function, get_options(). This function can grab inputs from a variety of sources, including configuration files.
  • Outputs: The primary output was handled by the write_rolls() function. The other control output was handled by accumulating totals in a Counter object and then dumping this output at the end.
  • Process: The application's essential processing is factored into the roll_iter() function. This function can be reused in a variety of contexts.

The goal of this design is to separate the roll_iter() function from the surrounding application details.
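
Because roll_iter() has no input or output responsibilities of its own, it can be reused directly. Here's a small sketch, not part of the recipe's script, that consumes the generator entirely in memory with no YAML file involved:

# Find the longest of 100 reproducible games without writing anything to disk.
longest_game = max(roll_iter(total_games=100, seed=42), key=len)
print(f"Longest game took {len(longest_game)} rolls")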

The output from this application looks like the following example:

slott$ python Chapter_13/ch13_r05.py --samples 10 --output=x.yaml
Namespace(output='x.yaml', output_path=PosixPath('x.yaml'), samples=10, seed=None)
Counter({5: 7, 6: 7, 7: 7, 8: 5, 4: 4, 9: 4, 11: 3, 10: 1, 12: 1})

The command line requested ten samples and specified an output file of x.yaml. The control output is a simple dump of the options. It shows the values for the parameters plus the additional values set in the options object.

The control output includes the counts from ten samples. This provides some confidence that values such as 6, 7, and 8 occur more often. It shows that values such as 3 and 12 occur less frequently.

The output file, x.yaml, might look like this:

slott$ more x.yaml
--- [[5, 4], [3, 4]]
--- [[3, 5], [1, 3], [1, 4], [5, 3]]
--- [[3, 2], [2, 4], [6, 5], [1, 6]]
--- [[2, 4], [3, 6], [5, 2]]
--- [[1, 6]]
--- [[1, 3], [4, 1], [1, 4], [5, 6], [6, 5], [1, 5], [2, 6], [3, 4]]
--- [[3, 3], [3, 4]]
--- [[3, 5], [4, 1], [4, 2], [3, 1], [1, 4], [2, 3], [2, 6]]
--- [[2, 2], [1, 5], [5, 5], [1, 5], [6, 6], [4, 3]]
--- [[4, 5], [6, 3]]

Consider the larger context for this kind of simulation. There might be one or more analytical applications to make use of the simulation output. These applications could perform some statistical analyses on the sequences of rolls.

After using these two applications to create rolls and summarize them, the users may determine that it would be advantageous to combine the roll creation and the statistical overview into a single application. Because the various aspects of each application have been separated, we can rearrange the features and create a new application.

We can now build a new application that will start with the following two imports to bring in the useful functions from the existing applications:

from generator import roll_iter, craps_game
from stats_overview import summarize 

Ideally, a new application can be built without any changes to the other two applications. This leaves the original suite of applications untouched by the introduction of new features.

More importantly, the new application did not involve any copying or pasting of code. The new application imports working software. Any changes made to fix one application will also fix latent bugs in other applications.

Reuse via copy and paste creates technical debt. Avoid copying and pasting the code.

When we try to copy code from one application and paste it into a new application, we create a confusing situation. Any changes made to one copy won't magically fix latent bugs in the other copy. When changes are made to one copy, and the other copy is not kept up to date, this is an example of code rot.

There's more...

In the previous section, we skipped over the details of the craps_game() function. This function creates a sequence of dice rolls that comprise a single game of Craps. It can vary from a single roll to a sequence of indefinite length. About 98% of the games will consist of thirteen or fewer throws of the dice.

The rules depend on the total of two dice. The data captured include the two separate faces of the dice. In order to support these details, it's helpful to have a NamedTuple subclass that holds these two related values:

class Roll(NamedTuple):
    faces: List[int]
    total: int

def roll(n: int = 2) -> Roll:
    faces = list(random.randint(1, 6) for _ in range(n))
    total = sum(faces)
    return Roll(faces, total)

This roll() function creates a Roll instance with a sequence that shows the faces of the dice, as well as the total of the dice. The craps_game() function will generate enough Roll objects to be one complete game:

Game_Summary = List[List[int]]
def craps_game() -> Game_Summary:
    """Summarize the game as a list of dice pairs."""
    come_out = roll()
    if come_out.total in [2, 3, 12]:
        return [come_out.faces]
    elif come_out.total in [7, 11]:
        return [come_out.faces]
    elif come_out.total in [4, 5, 6, 8, 9, 10]:
        sequence = [come_out.faces]
        next = roll()
        while next.total not in [7, come_out.total]:
            sequence.append(next.faces)
            next = roll()
        sequence.append(next.faces)
        return sequence
    else:
        raise Exception(f"Horrifying Logic Bug in {come_out}")

The craps_game() function implements the rules for Craps. If the first roll is 2, 3, or 12, the sequence only has a single value, and the game is a loss. If the first roll is 7 or 11, the sequence also has only a single value, and the game is a win. The remaining values establish a point. The sequence of rolls starts with the come out roll that established the point and continues until it's ended by a 7 or the point value.

The horrifying logic bug exception represents a way to detect a design problem. The if statement conditions are quite complex. As we noted in the Designing complex if...elif chains recipe in Chapter 2, Statements and Syntax, we need to be absolutely sure the if and elif conditions are complete. If we've designed them incorrectly, the else statement should alert us to the failure to correctly design the conditions.

Refactoring a script to a class

The close relationship between the roll_iter(), roll(), and craps_game() functions suggests that it might be better to encapsulate them into a single class definition. Here's a class that has all of these features bundled together:

class CrapsSimulator:
    def __init__(self, /, seed: Optional[int] = None) -> None:
        self.rng = random.Random(seed)
        self.faces: List[int]
        self.total: int
    def roll(self, n: int = 2) -> int:
        self.faces = list(
            self.rng.randint(1, 6) for _ in range(n))
        self.total = sum(self.faces)
        return self.total
    def craps_game(self) -> List[List[int]]:
        self.roll()
        if self.total in [2, 3, 12]:
            return [self.faces]
        elif self.total in [7, 11]:
            return [self.faces]
        elif self.total in [4, 5, 6, 8, 9, 10]:
            point, sequence = self.total, [self.faces]
            self.roll()
            while self.total not in [7, point]:
                sequence.append(self.faces)
                self.roll()
            sequence.append(self.faces)
            return sequence
        else:
            raise Exception("Horrifying Logic Bug")
    def roll_iter(
            self, total_games: int) -> Iterator[List[List[int]]]:
        for i in range(total_games):
            sequence = self.craps_game()
            yield sequence

This class initializes the simulator with its own random number generator. It will either use the given seed value, or the random.Random internal algorithm will pick a seed.

The roll() method will set the self.total and self.faces instance variables. There's no clear benefit to having the roll() method return a value and also cache the current value of the dice in the self.total attribute. Eliminating self.total is left as an exercise for the reader.

The craps_game() method generates one sequence of rolls for one game of Craps. It uses the roll() method and the two instance variables, self.total and self.faces, to track the state of the dice.

The roll_iter() method generates the sequence of games. Note that the signature of this method is not exactly like the preceding roll_iter() function. This class separates random number seeding from the game creation algorithm.
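
Here's a short usage sketch showing the benefit of the per-instance random number generator: two simulators built with the same seed produce identical sequences of games, which is convenient for testing:

sim_a = CrapsSimulator(seed=42)
sim_b = CrapsSimulator(seed=42)
# Each simulator owns its own random.Random instance, so the games match.
assert list(sim_a.roll_iter(5)) == list(sim_b.roll_iter(5))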

Rewriting the main() function to use the CrapsSimulator class is left as an exercise for the reader. Since the method names are similar to the original function names, the refactoring should not be terribly complex.

See also

  • See the Using argparse to get command-line input recipe in Chapter 6, User Inputs and Outputs, for background on using argparse to get inputs from a user.
  • See the Finding configuration files recipe earlier in this chapter for a way to track down configuration files.
  • The Using logging for control and audit output recipe later in this chapter looks at logging.
  • In the Combining two applications into one recipe, in Chapter 14, Application Integration: Combination, we'll look at ways to combine applications that follow this design pattern.

Using logging for control and audit output

In the Designing scripts for composition recipe earlier in this chapter, we examined three aspects of an application:

  • Gathering input
  • Producing output
  • The essential processing that connects input and output

There are several different kinds of output that applications produce:

  • The main output that helps a user make a decision or take action
  • Control information that confirms that the program worked completely and correctly
  • Audit summaries that are used to track the history of state changes in persistent databases
  • Any error messages that indicate why the application didn't work

It's less than optimal to lump all of these various aspects into print() requests that write to standard output. Indeed, it can lead to confusion because too many different outputs are interleaved in a single stream.

The OS provides each running process with two output files, standard output and standard error. These are visible in Python through the sys module with the names sys.stdout and sys.stderr. By default, the print() method writes to the sys.stdout file. We can change this and write the control, audit, and error messages to sys.stderr. This is an important step in the right direction.

Python also offers the logging package, which can be used to direct the ancillary output to a separate file (and/or other output channels, such as a database). It can also be used to format and filter that additional output.

In this recipe, we'll look at good ways to use the logging module.

Getting ready

In the Designing scripts for composition recipe, earlier in this chapter, we looked at an application that produced a YAML file with the raw output of a simulation in it. In this recipe, we'll look at an application that consumes that raw data and produces some statistical summaries. We'll call this application overview_stats.py.

Following the design pattern of separating the input, output, and processing, we'll have an application, main(), that looks something like this:

def main(argv: List[str] = sys.argv[1:]) -> None:
    options = get_options(argv)
    if options.output is not None:
        report_path = Path(options.output)
        with report_path.open("w") as result_file:
            process_all_files(result_file, options.file)
    else:
        process_all_files(sys.stdout, options.file)

This function will get the options from various sources. If an output file is named, it will create the output file using a with statement context manager. This function will then process all of the command-line argument files as input from which statistics are gathered.

If no output file name is provided, this function will write to the sys.stdout file. This will display output that can be redirected using the OS shell's > operator to create a file.

The main() function relies on a process_all_files() function. The process_all_files() function will iterate through each of the argument files and gather statistics from that file. Here's what that function looks like:

def process_all_files(
        result_file: TextIO,
        file_paths: Iterable[Path]
    ) -> None:
    for source_path in file_paths:
        with source_path.open() as source_file:
            game_iter = yaml.load_all(
                source_file,
                Loader=yaml.SafeLoader)
            statistics = gather_stats(game_iter)
            result_file.write(
                yaml.dump(
                    dict(statistics),
                    explicit_start=True))

The process_all_files() function applies gather_stats() to each file in the file_paths iterable. The resulting collection is written to the given result_file file.

The function shown here conflates processing and output in a design that is not ideal. We'll address this design flaw in the Combining two applications into one recipe.

The essential processing is in the gather_stats() function. Given an iterable source of games, it will summarize the outcome of each game. The resulting summary object can then be written as part of the overall display or, in this case, appended to a sequence of YAML-format summaries:

def gather_stats(
        game_iter: Iterable[List[List[int]]]
    ) -> Counter[Outcome]:
    counts: Counter[Outcome] = collections.Counter()
    for game in game_iter:
        if len(game) == 1 and sum(game[0]) in (2, 3, 12):
            outcome = "loss"
        elif len(game) == 1 and sum(game[0]) in (7, 11):
            outcome = "win"
        elif len(game) > 1 and sum(game[-1]) == 7:
            outcome = "loss"
        elif len(game) > 1 and sum(game[0]) == sum(game[-1]):
            outcome = "win"
        else:
            detail_log.error("problem with %r", game)
            raise Exception(
                f"Wait, What? "
                f"Inconsistent len {len(game)} and "
                f"final {sum(game[-1])} roll"
            )
        event = (outcome, len(game))
        counts[event] += 1
    return counts

This function determines which of the four game termination rules was applied to each sequence of dice rolls. The game_iter parameter provides the individual games; in this application, they come from process_all_files(), which uses the yaml.load_all() function to iterate through all of the YAML documents in a source file. Each document is a single game, represented as a sequence of dice pairs.

This function uses the first (and sometimes last) rolls to determine the overall outcome of the game. There are four rules, which should enumerate all possible logical combinations of events. In the event that there is an error in our reasoning, an exception will get raised to alert us to a special case that didn't fit the design in some way.

The game is reduced to a single event with an outcome and a length. These are accumulated into a Counter object. The outcome and length of a game are the two values we're computing. These are a stand-in for more complex or sophisticated statistical analyses that are possible.

We've carefully segregated almost all file-related considerations from this function. The gather_stats() function will work with any iterable source of game data.
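
As a quick illustration of that independence, here's a sketch that applies gather_stats() to a small, hand-built list of games (hypothetical test data), with no YAML parsing at all:

games = [
    [[1, 1]],          # come out roll of 2: an immediate loss
    [[3, 4]],          # come out roll of 7: an immediate win
    [[2, 2], [3, 4]],  # point of 4, then a 7: a loss after two rolls
]
counts = gather_stats(games)
# counts now holds one occurrence each of ('loss', 1), ('win', 1), and ('loss', 2).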

Here's the output from this application. It's not very pretty; it's a YAML document that can be used for further processing:

slott$ python Chapter_13/ch13_r06.py x.yaml
---
? !!python/tuple [loss, 2]
: 2
? !!python/tuple [loss, 3]
: 1
? !!python/tuple [loss, 4]
: 1
? !!python/tuple [loss, 6]
: 1
? !!python/tuple [loss, 8]
: 1
? !!python/tuple [win, 1]
: 1
? !!python/tuple [win, 2]
: 1
? !!python/tuple [win, 4]
: 1
? !!python/tuple [win, 7]
: 1

We'll need to insert logging features into all of these functions to show which file is being read, and any errors or problems with processing the file.

Furthermore, we're going to create two logs. One will have details, and the other will have a minimal summary of files that are created. The first log can go to sys.stderr, which will be displayed at the console when the program runs. The other log will be appended to a long-term log file to cover all uses of the application.

One approach to meeting these separate needs is to create two loggers, each with a different intent. The two loggers can have dramatically different configurations. Another approach is to create a single logger and use a Filter object to distinguish the content intended for each destination. We'll focus on creating separate loggers because it's easier to develop and easier to unit test.
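
If the single-logger approach were chosen instead, a logging.Filter subclass could do the separation. This is a hypothetical sketch, not the approach used in this recipe; the audit attribute on each log record would be supplied through the extra= argument when the message is written:

import logging

class AuditFilter(logging.Filter):
    """Pass only records explicitly flagged as audit output."""
    def filter(self, record: logging.LogRecord) -> bool:
        return getattr(record, "audit", False)

# A handler for the audit file would use handler.addFilter(AuditFilter()),
# and audit-worthy messages would be written like this:
# some_logger.info("wrote %r", report_path, extra={"audit": True})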

Each logger has a variety of methods reflecting the severity of the message. The severity levels defined in the logging package include the following:

  • DEBUG: These messages are not generally shown since their intent is to support debugging.
  • INFO: These messages provide information on the normal, happy-path processing.
  • WARNING: These messages indicate that processing may be compromised in some way. The most sensible use case for a warning is when functions or classes have been deprecated: they still work, but they should be replaced.
  • ERROR: Processing is invalid and the output is incorrect or incomplete. In the case of a long-running server, an individual request may have problems, but the server as a whole can continue to operate.
  • CRITICAL: A more severe level of error. Generally, this is used by long-running servers where the server itself can no longer operate and is about to crash.

The method names parallel the severity levels. We use a logger's info() method to write an INFO-level message, the error() method for an ERROR-level message, and so on.
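
As a minimal sketch with made-up messages, here is one call per severity level on a hypothetical logger:

import logging

demo_log = logging.getLogger("overview_stats.demo")
demo_log.debug("game %r -> event %r", [[3, 4]], ("win", 1))
demo_log.info("read %r", "data/x.yaml")
demo_log.warning("no RANDOMSEED set; results are not reproducible")
demo_log.error("problem with %r", [[1, 2, 3]])
demo_log.critical("cannot continue; shutting down")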

How to do it...

We'll be building a more complete application, leveraging components from previous examples. This will add use of the logging module:

  1. We'll start by implementing basic logging features into the existing functions. This means that we'll need the logging module, plus the other packages required by this app:
    import argparse
    import collections
    import logging
    from pathlib import Path
    import sys
    from typing import List, Iterable, Tuple, Counter, TextIO
    import yaml

    # The Outcome alias describes the (outcome, game length) events counted
    # by gather_stats(); this definition is assumed from the signatures below.
    Outcome = Tuple[str, int]
    
  2. We'll create two logger objects as module globals. Loggers have hierarchical names. We'll name the loggers using the application name and a suffix with the content. The overview_stats.detail logger will have processing details. The overview_stats.write logger will identify the files read and the files written; this parallels the idea of an audit log because the file writes track state changes in the collection of output files:
    detail_log = logging.getLogger("overview_stats.detail")
    write_log = logging.getLogger("overview_stats.write")
    

    We don't need to configure these loggers at this time. If we do nothing more, the two logger objects will silently accept individual log entries, but won't do anything further with the data.

  3. We'll rewrite the main() function to summarize the two aspects of the processing. This will use the write_log logger object to show when a new file is created. We've added the write_log.info() line to put an information message into the log for files that have been written:
    def main(argv: List[str] = sys.argv[1:]) -> None:
        options = get_options(argv)
        if options.output is not None:
            report_path = Path(options.output)
            with report_path.open("w") as result_file:
                process_all_files(result_file, options.file)
            write_log.info("wrote %r", report_path)
        else:
            process_all_files(sys.stdout, options.file)
    
  4. We'll rewrite the process_all_files() function to provide a note when a file is read. We've added the detail_log.info() line to put information messages in the detail log for every file that's read:
    def process_all_files(
            result_file: TextIO,
            file_paths: Iterable[Path]
        ) -> None:
        for source_path in file_paths:
            detail_log.info("read %r", source_path)
            with source_path.open() as source_file:
                game_iter = yaml.load_all(
                    source_file,
                    Loader=yaml.SafeLoader)
                statistics = gather_stats(game_iter)
                result_file.write(
                    yaml.dump(
                        dict(statistics),
                        explicit_start=True))
    

    The gather_stats() function can have a log line added to it to track normal operations. Additionally, we've added a log entry for the logic error. The detail_log logger is used to collect debugging information. If we set the overall logging level to include debug messages, we'll see this additional output:

    def gather_stats(
            game_iter: Iterable[List[List[int]]]
        ) -> Counter[Outcome]:
        counts: Counter[Outcome] = collections.Counter()
        for game in game_iter:
            if len(game) == 1 and sum(game[0]) in (2, 3, 12):
                outcome = "loss"
            elif len(game) == 1 and sum(game[0]) in (7, 11):
                outcome = "win"
            elif len(game) > 1 and sum(game[-1]) == 7:
                outcome = "loss"
            elif (len(game) > 1 
                  and sum(game[0]) == sum(game[-1])):
                outcome = "win"
            else:
                detail_log.error("problem with %r", game)
                raise Exception("Wait, What?")
            event = (outcome, len(game))
            detail_log.debug(
                "game %r -> event %r", game, event)
            counts[event] += 1
        return counts
    
  5. The get_options() function will also have a debugging line written. This can help diagnose problems by displaying the options in the log:
    def get_options(
            argv: List[str] = sys.argv[1:]
        ) -> argparse.Namespace:
        parser = argparse.ArgumentParser()
        parser.add_argument("file", nargs="*", type=Path)
        parser.add_argument("-o", "--output")
        options = parser.parse_args(argv)
        detail_log.debug("options: %r", options)
        return options
    
  6. We can add a basic configuration to see the log entries. This works as a first step to confirm that there are two loggers and they're being used properly:
    if __name__ == "__main__": 
        logging.basicConfig(stream=sys.stderr, level=logging.INFO)
        main()
        logging.shutdown()
    

This logging configuration builds the default handler object. This object simply prints all of the log messages on the given stream. This handler is assigned to the root logger; it will apply to all children of this logger. Therefore, both of the loggers created in the preceding code will go to the same stream.

Here's an example of running this script:

(cookbook) % python Chapter_13/ch13_r06.py -o data/sum.yaml data/x.yaml 
INFO:overview_stats.detail:read PosixPath('data/x.yaml')
INFO:overview_stats.write:wrote PosixPath('data/sum.yaml')

There are two lines in the log. Both have a severity of INFO. The first line is from the overview_stats.detail logger. The second line is from the overview_stats.write logger. The default configuration sends all loggers to sys.stderr so the logging output is kept separate from the main output of the application.

How it works...

There are three parts to introducing logging into an application:

  • Creating Logger objects
  • Placing log requests near important state changes
  • Configuring the logging system as a whole

Creating loggers can be done in a variety of ways. A common approach is to create one logger with the same name as the module:

logger = logging.getLogger(__name__) 

For the top-level, main script, this will have the name "__main__". For imported modules, the name will match the module name.

In more complex applications, there will be a variety of loggers serving a variety of purposes. In these cases, simply naming a logger after a module may not provide the required level of flexibility.

It's also possible to use the logging module itself as the root logger. This means a module can use the logging.info() method, for example. This isn't recommended because the root logger is anonymous, and we sacrifice the possibility of using the logger name as an important source of information.

There are two concepts that can be used to assign names to the loggers. It's often best to choose one of them and stick with it throughout a large application:

  • Follow the package and module hierarchy. This means that a logger specific to a class might have a name like package.module.class. Other classes in the same module would share a common parent logger name. It's then possible to set the logging level for the whole package, one of the specific modules, or just one of the classes.
  • Follow a hierarchy based on the audience or use case. The top-level name will distinguish the audience or purpose for the log. We might have top-level loggers with names such as event, audit, and perhaps debug. This way, all of the audit loggers will have names that start with "audit.". This can make it easy to route all loggers under a given parent to a specific handler.

In the recipe, we used the first style of naming. The logger names parallel the software architecture.
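
One practical consequence of this naming style is that a level set on a parent logger cascades to its children. Here's a minimal sketch, assuming the child loggers are left at their default NOTSET level:

import logging

# Enabling DEBUG on the parent also affects overview_stats.detail and
# overview_stats.write, because children inherit their effective level.
logging.getLogger("overview_stats").setLevel(logging.DEBUG)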

Placing logging requests near all the important state changes means we can decide which of the interesting state changes in an application belong in a log:

  • Any change to a persistent resource might be a good place to include a message of level INFO. This means any change to the OS state, for example removing a file or creating a directory, is a candidate for logging. Similarly, database updates and requests that should change the state of a web service should be logged.
  • Whenever there's a problem making a persistent state change, there should be a message with a level of ERROR. Any OS-level exceptions can be logged when they are caught and handled.
  • In long, complex calculations, it may be helpful to include DEBUG messages after particularly important assignment statements.
  • Any change to an internal application resource deserves a DEBUG message so that object state changes can be tracked through the log.
  • When the application enters an erroneous state, a message should be written, generally from within an exception handler. When exceptions are being silenced or transformed, a DEBUG message might be more appropriate than a log entry at the CRITICAL level.

The third aspect of logging is configuring the loggers so that they route the requests to the appropriate destination. By default, with no configuration at all, the loggers will all quietly create log events but won't display them.

With minimal configuration, we can see all of the log events on the console. This can be done with the basicConfig() method and covers a large number of simple use cases without any real fuss. Instead of a stream, we can use a filename to provide a named file. Perhaps the most important feature is providing a simple way to enable debugging by setting the logging level on the root logger from the basicConfig() method.
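
For example, this one-line change (with a hypothetical file name) captures every message of DEBUG level and above in a file instead of on the console:

import logging

logging.basicConfig(filename="debug.log", level=logging.DEBUG)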

The example configurations in this recipe use two common handlers, the StreamHandler and FileHandler classes. There are over a dozen more handlers, each with unique features for gathering and publishing log messages.

There's more...

In order to route the different loggers to different destinations, we'll need a more sophisticated configuration. This goes beyond what we can build with the basicConfig() function. We'll need to use the logging.config module, and the dictConfig() function. This can provide a complete set of configuration options. The easiest way to use this function is to write the configuration in YAML and then convert this to an internal dict object using the yaml.load() function:

from textwrap import dedent
config_yaml = dedent("""
version: 1
formatters:
    default:
        style: "{"
        format: "{levelname}:{name}:{message}"
        #   Example: INFO:overview_stats.detail:read x.yaml
    timestamp:
        style: "{"
        format: "{asctime}//{levelname}//{name}//{message}"
handlers:
    console:
        class: logging.StreamHandler
        stream: ext://sys.stderr
        formatter: default
    file:
        class: logging.FileHandler
        filename: data/write.log
        formatter: timestamp
loggers:
    overview_stats.detail:
        handlers:
        -   console
    overview_stats.write:
        handlers:
        -   file
        -   console
root:
    level: INFO
""")

The YAML document is enclosed in a triple-quoted string. This allows us to write as much text as necessary. We've defined five things in the big block of text using YAML notation:

  • The value of the version key must be 1.
  • The value of the formatters key defines the log format. If this is not specified, the default format shows only the message body, without any level or logger information:
    • The default formatter defined here mirrors the format created by the basicConfig() function.
    • The timestamp formatter defined here is a more complex format that includes the datetime stamp for the record. To make the file easier to parse, a column separator of // was used.
  • The handlers key defines the two handlers for the two loggers. The console handler writes to the sys.stderr stream. We specified the formatter this handler will use. This definition parallels the configuration created by the basicConfig() function. Unsurprisingly, the FileHandler class writes to a file. The default mode for opening the file is a, which will append to the file with no upper limit on the file size. There are other handlers that can rotate through multiple files, each of a limited size, as sketched after this list. We've provided an explicit filename, and the formatter that will put more detail into the file than is shown on the console.
  • The loggers key provides a configuration for the two loggers that the application will create. Any logger name that begins with overview_stats.detail will be handled only by the console handler. Any logger name that begins with overview_stats.write will go to both the file handler and the console handler.
  • The root key defines the top-level logger. It has a name of '' (the empty string) in case we need to refer to it in code. Setting the level on the root logger will set the level for all of the children of this logger.
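
As a sketch of the rotation idea mentioned in the handlers entry above, the file handler could be replaced with a logging.handlers.RotatingFileHandler entry. The extra keys are passed through to the handler class as keyword arguments; the size and backup count shown here are arbitrary example values:

    file:
        class: logging.handlers.RotatingFileHandler
        filename: data/write.log
        maxBytes: 1048576
        backupCount: 3
        formatter: timestamp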

Use the configuration to wrap the main() function like this:

import logging.config

logging.config.dictConfig(
    yaml.load(config_yaml, Loader=yaml.SafeLoader))
main()
logging.shutdown()

This will start the logging in a known state. It will do the processing of the application. It will finalize all of the logging buffers and properly close any files.

See also

  • See the Designing scripts for composition recipe earlier in this chapter for the complementary part of this application.