14

Application Integration: Combination

The Python language is designed to permit extensibility. We can create sophisticated programs by combining a number of smaller components. In this chapter, we'll look at ways to combine smaller components into larger, more sophisticated applications.

We'll look at the complications that can arise from composite applications and the need to centralize some features, like command-line parsing. This will enable us to create uniform interfaces for a variety of closely related programs.

We'll extend some of the concepts from Chapter 7, Basics of Classes and Objects, and Chapter 8, More Advanced Class Design, and apply the idea of the Command Design Pattern to Python programs. By encapsulating features in class definitions, we'll find it easier to combine features.

In this chapter, we'll look at the following recipes:

  • Combining two applications into one
  • Combining many applications using the Command Design Pattern
  • Managing arguments and configuration in composite applications
  • Wrapping and combining CLI applications
  • Wrapping a program and checking the output
  • Controlling complex sequences of steps

We'll start with a direct approach to combining multiple Python applications into a single, more sophisticated and useful application. We'll expand this to apply object-oriented design techniques and create an even more flexible composite. The next layer is to create uniform command-line argument parsing for composite applications.

Combining two applications into one

In the Designing scripts for composition recipe in Chapter 13, Application Integration: Configuration, we looked at a simple application that creates a collection of statistics by simulating a process. In the Using logging for control and audit output recipe in Chapter 13, Application Integration: Configuration, we looked at an application that summarizes a collection of statistics. In this recipe, we'll combine the separate simulation and summarizing applications to create a single, composite application that performs both a simulation and summarizes the resulting data.

There are several common approaches to combining multiple applications:

  • A shell script can run the simulator and then run the summary.
  • A Python program can stand in for the shell script and use the runpy module to run each program.
  • We can build a composite application from the essential components of each application.

In the Designing scripts for composition recipe, we examined three aspects that many applications implement:

  • Gathering input
  • Producing output
  • The essential processing that connects input and output

This separation of concerns can be helpful for suggesting how components can be selected from multiple applications and recombined into a new, larger application.

In this recipe, we'll look at a direct way to combine applications by writing a Python application that treats other applications as separate modules.

Getting ready

In the Designing scripts for composition and Using logging for control and audit output recipes in Chapter 13, Application Integration: Configuration, we followed a design pattern that separated the input gathering, the essential processing, and the production of output. The objective behind that design pattern was gathering the interesting pieces together to combine and recombine them into higher-level constructs.

Note that we have a tiny mismatch between the two applications. Borrowing a phrase from database engineering (and also electrical engineering), we can call this an impedance mismatch. In electrical engineering, it's a problem with circuit design, and it's often solved by using a device called a transformer to match the impedance between circuit components.

In database engineering, this kind of problem surfaces when the database has normalized, flat data, but the programming language uses richly structured complex objects. For SQL databases, this is a common problem, and packages such as SQLAlchemy are used as an Object-Relational Mapping (ORM) layer. This layer is a transformer between flat database rows (often from multiple tables) and complex Python structures.

When building a composite application, the impedance mismatch that surfaces in this example is a cardinality problem. The simulator is designed to run more frequently than the statistical summarizer. We have several choices for addressing issues such as this one:

  • Total Redesign: This may not be a sensible alternative because the two component applications have an established base of users. In other cases, the new use cases are an opportunity to make sweeping fixes and retire some technical debt.
  • Include the Iterator: This means that when we build the composite application, we'll add a for statement to perform many simulation runs and then process this into a single summary. This parallels the original design intent.
  • List of One: This means that the composite application will run one simulation and provide this single simulation output to the summarizer. This doesn't follow the original intent well, but it has the advantage of simplicity.

The choice between these design alternatives depends on the user story that leads to creating the composite application in the first place. It may also depend on the established base of users. For our purposes, we'll assume that the users have come to realize that 1,000 simulation runs of 1,000 samples is now their standard approach, and they would like to follow the Include the Iterator design to create a composite process.

How to do it...

We'll follow a design pattern that decomposes a complex process into functions that are independent of input or output details. See the Designing scripts for composition recipe in Chapter 13, Application Integration: Configuration, for details on this.

  1. Import the essential functions from the working modules. In this case, the two modules have the relatively uninteresting names ch13_r05 and ch13_r06:
    from Chapter_13.ch13_r05 import roll_iter
    from Chapter_13.ch13_r06 import gather_stats, Outcome
    
  2. Import any other modules required. We'll use a Counter collection to prepare the summaries in this example:
    import argparse
    import collections
    import logging
    import time
    import sys
    from typing import List, Counter, Tuple, Iterable, Dict, Optional
    
  3. Create a new function that combines the existing functions from the other applications. The output from one function is input to another:
    def summarize_games(
            total_games: int, *, seed: Optional[int] = None
        ) -> Counter[Outcome]:
        game_statistics = gather_stats(
            roll_iter(total_games, seed=seed))
        return game_statistics
    
  4. Write the output-formatting functions that use this composite process. Here, for example, is a composite process that exercises the summarize_games() function. This also writes the output report:
    def simple_composite(
            games: int = 100, rolls: int = 1_000) -> None:
        start = time.perf_counter()
        stats = summarize_games(games*rolls)
        end = time.perf_counter()
        games = sum(stats.values())
        print("games", games, "rolls", rolls)
        print(win_loss(stats))
        print(f"serial: {end-start:.2f} seconds")
    
  5. Gathering command-line options can be done using the argparse module. There are examples of this in recipes such as the Designing scripts for composition recipe.

The combined functionality is now a function, simple_composite(), that we can invoke from a block of code like the following:

if __name__ == "__main__":
    logging.basicConfig(stream=sys.stderr, level=logging.INFO)
    simple_composite(games=1000, rolls=1000)

This gives us a combined application, written entirely in Python. We can write unit tests for this composite, as well as each of the individual steps that make up the overall application.
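Here, as a minimal sketch, is what such a unit test might look like. It assumes the composite module is Chapter_14/ch14_r01.py (the name used in the timing output later in this recipe) and that a fixed seed value makes roll_iter() deterministic; the test name is illustrative:

from Chapter_14.ch14_r01 import summarize_games

def test_summarize_games_outcome_total():
    stats = summarize_games(100, seed=42)
    # Each simulated game produces exactly one Outcome,
    # so the counts must add up to the number of games.
    assert sum(stats.values()) == 100

A test like this exercises the combined behavior without touching the command-line interface or the output formatting.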

How it works...

The central feature of this design is the separation of the various concerns of the application into isolated functions or classes. The two component applications started with a design divided up among input, process, and output concerns. Starting from this base made it easier to import and reuse the processing. This also left the two original applications in place, unchanged.

The objective is to import functions from working modules and avoid copy-and-paste programming. Copying a function from one file and pasting it into another means that any change made to one is unlikely to be made to the other. The two copies will slowly diverge, leading to a phenomenon sometimes called code rot.

When a class or function does several things, the reuse potential is reduced. This leads to the inverse power law of reuse: the reusability of a class or function, R(c), is related to the inverse of the number of features in that class or function, F(c):

R(c) ∝ 1 / F(c)

A single feature aids reuse. Multiple features reduce the opportunities for reuse of a component.

When we look at the two original applications from the Designing scripts for composition and Using logging for control and audit output recipes in Chapter 13, Application Integration: Configuration, we can see that the essential functions had few features. The roll_iter() function simulated a game and yielded results. The gather_stats() function gathered statistics from any source of data.

The idea of counting features depends, of course, on the level of abstraction. From a small-scale view, the functions do many small things. From a very large-scale view, the functions require several helpers to form a complete application; from this viewpoint, an individual function is only a part of a feature.

In this case, one application created files. The second application summarized files. Feedback from users may have revealed that the distinction was not important or perhaps confusing. This led to a redesign to combine the two original steps into a one-step operation.

There's more...

We'll look at three additional areas of rework of the application:

  • Structure: The Combining two applications into one recipe did not do a good job of distinguishing between processing aspects and output aspects. When trying to create a composite application, we may need to refactor the component modules to look for better organization of the features.
  • Performance: We can run several roll_iter() instances in parallel to make use of multiple cores.
  • Logging: When multiple applications are combined, the combined logging can become complex. When we need to observe the operations of the program for auditing and debugging purposes, we may need to refactor the logging.

We'll go through each area in turn.

Structure

In some cases, it becomes necessary to rearrange software to expose useful features. In one of the components, the ch13_r06 module, the process_all_files() function seemed to do too much.

This function combined source file iteration, detailed processing, and output creation in one place. The result_file.write() output processing was a single, complex statement that seemed unrelated to gathering and summarizing data.

In order to reuse this file-writing feature between two distinct applications, we'll need to refactor the ch13_r06 application so that the file output is not buried inside the process_all_files() function.

One line of code, result_file.write(...), needs to be replaced with a separate function. This is a small change. The details are left as an exercise for the reader. When the output operation is defined as a separate function, it is easier to change to new physical formats or logical layouts of data.

This refactoring also makes the new function available for other composite applications. When multiple applications share a common function, then it's much more likely that outputs between the applications are actually compatible.
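While the full details are an exercise, a sketch of the refactoring might look like the following. The name write_summary() and the one-line body are placeholders; the real signature and layout depend on the code in ch13_r06:

def write_summary(result_file, game_summary) -> None:
    """The extracted output operation: the only code that knows the layout."""
    result_file.write(f"{game_summary}\n")

With the output isolated this way, process_all_files() can call write_summary() and remain focused on gathering and summarizing data.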

Performance

Running many simulations followed by a single summary is a kind of map-reduce design. The detailed simulations are a kind of mapping that creates raw data. These can be run concurrently, using multiple cores and multiple processors. The final summary is created from all of the simulations via a statistical reduction.

We often use OS features to run multiple concurrent processes. The POSIX shells include the & operator, which can be used to fork concurrent subprocesses. Windows has a start command, which is similar to the POSIX & operator. We can leverage Python directly to spawn a number of concurrent simulation processes.

One module for doing this is the futures module from the concurrent package. We can build a parallel simulation processor by creating an instance of ProcessPoolExecutor. We can submit requests to this executor and then collect the results from those concurrent requests:

import collections
import multiprocessing
import time
from typing import Optional
from concurrent import futures
# summarize_games(), win_loss(), Outcome, and Counter are defined
# or imported earlier in this module.
def parallel_composite(
        games: int = 100,
        rolls: int = 1_000,
        workers: Optional[int] = None) -> None:
    start = time.perf_counter()
    total_stats: Counter[Outcome] = collections.Counter()
    worker_list = []
    with futures.ProcessPoolExecutor(max_workers=workers) as executor:
        for i in range(games):
            worker_list.append(
                executor.submit(summarize_games, rolls))
        for worker in worker_list:
            stats = worker.result()
            total_stats.update(stats)
    end = time.perf_counter()
    games = sum(total_stats.values())
    print("games", games, "rolls", rolls)
    print(win_loss(total_stats))
    if workers is None:
        workers = multiprocessing.cpu_count()
    print(f"parallel ({workers}): {end-start:.2f} seconds")

We've initialized three objects: start, total_stats, and worker_list. The start object has the time at which processing started; time.perf_counter() is often the most accurate timer available. total_stats is a Counter object that will collect the final statistical summary. worker_list will be a list of individual Future objects, one for each request that's made.

The futures.ProcessPoolExecutor method creates a processing context in which a pool of workers is available to handle requests. By default, the pool has as many workers as the number of processors. Each worker process will import the module that creates the pool. All functions and classes defined in that module are available to the workers.

The submit() method of an executor is given a function to execute along with arguments to that function. In this example, there will be 100 requests made, each of which will simulate 1,000 games and return the sequence of dice rolls for those games. submit() returns a Future object, which is a model for the working request.

After submitting all 100 requests, the results are collected. The result() method of a Future object waits for the processing to finish and gathers the resulting object. In this example, the result is a statistical summary of 1,000 games. These are then combined to create the overall total_stats summary.

Here's a comparison of serial and parallel execution on a four-core processor:

(cookbook) % export PYTHONPATH=.
(cookbook) slott@MacBookPro-SLott Modern-Python-Cookbook-Second-Edition % python Chapter_14/ch14_r01.py --rolls 10_000 --serial
games 1000000 rolls 10000
Counter({'loss': 507921, 'win': 492079})
serial: 13.53 seconds
(cookbook) slott@MacBookPro-SLott Modern-Python-Cookbook-Second-Edition % python Chapter_14/ch14_r01.py --rolls 10_000 --parallel
games 1000000 rolls 10000
Counter({'loss': 506671, 'win': 493329})
parallel: 8.15 seconds

Concurrent simulation cuts the elapsed processing time from 13.53 seconds to 8.15 seconds. The runtime improved by 40%. Since there are four cores in the processing pool for concurrent requests, why isn't the time cut to 1/4th of the original time, or 3.38 seconds?

There is considerable overhead in spawning the subprocesses, communicating the request data, and collecting the result data from those subprocesses. It's interesting to create more workers than CPUs to see if this improves performance. It's also interesting to switch from ProcessPoolExecutor to ThreadPoolExecutor to see which offers the best performance for this specific workload.
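A hypothetical experiment along these lines might time the same workload under both executor classes and a range of pool sizes. This sketch assumes summarize_games() is importable from this chapter's module; timed_run() is an illustrative name:

from concurrent import futures
import time

from Chapter_14.ch14_r01 import summarize_games

def timed_run(executor_class, workers: int) -> float:
    """Time 100 requests, each simulating 1,000 games."""
    start = time.perf_counter()
    with executor_class(max_workers=workers) as executor:
        requests = [
            executor.submit(summarize_games, 1_000)
            for _ in range(100)]
        for request in requests:
            request.result()
    return time.perf_counter() - start

if __name__ == "__main__":
    for executor_class in (
            futures.ProcessPoolExecutor, futures.ThreadPoolExecutor):
        for workers in (2, 4, 8, 16):
            elapsed = timed_run(executor_class, workers)
            print(f"{executor_class.__name__}({workers}): {elapsed:.2f}s")

Because the simulation is CPU-bound Python code, the GIL generally limits the thread pool version; measuring, rather than guessing, settles the question for a specific workload.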

Logging

In the Using logging for control and audit output recipe in Chapter 13, Application Integration: Configuration, we looked at how to use the logging module for control, audit, and error outputs. When we build a composite application, we'll have to combine the logging features from each of the original applications.

Logging involves a three-part recipe:

  1. Creating logger objects. This is generally a line such as logger = logging.getLogger('some_name'). It's generally done once at the class or module level.
  2. Using the logger objects to collect events. This involves lines such as logger.info('some message'). These lines are scattered throughout an application.
  3. Configuring the logging system as a whole. While this is not required, it is simplest when logging configuration is done only at the outermost, global scope of the application. This makes it easy to ignore when building composite applications. Ideally, it looks like this:
    if __name__ == "__main__":
        # logging configuration should only go here.
        main()
        logging.shutdown()
    

When creating composite applications, we may wind up with multiple logging configurations. There are two approaches that a composite application can follow:

  • The composite application is built with a final configuration, which intentionally overwrites all previously-defined loggers. This is the default behavior and can be stated explicitly via incremental: false in a YAML configuration document.
  • The composite application preserves other application loggers and merely modifies the logger configurations, perhaps by setting the overall level. This is not the default behavior and requires including incremental: true in the YAML configuration document.

The use of incremental configuration can be helpful when combining Python applications that don't isolate their logging configuration. Properly configuring logging for a composite application, without duplicating data among the various logs from the various components, can require some time spent reading and understanding the code of each application.
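Here, as a sketch, is what an incremental configuration might look like when applied from the composite's top level. The logger name simulation is a placeholder for whatever names the component applications actually use, and PyYAML is assumed to be available:

import logging.config
import yaml

config_text = """
version: 1
incremental: true
loggers:
  simulation:
    level: ERROR
"""
logging.config.dictConfig(yaml.safe_load(config_text))

This quiets one component's logger without disturbing the handlers and formatters that the components configured for themselves.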

See also

  • In the Designing scripts for composition recipe in Chapter 13, Application Integration: Configuration, we looked at the core design pattern for a composable application.

Combining many applications using the Command Design Pattern

Many complex suites of applications follow a design pattern similar to the one used by the Git program. There's a base command, git, with a number of subcommands. For example, git pull, git commit, and git push.

What's central to this design is the idea of a collection of individual commands under a common parent command. Each of the various features of Git can be thought of as a separate class definition that performs a given function.

In this recipe, we'll see how we can create families of closely related commands.

Getting ready

We'll imagine an application built from three commands. This is based on the applications shown in the Designing scripts for composition and Using logging for control and audit output recipes in Chapter 13, Application Integration: Configuration, as well as the Combining two applications into one recipe from earlier in this chapter. We'll have three applications—simulate, summarize, and a combined application called simsum.

These features are based on modules with names such as ch13_r05, ch13_r06, and ch14_r01. The idea is that we can restructure these separate modules into a single class hierarchy following the Command Design Pattern.

There are two key ingredients to this design:

  1. The client depends only on the abstract superclass, Command.
  2. Each individual subclass of the Command superclass has an identical interface. We can substitute any one of them for any other.

When we've done this, then an overall application script can create and execute any one of the Command subclasses.

How to do it...

We'll start by creating a superclass for all of the related commands. We'll then extend that superclass for each specific command that is part of the overall application.

  1. The overall application will have a structure that attempts to separate the features into two categories—argument parsing and command execution. Each subcommand will include both processing and the output bundled together. We're going to rely on argparse.Namespace to provide a very flexible collection of options to each subclass. This is not required but will be helpful in the Managing arguments and configuration in composite applications recipe later in this chapter. Here's the Command superclass:
    import argparse
    class Command:
        def __init__(self) -> None:
            pass
        def execute(self, options: argparse.Namespace) -> None:
            pass
    
  2. Create a subclass of the Command superclass for the Simulate (command) class. This will wrap the processing and output from the ch13_r05 module in the execute() method of this class:
    from pathlib import Path
    from typing import Any, Optional
    import Chapter_13.ch13_r05 as ch13_r05
    class Simulate(Command):
        def __init__(self) -> None:
            super().__init__()
            self.seed: Optional[Any] = None
            self.game_path: Path
        def execute(self, options: argparse.Namespace) -> None:
            self.game_path = Path(options.game_file)
            if 'seed' in options:
                self.seed = options.seed
            data = ch13_r05.roll_iter(options.games, self.seed)
            ch13_r05.write_rolls(self.game_path, data)
            print(f"Created {str(self.game_path)}")
    
  3. Create a subclass of the Command superclass for the Summarize (command) class. For this class, we've wrapped the file creation and the file processing into the execute() method of the class:
    import Chapter_13.ch13_r06 as ch13_r06
    class Summarize(Command):
        def execute(self, options: argparse.Namespace) -> None:
            self.summary_path = Path(options.summary_file)
            with self.summary_path.open("w") as result_file:
                game_paths = [Path(f) for f in options.game_files]
                ch13_r06.process_all_files(result_file, game_paths)
    
  4. The overall composite processing can be performed by the following main() function:
    from argparse import Namespace
    def main() -> None:
        options_1 = Namespace(
            games=100, game_file="x.yaml")
        command1 = Simulate()
        command1.execute(options_1)
        options_2 = Namespace(
            summary_file="y.yaml", game_files=["x.yaml"])
        command2 = Summarize()
        command2.execute(options_2)
    

We've created two commands: an instance of the Simulate class and an instance of the Summarize class. These can be executed to provide a combined feature that both simulates and summarizes data.

How it works...

Creating interchangeable, polymorphic classes for the various subcommands is a handy way to provide an extensible design. The Command Design Pattern strongly encourages each individual subclass to have an identical signature. Doing this makes it easier for the command subclasses to be created and executed. Also, new commands can be added that fit the framework.

One of the SOLID design principles is the Liskov Substitution Principle (LSP). Any of the subclasses of the Command abstract class can be used in place of the parent class.

Each Command instance has a simple interface. There are two features:

  • The __init__() method creates the instance; in this design, it requires no arguments.
  • The execute() method expects a Namespace object that's created by the argument parser. Each class will pick only the needed values from this namespace, ignoring any others. This allows global arguments to be ignored by a subcommand that doesn't require them. This method does the processing and writes any output, based entirely on the values in the namespace.

The use of the Command Design Pattern makes it easy to be sure that Command subclasses can be interchanged with each other. The overall main() script can create instances of the Simulate or Summarize classes. The substitution principle means that either instance can be executed because the interfaces are the same. This flexibility makes it easy to parse the command-line options and create an instance of either of the available classes. We can extend this idea and create sequences of individual command instances.

There's more...

One of the more common extensions to this design pattern is to provide for composite commands. In the Combining two applications into one recipe, we showed one way to create composites. This is another way, based on defining a new command that implements a combination of existing commands:

from typing import Type

class Sequence(Command):
    def __init__(self, *commands: Type[Command]) -> None:
        super().__init__()
        self.commands = [command() for command in commands]
    def execute(self, options: argparse.Namespace) -> None:
        for command in self.commands:
            command.execute(options) 

This class will accept other Command classes via the *commands parameter, which combines all of the positional argument values into a sequence. From these classes, the __init__() method builds the individual command instances.

We might use this Sequence class like this:

options = Namespace(
    games=100, 
    game_file="x.yaml", 
    summary_file="y.yaml", 
    game_files=["x.yaml"]
)
both_command = Sequence(Simulate, Summarize)
both_command.execute(options)

We created an instance of Sequence built from two other classes—Simulate and Summarize. The __init__() method will build an internal sequence of the two objects. The execute() method of the both_command object will then perform the two processing steps in sequence.

This design, while simple, exposes some implementation details. In particular, the two class names and the intermediate x.yaml file are details that can be encapsulated into a better class design.

We can create a slightly nicer subclass of Sequence if we focus specifically on the two commands being combined. This will have an __init__() method that follows the pattern of other Command subclasses:

class SimSum(Sequence):
    def __init__(self) -> None:
        super().__init__(Simulate, Summarize)
    def execute(self, options: argparse.Namespace) -> None:
        self.intermediate = (
            Path("data") / "ch14_r02_temporary.yaml"
        )
        new_namespace = Namespace(
            game_file=str(self.intermediate),
            game_files=[str(self.intermediate)],
            **vars(options)
        )
        super().execute(new_namespace)

This class definition incorporates two other classes into the already defined Sequence class structure. super().__init__() invokes the parent class initialization with the Simulate and Summarize classes.

This provides a composite application definition that conceals the details of how a file is used to pass data from the first step to a subsequent step. This is purely a feature of the composite integration and doesn't lead to any changes in either of the original applications that form the composite.
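A sketch of an invocation parallels the other Command subclasses. The option values here are placeholders, and the data directory used for the intermediate file is assumed to exist:

from argparse import Namespace

options = Namespace(games=100, summary_file="y.yaml")
command = SimSum()
command.execute(options)

The caller never mentions game_file or game_files; the SimSum class supplies those values internally.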

See also

  • In the Designing scripts for composition and Using logging for control and audit output recipes in Chapter 13, Application Integration: Configuration, we looked at the constituent parts of this composite application.
  • In the Combining two applications into one recipe earlier in this chapter, we looked at the constituent parts of this composite application. In most cases, we'll need to combine elements of all of these recipes to create a useful application.
  • We'll often need to follow the Managing arguments and configuration in composite applications recipe, which comes next in this chapter.

Managing arguments and configuration in composite applications

When we have a complex suite (or system) of individual applications, it's common for several applications to share common features. With completely separate applications, each external Command-Line Interface (CLI) is tied directly to the software architecture. It becomes awkward to refactor the software components because changes will also alter the visible CLI.

The coordination of common features among many applications can become awkward. As a concrete example, imagine defining the various, one-letter abbreviated options for command-line arguments. We might want all of our applications to use -v for verbose output: this is an example of an option that would require careful coordination. Ensuring that there are no conflicts might require keeping some kind of master list of options, outside all of the individual applications.

This kind of common configuration should be kept in only one place in the code somewhere. Ideally, it would be in a common module, used throughout a family of applications.
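One way to achieve this is a small, shared module that every application in the suite imports. The module and function names here are illustrative, not from the book's code:

# common_cli.py -- shared argument definitions for the whole suite.
import argparse

def add_common_options(parser: argparse.ArgumentParser) -> None:
    """Define the options every application must accept, in one place."""
    parser.add_argument(
        "-v", "--verbose", action="store_true", default=False)

Each application's option-parsing function calls add_common_options(parser) before adding its own unique arguments, so the shared abbreviations can never drift apart.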

Additionally, we often want to divorce the modules that perform useful work from the CLI. This lets us refactor the design without changing the user's understanding of how to use the application.

In this recipe, we'll look at ways to ensure that a suite of applications can be refactored without creating unexpected changes to the CLI. This means that complex additional design notes and instructions to users aren't required.

Getting ready

Many complex suites of applications follow a design pattern similar to the one used by Git. There's a base command, git, with a number of subcommands. For example, git pull, git commit, and git push. The core of the CLI can be centralized by the git command. The subcommands can then be organized and reorganized as needed with fewer changes to the visible CLI.

We'll imagine an application built from three commands. This is based on the applications shown in the Designing scripts for composition and Using logging for control and audit output recipes in Chapter 13, Application Integration: Configuration, and the Combining two applications into one recipe earlier in this chapter. We'll have three applications with three commands: craps simulate, craps summarize, and the combined application, craps simsum.

We'll rely on the subcommand design from the Combining many applications using the Command Design Pattern recipe earlier in this chapter. This will provide a handy hierarchy of Command subclasses:

  • The Command class is an abstract superclass.
  • The Simulate subclass performs the simulation functions from the Designing scripts for composition recipe.
  • The Summarize subclass performs summarization functions from the Using logging for control and audit output recipe.
  • A SimSum subclass can perform combined simulation and summarization, following the ideas of the Combining two applications into one recipe.

In order to create a simple command-line application, we'll need appropriate argument parsing.

This argument parsing will rely on the subcommand parsing capability of the argparse module. We can create a common set of command options that apply to all subcommands. We can also create unique options for each subcommand.

How to do it...

This recipe will start with a consideration of what the CLI commands need to look like. This first step might involve some prototypes or examples to be sure that the commands are truly useful to the user. After that, we'll implement the argument definitions in each of the Command subclasses.

  1. Define the CLI. This is an exercise in User Experience (UX) design. While most UX is focused on web and mobile device applications, the core principles are appropriate for CLI applications and servers as well. Earlier, we noted that the root application will be craps. It will have the following three subcommands:
    craps simulate -o game_file -g games 
    craps summarize -o summary_file game_file ... 
    craps simsum -g games 
    
  2. Define the root Python application. Consistent with other files in this book, we'll call it ch14_r03.py. At the OS level, we can provide an alias or a link to make the visible interface match the user expectation of a command like craps.
  3. We'll import the class definitions from the Combining many applications using the Command Design Pattern recipe. This will include the Command superclass and the Simulate, Summarize, and SimSum subclasses. We'll extend the Command class with an additional method, arguments(), to set the unique options in the argument parser for this command. This is a class method and is called on the class as a whole, not an instance of the class:
    class Command:
        @classmethod
        def arguments(
                cls,
                sub_parser: argparse.ArgumentParser
        ) -> None:
            pass
        def __init__(self) -> None:
            pass
        def execute(self, options: argparse.Namespace) -> None:
            pass
    
  4. Here are the unique options for the Simulate command. We won't repeat the entire class definition, only the new arguments() method. This creates the arguments unique to the craps simulate command:
    class Simulate(Command):
        @classmethod
        def arguments(
                cls, 
                simulate_parser: argparse.ArgumentParser
            ) -> None:
            simulate_parser.add_argument(
                "-g", "--games", type=int, default=100000)
            simulate_parser.add_argument(
                "-o", "--output", dest="game_file")
            simulate_parser.add_argument(
                "--seed",
                default=os.environ.get("CH14_R03_SEED", None)
            )
            simulate_parser.set_defaults(command=cls)
    
  5. Here is the new arguments() method of the Summarize command. This method creates arguments unique to the craps summarize command:
    class Summarize(Command):
        @classmethod
        def arguments(
                cls, 
                summarize_parser: argparse.ArgumentParser
            ) -> None:
            summarize_parser.add_argument(
                "-o", "--output", dest="summary_file")
            summarize_parser.add_argument(
                "game_files", nargs="*", type=Path)
            summarize_parser.set_defaults(command=cls)
    
  6. Here is the new arguments() method for the composite command, SimSum. This method creates arguments appropriate for the combined command:
    class SimSum(Command):
        @classmethod
        def arguments(
                cls, 
                simsum_parser: argparse.ArgumentParser
            ) -> None:
            simsum_parser.add_argument(
                "-g", "--games", type=int, default=100000)
            simsum_parser.add_argument(
                "-o", "--output", dest="summary_file")
            simsum_parser.add_argument(
                "--seed",
                default=os.environ.get("CH14_R03_SEED", None)
            )
            simsum_parser.set_defaults(command=cls)
    
  7. Create the overall argument parser. Use this to create a subparser builder. For each command, create a parser and add arguments that are unique to that command. The subparsers object will be used to create each subcommand's argument definition:
    import argparse
    def get_options(
            argv: List[str] = sys.argv[1:]
        ) -> argparse.Namespace:
        parser = argparse.ArgumentParser(prog="craps")
        subparsers = parser.add_subparsers()
        simulate_parser = subparsers.add_parser("simulate")
        Simulate.arguments(simulate_parser)
        
        summarize_parser = subparsers.add_parser("summarize")
        Summarize.arguments(summarize_parser)
        simsum_parser = subparsers.add_parser("simsum")
        SimSum.arguments(simsum_parser)
    
  8. Parse the command-line values. In this case, the overall argument to the get_options() function is expected to be the value of sys.argv[1:]: all of the command-line arguments after the name of the script. We can override the argument value for testing purposes:
    options = parser.parse_args(argv)
    if "command" not in options:
        parser.error("No command selected")
    return options
    

The overall parser includes three subcommand parsers. One will handle the craps simulate command, another handles craps summarize, and the third handles craps simsum. Each subcommand has slightly different combinations of options.

The command option is set via the set_defaults() method. This includes useful additional information about the command to be executed. In this case, we've provided the class that must be instantiated. The class will be a subclass of Command, with a known interface.

The overall application is defined by the following main() function:

def main() -> None:
    options = get_options(sys.argv[1:])
    command = cast(Type[Command], options.command)()
    command.execute(options)

The options will be parsed. Each distinct subcommand sets a unique class value for the options.command argument. This class is used to build an instance of a Command subclass. This object will have an execute() method that does the real work of this command.

Implement the OS wrapper for the root command. For Linux or macOS, we might have a file named craps. The file would have rx permissions so that it is both readable and executable by other users. The content of the file could be this line:

python Chapter_14/ch14_r03.py $* 

This small shell script provides a handy way to enter a command of craps and have it properly execute a Python script with a somewhat more complex name.

When the PYTHONPATH environment variable includes the applications we're building, we can also use this command to run them:

python -m Chapter_14.ch14_r03

This uses Python's sys.path to look for the package named Chapter_14 and the ch14_r03 module within that package.

How it works...

There are two parts to this recipe:

  • Using the Command Design Pattern to define a related set of classes that are polymorphic. For more information on this, see the Combining many applications using the Command Design Pattern recipe. In this case, we pushed parameter definition, initialization, and execution down to each subcommand as methods of their respective subclasses.
  • Using features of the argparse module to handle subcommands.

The argparse module feature that's important here is the add_subparsers() method of a parser. This method returns an object that is used to build each distinct subcommand parser. We assigned this object to the subparsers variable.

We also used the set_defaults() method of each subparser to add a command argument. This argument is populated only by the defaults defined for the subparser that was actually used, so the value of the command attribute shows which of the subcommands was invoked.

Each subparser is built using the add_parser() method of the subparsers object. The parser object that is returned can then have arguments and defaults defined.

When the overall parser is executed, it will parse any arguments defined outside the subcommands. If there's a subcommand, this is used to determine how to parse the remaining arguments.

Look at the following command:

craps simulate -g 100 -o x.yaml 

This command will be parsed to create a Namespace object that looks like this:

Namespace(command=<class '__main__.Simulate'>, game_file='x.yaml', games=100) 

The command attribute in the Namespace object is the default value provided as part of the subcommand definition. The values for game_file and games come from the -o and -g options.
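Because get_options() accepts an explicit argument list, this behavior is easy to confirm without touching sys.argv. A quick sketch of such a check:

options = get_options(["simulate", "-g", "100", "-o", "x.yaml"])
assert options.games == 100
assert options.game_file == "x.yaml"
assert options.command is Simulate

This style of test works for each of the three subcommands.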

There's more...

The get_options() function has an explicit list of classes that it incorporates into the overall command. As shown, a number of lines of code are repeated, and this could be optimized. We can provide a data structure that replaces a number of lines of code:

def get_options_2(argv: List[str] = sys.argv[1:]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="craps")
    subparsers = parser.add_subparsers()
    sub_commands = [
        ("simulate", Simulate),
        ("summarize", Summarize),
        ("simsum", SimSum),
    ]
    for name, subc in sub_commands:
        cmd_parser = subparsers.add_parser(name)
        subc.arguments(cmd_parser)
    options = parser.parse_args(argv)
    if "command" not in options:
        parser.error("No command selected")
    return options

This variation on the get_options() function uses a sequence of two-tuples to provide the command name and the relevant class to implement the command. Iterating through this list assures that all of the various subclasses of Command are processed in a perfectly uniform manner.

We have one more optimization, but this relies on an internal feature of Python class definitions. Each class has references to subclasses built from the class, available via the __subclasses__() method. We can leverage this to create options that do not have an explicit list of classes. This doesn't always work out well, because any abstract subclasses aren't excluded from the list. For a very complex hierarchy, additional processing is required to confirm the classes are concrete:

def get_options_3(argv: List[str] = sys.argv[1:]) -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="craps")
    subparsers = parser.add_subparsers()
    for subc in Command.__subclasses__():
        cmd_parser = subparsers.add_parser(subc.__name__.lower())
        subc.arguments(cmd_parser)
    options = parser.parse_args(argv)
    if "command" not in options:
        parser.error("No command selected")
    return options

In this example, all the subclasses of Command are concrete classes. Using the Command.__subclasses__() list does not present any unusual or confusing options. It has the advantage of letting us create new subclasses and have them appear on the command line without any other code changes to expose them to users.
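When a hierarchy does include abstract classes, a recursive walk with a concreteness check is one way to build the list. This sketch assumes the abstract classes are defined with the abc module, so that inspect.isabstract() can recognize them; concrete_subclasses() is an illustrative name:

import inspect
from typing import List, Type

def concrete_subclasses(cls: Type) -> List[Type]:
    """Walk the subclass tree, keeping only concrete classes."""
    result = []
    for subclass in cls.__subclasses__():
        if not inspect.isabstract(subclass):
            result.append(subclass)
        result.extend(concrete_subclasses(subclass))
    return result

Unlike a bare Command.__subclasses__() call, this also finds subclasses of subclasses, such as a SimSum class built on Sequence.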

See also

  • See the Designing scripts for composition and Using logging for control and audit output recipes in Chapter 13, Application Integration: Configuration, for the basics of building applications focused on being composable.
  • See the Combining two applications into one recipe from earlier in this chapter for the background on the components used in this recipe.
  • See the Using argparse to get command-line input recipe in Chapter 6, User Inputs and Outputs, for more on the background of argument parsing.

Wrapping and combining CLI applications

One common kind of automation involves running several programs, none of which are Python applications. Since the programs aren't written in Python, it's impossible to refactor each program to create a composite Python application. When using a non-Python application, we can't follow the Combining two applications into one recipe shown earlier in this chapter.

Instead of aggregating the Python components, an alternative is to wrap the other programs in Python, creating a composite application. The use case is very similar to the use case for writing a shell script. The difference is that Python is used instead of a shell language. Using Python has some advantages:

  • Python has a rich collection of data structures. Most shell languages are limited to strings and arrays of strings.
  • Python has several outstanding unit test frameworks. Rigorous unit testing gives us confidence that the combined application will work as expected.

In this recipe, we'll look at how we can run other applications from within Python.

Getting ready

In the Designing scripts for composition recipe in Chapter 13, Application Integration: Configuration, we identified an application that did some processing that led to the creation of a rather complex result. For the purposes of this recipe, we'll assume that the application is not written in Python.

We'd like to run this program several hundred times, but we don't want to copy and paste the necessary commands into a script. Also, because the shell is difficult to test and has so few data structures, we'd like to avoid using the shell.

For this recipe, we'll pretend that the ch13_r05 application is a native binary application. We'll act as if it was written in C++, Go, or Fortran. This means that we can't simply import a Python module that comprises the application. Instead, we'll have to process this application by running a separate OS process.

For the purposes of pretending this application is a binary executable, we can add a "shebang" line as the first line in the file. In many cases, the following can be used:

#!python3

When this is the first line of the file, most OSes will execute Python with the script file as the command-line argument. For macOS and Linux, use the following to change the mode of the file to executable:

chmod +x Chapter_14/ch14_r05.py 

Marking a file as executable means using the Chapter_14/ch14_r05.py command directly at the command prompt will run our application. See https://docs.python.org/3/using/windows.html#shebang-lines and the Writing Python script and module files recipe in Chapter 2, Statements and Syntax.

We can use the subprocess module to run any application program at the OS level. There are two common use cases for running another program from within Python:

  • The other program doesn't produce any output, or we don't want to gather the output in our Python program. The first situation is typical of OS utilities that return a status code when they succeed or fail. The second situation is typical of programs that update files and produce logs.
  • The other program produces the output; the Python wrapper needs to capture and process it.

In this recipe, we'll look at the first case—the output isn't something we need to capture. In the Wrapping a program and checking the output recipe, we'll look at the second case, where the output is scrutinized by the Python wrapper program.

In many cases, one benefit of wrapping an existing application with Python is the ability to dramatically rethink the UX. This lets us redesign how the CLI works.

Let's look at wrapping a program that's normally started with the following command:

python Chapter_14/ch14_r05.py --samples 10 --output game_${n}.yaml 

The output filename needs to be flexible so that we can run the program hundreds of times. This means creating files with numbers injected into the filenames. We've shown the placeholder for this number with ${n} in the command-line example.

We'd want the CLI to have only two positional parameters, a directory and a number. The program startup would look like this, instead:

python -m Chapter_14.ch14_r04 $TMPDIR 100

This simpler command frees us from having to provide filenames. Instead, we provide the directory and the number of games to simulate, and our Python wrapper will execute the given app, Chapter_14/ch14_r05.py, appropriately.

How to do it...

In this recipe, we'll start by creating a call to subprocess.run() that starts the target application. This is a spike solution (https://wiki.c2.com/?SpikeSolution) that we will use to be sure that we understand how the other application works. Once we have the command, we can wrap this in a function call to make it easier to use.

  1. Import the argparse and subprocess modules and the Path class. We'll also need the sys module and a type hint:
    import argparse
    import subprocess
    from pathlib import Path
    import sys
    from typing import List, Optional
    
  2. Write the core processing, using subprocess to invoke the target application. This can be tested separately to ensure that this really is the shell command that's required. In this case, subprocess.run() will execute the given command, and the check=True option will raise an exception if the status is non-zero. Here's the spike solution that demonstrates the essential processing:
    directory, n = Path("/tmp"), 42
    filename = directory / f"game_{n}.yaml"
    command = [
        "python", 
        "Chapter_13/ch13_r05.py",
        "--samples",
        "10",
        "--output",
        str(filename),
    ]
    subprocess.run(command, check=True)
    
  3. Wrap the spike solution in a function that reflects the desired behavior. The new design requires only a directory name and a number of files. This requires a for statement to create the required collection of files. Each unique filename is created with an f-string that includes a number in a template name. The processing looks like this:
    def make_files(directory: Path, files: int = 100) -> None:
        """Create sample data files."""
        for n in range(files):
            filename = directory / f"game_{n}.yaml"
            command = [
                "python",
                "Chapter_13/ch13_r05.py",
                "--samples",
                "10",
                "--output",
                str(filename),
            ]
            subprocess.run(command, check=True)
    
  4. Write a function to parse the command-line options. In this case, there are two positional parameters: a directory and a number of games to simulate. The function looks like this:
    def get_options(
            argv: Optional[List[str]] = None
    ) -> argparse.Namespace:
        if argv is None:
            argv = sys.argv[1:]
        parser = argparse.ArgumentParser()
        parser.add_argument("directory", type=Path)
        parser.add_argument("games", type=int)
        options = parser.parse_args(argv)
        return options
    
  5. Combine the parsing and execution into a main function:
    def main() -> None:
        options = get_options()
        make_files(options.directory, options.games)
    

We now have a function that's testable using any of the Python unit testing frameworks. This can give us real confidence that we have a reliable application built around an existing non-Python application.

How it works...

The subprocess module is how Python programs run other programs available on a given computer. The run() function in this module does a number of things for us.

In a POSIX (such as Linux or macOS) context, the steps are similar to the following sequence:

  1. Prepare the stdin, stdout, and stderr file descriptors for the child process. In this case, we've accepted the defaults, which means that the child inherits the files being used by the parent. If the child process prints to stdout, it will appear on the console being used by the parent.
  2. Invoke a function like the os.execve() function to start the child process with the given stdin, stdout, and stderr files.
  3. While the child is running, the parent is waiting for the child process to finish and return the final status.
  4. Since we used the check=True option, a non-zero status is transformed into an exception by the run() function.

An OS shell, such as bash, conceals these details from application developers. The subprocess.run() function, similarly, hides the details of creating and waiting for a child process.

Python, via the subprocess module, offers many features similar to the shell. Most importantly, Python offers several additional sets of features:

  • A much richer collection of data structures than the shell.
  • Exceptions to identify problems that arise. This can be much simpler and more reliable than inserting if statements throughout a shell script to check status codes.
  • A way to unit test the script without using OS resources.

Using the subprocess module to run a separate executable allows Python to integrate a wide variety of software components into a unified whole. Using Python instead of the shell for application integration provides clear advantages over writing difficult-to-test shell scripts.
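As a small illustration of the exception behavior, the check=True option turns a non-zero status into an exception that can be handled in one place. This example assumes a python command is on the PATH:

import subprocess

try:
    subprocess.run(
        ["python", "-c", "import sys; sys.exit(2)"],
        check=True)
except subprocess.CalledProcessError as ex:
    print(f"child failed with status {ex.returncode}")

A shell script would need an explicit status check after every command to get the same guarantee.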

There's more...

We'll add a simple clean-up feature to this script. The idea is that all of the output files should be created as an atomic operation. We want all of the files or none of the files. We don't want an incomplete collection of data files.

This fits with the ACID properties:

  • Atomicity: The entire set of data is available or it is not available. The collection is a single, indivisible unit of work.
  • Consistency: The filesystem should move from one internally consistent state to another consistent state. Any summaries or indices will properly reflect the actual files.
  • Isolation: If we want to process data concurrently, then having multiple, parallel processes should work. Concurrent operations should not interfere with each other.
  • Durability: Once the files are written, they should remain on the filesystem. This property almost goes without saying for files. For more complex databases, it becomes necessary to consider transaction data that might be acknowledged by a database client but is not actually written yet to a server.

Isolation and durability are already part of the OS filesystem's semantics. Generally, OS processes with separate working directories work out well. The atomicity and consistency properties, however, can lead to a need for a clean-up operation in the event of an application failure that leaves corrupt files.

In order to clean up, we'll need to wrap the core processing in a try: block. A second function, make_files_clean(), wraps the original make_files() function to add this clean-up feature. It would look like this:

def make_files_clean(directory: Path, files: int = 100) -> None:
    """Create sample data files, with cleanup after a failure."""
    try:
        make_files(directory, files)
    except subprocess.CalledProcessError:
        # Remove any files.
        for partial in directory.glob("game_*.yaml"):
            partial.unlink()
        raise

The exception-handling block does two things. First, it removes any incomplete files from the current working directory. Second, it re-raises the original exception so that the failure will propagate to the client application.

Unit test

We have two scenarios to confirm. These scenarios can be described in Gherkin as follows:

Scenario: Everything Worked
Given An external application, Chapter_14/ch14_r05.py, that works correctly
When The application is invoked 3 times
Then The subprocess.run() function will be called 3 times
And The output file pattern has 3 matches.

Scenario: Something failed
Given An external application, Chapter_14/ch14_r05.py, that works once, then fails after the first run
When The application is invoked 3 times
Then The subprocess.run() function will be called 2 times
And The output file pattern has 0 matches.

The Given steps suggest we'll need to isolate the external application. We'll need two different mock objects to replace the run() function in the subprocess module. We can use mocks because we don't want to actually run the other process; we want to be sure that the run() function is called appropriately by the make_files() function.

One of the mocked run() functions needs to act as if the subprocess finished normally. The other mock needs to behave as if the called process fails.

Testing with mock objects means we never run the risk of accidentally overwriting or deleting useful files when testing. This is a significant benefit of using Python for this kind of automation.

Here are two fixtures to create a Mock object to succeed as well as a Mock object to fail:

from pathlib import Path
from subprocess import CalledProcessError
from unittest.mock import Mock, patch, call
from pytest import *  # type: ignore
import Chapter_14.ch14_r04
@fixture  # type: ignore
def mock_subprocess_run_good():
    def make_file(command, check):
        Path(command[5]).write_text("mock output")
    run_function = Mock(
        side_effect=make_file
    )
    return run_function
@fixture  # type: ignore
def mock_subprocess_run_fail():
    def make_file_or_fail(command, check):
        # First call: work normally and create the output file.
        # Second call: fail with CalledProcessError.
        if run_function.call_count == 1:
            Path(command[5]).write_text("mock output")
        else:
            raise CalledProcessError(13, ['mock', 'command'])
    run_function = Mock(
        side_effect=make_file_or_fail
    )
    return run_function

The mock_subprocess_run_good fixture creates a Mock object. monkeypatch can use this in place of the standard library's subprocess.run() function. This will create some files that are stand-ins for the real work of the underlying Chapter_14/ch14_r05.py application that's being wrapped.

mock_subprocess_run_fail creates a Mock object that will work once; on the second invocation, it will raise a CalledProcessError exception and fail. The first invocation also creates a mock output file. In this case, because of the exception, we'd like that file to be cleaned up.

We also need to use pytest's tmpdir fixture. This provides a temporary directory in which we can create and destroy files safely.

Here's a test case for the "everything worked" scenario:

def test_make_files_clean_good(
        mock_subprocess_run_good,
        monkeypatch,
        tmpdir):
    monkeypatch.setattr(
        Chapter_14.ch14_r04.subprocess,
        'run',
        mock_subprocess_run_good)
    directory = Path(tmpdir)
    Chapter_14.ch14_r04.make_files_clean(directory, files=3)
    expected = [
        call(
            [
                "python",
                "Chapter_13/ch13_r05.py",
                "--samples",
                "10",
                "--output",
                str(tmpdir / f"game_{n}.yaml"),
            ],
            check=True,
        )
        for n in range(3)
    ]
    assert expected == mock_subprocess_run_good.mock_calls
    assert len(list(directory.glob("game_*.yaml"))) == 3

The monkeypatch fixture is used to replace the subprocess.run() function with the Mock object created by our mock_subprocess_run_good fixture. This will write mock results into the given files. This implements the scenario's Given step.

The When step is implemented by invoking the make_files_clean() function. The Then step needs to confirm a couple of things:

  • That each of the calls to subprocess.run() has the expected parameters.
  • That the expected number of output files has been created.

A second test case is required for the second scenario. This will use the mock_subprocess_run_fail fixture. The Then step will confirm that there were two expected calls. The most important part of this second scenario is confirming that zero files were left behind after the clean-up operation.
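
Here's a sketch of how that second test might look, following the same pattern as the first. The exact assertions are our assumptions, derived from the Gherkin scenario, not fixed requirements:

def test_make_files_clean_fail(
        mock_subprocess_run_fail,
        monkeypatch,
        tmpdir):
    monkeypatch.setattr(
        Chapter_14.ch14_r04.subprocess,
        'run',
        mock_subprocess_run_fail)
    directory = Path(tmpdir)
    # The second subprocess failure propagates out of make_files_clean().
    with raises(CalledProcessError):
        Chapter_14.ch14_r04.make_files_clean(directory, files=3)
    # Only two calls: the first succeeded, the second failed.
    assert len(mock_subprocess_run_fail.mock_calls) == 2
    # The clean-up removed the file created by the first call.
    assert len(list(directory.glob("game_*.yaml"))) == 0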

These unit tests provide confidence that the processing will work as advertised. The testing is done without accidentally deleting the wrong files.

See also

  • This kind of automation is often combined with other Python processing. See the Designing scripts for composition recipe in Chapter 13, Application Integration: Configuration.
  • The goal is often to create a composite application; see the Managing arguments and configuration in composite applications recipe earlier in this chapter.
  • For a variation on this recipe, see the Wrapping a program and checking the output recipe, which is next in this chapter.

Wrapping a program and checking the output

One common kind of automation involves running several programs, none of which are actually Python applications. In this case, it's impossible to refactor the programs to create a composite Python application. In order to properly aggregate the functionality, the other programs must be wrapped as a Python class or module to provide a higher-level construct.

The use case for this is very similar to the use case for writing a shell script. The difference is that Python can be a better programming language than the OS's built-in shell languages.

In some cases, the advantage Python offers is the ability to perform detailed aggregation and analysis of the output files. A Python program might transform, filter, or summarize the output from a subprocess.

In this recipe, we'll see how to run other applications from within Python, collecting and processing the other applications' output.

Getting ready

In the Designing scripts for composition recipe in Chapter 13, Application Integration: Configuration, we identified an application that did some processing, leading to the creation of a rather complex result. We'd like to run this program several hundred times, but we don't want to copy and paste the necessary commands into a script. Also, because the shell is difficult to test and has so few data structures, we'd like to avoid using the shell.

For this recipe, we'll assume that the ch13_r05 application is a native binary application written in Go, Fortran, or C++. This means that we can't simply import the Python module that comprises the application. Instead, we'll have to process this application by running a separate OS process.

We will use the subprocess module to run an application program at the OS level. There are two common use cases for running another binary program from within Python:

  • Either there isn't any output, or we don't want to process the output file in our Python program.
  • We need to capture and possibly analyze the output to retrieve information or ascertain the level of success. We might need to transform, filter, or summarize the log output.

In this recipe, we'll look at the second case—the output must be captured and summarized. In the Wrapping and combining CLI applications recipe earlier in this chapter, we looked at the first case, where the output is simply ignored.

Here's an example of running the ch13_r05 application:

(cookbook) % python Chapter_13/ch13_r05.py --samples 10 --output=data/x.yaml
Namespace(output='data/x.yaml', output_path=PosixPath('data/x.yaml'), samples=10, seed=None)
Counter({7: 8, 9: 6, 5: 6, 6: 6, 8: 4, 3: 3, 10: 3, 4: 3, 11: 2, 2: 1, 12: 1})

There are two lines of output that are written to the OS standard output file. The first has a summary of the options, starting with the string Namespace. The second line of output is a summary of the file's data, starting with the string Counter. We want to capture the details of these Counter lines from this application and summarize them.

How to do it...

We'll start by creating a spike solution (https://wiki.c2.com/?SpikeSolution) that shows the command and arguments required to run another application from inside a Python application. We'll transform the spike solution into a function that captures output for further analysis.

  1. Import the argparse and subprocess modules and the Path class. We'll also need the sys module and several type hints:
    import argparse
    from pathlib import Path
    import subprocess
    import sys
    from typing import Counter, List, Any, Iterable, Iterator
    
  2. Write the core processing, using subprocess to invoke the target application. This can be tested separately to be sure that this really is the shell command that's required. In this case, subprocess.run() will execute the given command, and the check=True option will raise an exception if the status is non-zero. In order to collect the output, we've provided an open file in the stdout parameter to subprocess.run(). In order to be sure that the file is properly closed, we've used that file as the context manager for a with statement. Here's a spike solution that demonstrates the essential processing:
    directory, n = Path("/tmp"), 42
    filename = directory / f"game_{n}.yaml"
    temp_path = directory / "stdout.txt"
    command = [
        "python", 
        "Chapter_13/ch13_r05.py",
        "--samples",
        "10",
        "--output",
        str(filename),
    ]
    with temp_path.open('w') as temp_file:
        process = subprocess.run(
            command,
            stdout=temp_file,
            check=True,
            text=True)
    output_text = temp_path.read_text()
    
  3. Wrap the initial spike solution in a function that reflects the desired behavior. We'll decompose this into two parts, delegating command creation to a separate function. This function will consume the commands to execute from an iterable source of commands. This generator function will yield the lines of output gathered as each command is executed. Separating command building from command execution is often helpful when the commands can change. The core use of subprocess.run() is less likely to change. The processing looks like this:
    def command_output_iter(
            temporary: Path, 
            commands: Iterable[List[str]]
        ) -> Iterator[str]:
        for command in commands:
            temp_path = temporary/"stdout"
            with temp_path.open('w') as temp_file:
                process = subprocess.run(
                    command,
                    stdout=temp_file,
                    check=True,
                    text=True)
            output_text = temp_path.read_text()
            output_lines = (
                l.strip() for l in output_text.splitlines())
            yield from output_lines
    
  4. Here's the generator function to create the commands for the command_output_iter() generator function. Because this has been separated, it's slightly easier to respond to design changes in the underlying Chapter_13/ch13_r05.py application. Here's a generator that produces the commands:
    def command_iter(
            directory: Path, 
            files: int
        ) -> Iterable[List[str]]:
        for n in range(files):
            filename = directory / f"game_{n}.yaml"
            command = [
                "python",
                "Chapter_13/ch13_r05.py",
                "--samples",
                "10",
                "--output",
                str(filename),
            ]
            yield command
    
  5. The overall purpose behind this application is to collect and process the output from executing each command. Here's a function to extract the useful information from the command output. This function uses the built-in eval() to parse the output and reconstruct the original Counter object. In this case, the output happens to fit within the kind of things that eval() can parse. Since eval() will execute any expression it's given, this is only acceptable because we trust the output of our own application; a more defensive alternative is sketched after these steps. The generator function looks like this:
    import collections
    def collect_batches(output_lines_iter: Iterable[str]) -> Iterable[Counter[Any]]:
        for line in output_lines_iter:
            if line.startswith("Counter"):
                batch_counter = eval(line)
                yield batch_counter
    
  6. Write the function to summarize the output collected by the collect_batches() function. We now have a stack of generator functions. The command_sequence object yields commands. The output_lines_iter object yields the lines of output from each command. The batch_summaries object yields the individual Counter objects:
    def summarize(
            directory: Path,
            games: int,
            temporary: Path
        ) -> None:
        total_counter: Counter[Any] = collections.Counter()
        command_sequence = command_iter(directory, games)
        output_lines_iter = command_output_iter(
            temporary, command_sequence)
        batch_summaries = collect_batches(output_lines_iter)
        for batch_counter in batch_summaries:
            print(batch_counter)
            total_counter.update(batch_counter)
        print("Total")
        print(total_counter)
    
  7. Write a function to parse the command-line options. In this case, there are two positional parameters: a directory and a number of games to simulate. The function looks like this:
    def get_options(
            argv: List[str] = sys.argv[1:]
        ) -> argparse.Namespace:
        parser = argparse.ArgumentParser()
        parser.add_argument("directory", type=Path)
        parser.add_argument("games", type=int)
        options = parser.parse_args(argv)
        return options
    
  8. Combine the parsing and execution into a main function, with the standard guard so the script can be run from the command line:
    def main() -> None:
        options = get_options()
        summarize(
            directory=options.directory,
            games=options.games,
            temporary=Path("/tmp")
        )

    if __name__ == "__main__":
        main()
    

Now we can run this new application and have it execute the underlying application and gather the output, producing a helpful summary. We've built this using Python instead of a bash (or other shell) script. Python offers more useful data structures and unit testing.
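
As promised in step 5, here's a more defensive alternative to the eval()-based parser. This is a sketch, assuming the Counter lines always have the shape shown above; it extracts the dictionary literal with a regular expression and parses it with ast.literal_eval(), which cannot execute arbitrary code:

import ast
import collections
import re
from typing import Any, Counter, Iterable, Iterator

COUNTER_PAT = re.compile(r"^Counter\((\{.*\})\)$")

def collect_batches_safe(
        output_lines_iter: Iterable[str]
) -> Iterator[Counter[Any]]:
    for line in output_lines_iter:
        match = COUNTER_PAT.match(line)
        if match:
            # literal_eval() parses the {...} literal without
            # executing any code.
            yield collections.Counter(
                ast.literal_eval(match.group(1)))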

How it works...

The subprocess module is how Python programs run other programs available on a given computer. The run() function does a number of things for us.

In a POSIX (such as Linux or macOS) context, the steps are similar to the following:

  • Prepare the stdin, stdout, and stderr file descriptors for the child process. In this case, we've arranged for the child's standard output to go to a temporary file that the parent reads after the child finishes. The stderr output, on the other hand, is left alone: the child inherits the same connection the parent has, and error messages will be displayed on the same console being used by the parent.
  • Use os.fork() to split the current process into parent and child. In the child process, a function like os.execve() then replaces the running image with the requested program.
  • The child process then starts, using the given stdin, stdout, and stderr.
  • While the child is running, the parent is waiting for the child process to finish.
  • Since we used the check=True option, a non-zero status is transformed into an exception.
  • Once the child completes processing, the file opened by the parent for collecting standard output can be read by the parent.

The subprocess module gives us access to one of the most important parts of the operating system: launching a subprocess. Because we can tailor the environment, we have tremendous control over the application that is started by our Python application.
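
To make these steps concrete, here's a minimal sketch of the same fork/exec/wait sequence written directly with the low-level os functions. This is illustrative only, and POSIX-only; subprocess.run() handles all of this, plus error handling, portably:

import os
from typing import List

def run_sketch(command: List[str], stdout_path: str) -> int:
    """Illustration of what subprocess.run() manages for us."""
    pid = os.fork()  # Split into parent and child processes.
    if pid == 0:
        # Child: send standard output to a file, then start the program.
        fd = os.open(stdout_path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        os.dup2(fd, 1)  # File descriptor 1 is standard output.
        os.execvp(command[0], command)  # Replaces the child; never returns.
    # Parent: wait for the child and decode its exit status.
    _, status = os.waitpid(pid, 0)
    return os.waitstatus_to_exitcode(status)  # Python 3.9+.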

There's more...

Once we've wrapped Chapter_13/ch13_r05.py (assumed for the sake of exposition to be an executable file originally coded in whatever language its authors preferred!) within a Python application, we have a number of alternatives for improving the output. In particular, the summarize() function in our wrapper application shares a design problem with the underlying Chapter_13/ch13_r05.py application: the output is in Python's repr() format.

When we run it, we see the following:

(cookbook) % python Chapter_14/ch14_r05.py data 5  
Counter({7: 6, 8: 5, 10: 5, 3: 4, 5: 3, 4: 3, 9: 2, 6: 2, 2: 1, 11: 1})
Counter({6: 7, 5: 5, 7: 5, 8: 3, 12: 3, 4: 2, 3: 2, 10: 1, 9: 1})
Counter({8: 7, 6: 5, 7: 4, 5: 4, 11: 4, 4: 4, 10: 3, 12: 2, 9: 2, 3: 1})
Counter({5: 7, 9: 6, 3: 5, 11: 4, 8: 4, 10: 3, 4: 3, 6: 3, 12: 2, 7: 2})
Counter({6: 6, 5: 6, 3: 5, 8: 5, 7: 4, 9: 3, 4: 3, 10: 2, 11: 2, 12: 1})
Total
Counter({5: 25, 8: 24, 6: 23, 7: 21, 3: 17, 4: 15, 10: 14, 9: 14, 11: 11, 12: 8, 2: 1}) 

An output file in standard JSON or CSV would be more useful.

Because we've wrapped the underlying application, we don't need to change the underlying ch13_r05 application to change the results it produces. We can modify our wrapper program, leaving the original data generator intact.

We need to refactor the summarize() function to replace the print() function calls with a function that has a more useful format. A possible rewrite would change this function into two parts: one part would create the Counter objects, the other part would consume those Counter objects and write them to a file:

import csv
import sys

def summarize_2(
        directory: Path,
        games: int,
        temporary: Path
) -> None:

    def counter_iter(
            directory: Path,
            games: int,
            temporary: Path
    ) -> Iterator[Counter[Any]]:
        total_counter: Counter[Any] = collections.Counter()
        command_sequence = command_iter(directory, games)
        output_lines_iter = command_output_iter(
            temporary, command_sequence)
        batch_summaries = collect_batches(output_lines_iter)
        for batch_counter in batch_summaries:
            yield batch_counter
            total_counter.update(batch_counter)
        yield total_counter

    wtr = csv.writer(sys.stdout)
    for counter in counter_iter(directory, games, temporary):
        array = [counter[i] for i in range(20)]
        wtr.writerow(array)

This variation on the summarize() function emits output in CSV format. The internal counter_iter() generator does the essential processing, creating a Counter summary from each run of the simulation. That output is consumed by the for statement that iterates over counter_iter(). Each Counter object is expanded into a row of counts for game lengths from zero to nineteen.

This rewrite didn't involve changing the underlying application. We were able to build useful features separately by creating layers of features. Leaving the underlying application untouched can help perform regression tests to be sure the core statistical validity has not been harmed by adding new features.
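
A JSON variant is equally small. Here's a hedged sketch, assuming the counter_iter() generator has been lifted to module level; it writes one JSON object per summary, in newline-delimited JSON form:

import json
import sys

def summarize_json(
        directory: Path,
        games: int,
        temporary: Path
) -> None:
    # One JSON object per line; json.dumps() converts the
    # integer keys of each Counter to strings.
    for counter in counter_iter(directory, games, temporary):
        sys.stdout.write(json.dumps(dict(counter)) + "\n")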

See also

  • See the Wrapping and combining CLI applications recipe from earlier in this chapter for another approach to this recipe.
  • This kind of automation is often combined with other Python processing. See the Designing scripts for composition recipe in Chapter 13, Application Integration: Configuration.
  • The goal is often to create a composite application; see the Managing arguments and configuration in composite applications recipe from earlier in this chapter.
  • Many practical applications will work with more complex output formats. For information on processing complex line formats, see the String parsing with regular expressions recipe in Chapter 1, Numbers, Strings, and Tuples, and the Reading complex formats using regular expressions recipe in Chapter 8, More Advanced Class Design. Much of Chapter 10, Input/Output, Physical Format, and Logical Layout, relates to the details of parsing input files.

Controlling complex sequences of steps

In the Combining two applications into one recipe earlier in this chapter, we looked at ways to combine multiple Python scripts into a single, longer, more complex operation. In the Wrapping and combining CLI applications and Wrapping a program and checking the output recipes earlier in this chapter, we looked at ways to use Python to wrap not-necessarily-Python executable programs.

We can combine these techniques to create even more flexible processing. We can create longer, more complex sequences of operations.

Getting ready

In the Designing scripts for composition recipe in Chapter 13, Application Integration: Configuration, we created an application that did some processing that led to the creation of a rather complex result. In the Using logging for control and audit output recipe in Chapter 13, Application Integration: Configuration, we looked at a second application that built on those results to create a sophisticated statistical summary.

The overall process looks like this:

  1. Run the ch13_r05 program 100 times to create 100 intermediate files.
  2. Run the ch13_r06 program to summarize those intermediate files.

For the purposes of this recipe, we'll assume that neither of these applications is written in Python. We'll pretend that they're written in Fortran or Ada or some other language that's not directly compatible with Python.

In the Combining two applications into one recipe, we looked at how we can combine Python applications. When applications are not written in Python, a little additional work is required.

This recipe uses the Command Design Pattern. This supports the expansion and modification of the sequences of commands by creating new subclasses of an abstract base class.

How to do it...

We'll use the Command Design Pattern to define classes to wrap the external commands. Using classes will let us assemble more complex sequences and alternative processing scenarios where Python objects act as proxies for external applications run as subprocesses.

  1. We'll define an abstract Command class. The other commands will be defined as subclasses. The execute() method works by first creating the OS-level command to execute. Each subclass will provide distinct rules for the commands that are wrapped. Once the OS-level command has been built, the run() function of the subprocess module will process the OS command. In this recipe, we're using subprocess.PIPE to collect the output. Since run() buffers the entire output in memory, this is only suitable for commands with relatively small outputs.

    The os_command() method builds the sequence of text strings comprising a command to be executed by the OS. The value of the options parameter will be used to customize the argument values used to assemble the command. This superclass implementation provides some debugging information. Each subclass must override this method to create the unique OS command required to perform some useful work:

    import argparse
    import os
    import subprocess
    from typing import List

    class Command:
        def execute(
                self,
                options: argparse.Namespace
        ) -> str:
            self.command = self.os_command(options)
            results = subprocess.run(
                self.command,
                check=True,
                stdout=subprocess.PIPE,
                text=True
            )
            self.output = results.stdout
            return self.output

        def os_command(
                self,
                options: argparse.Namespace
        ) -> List[str]:
            return [
                "echo", self.__class__.__name__, repr(options)]
    
  2. We can create a Command subclass to define a command to simulate the game and create samples. In this case, we provided an override for the execute() method so this class can change the OS environment variables before executing the underlying OS command. This allows an integration test to set a specific random seed and confirm that the results match a fixed set of expected values:
    import Chapter_13.ch13_r05 as ch13_r05
    class Simulate(Command):
        def execute(
                self,
                options: argparse.Namespace
        ) -> str:
            if getattr(options, "seed", None) is not None:
                os.environ["RANDOMSEED"] = str(options.seed)
            return super().execute(options)
    
  3. The os_command() method of the Simulate class emits the sequence of words for a command to execute the ch13_r05 application. This converts the numeric value of options.samples to a string as required by the interface to the OS:
        def os_command(
                self,
                options: argparse.Namespace
        ) -> List[str]:
            return [
                "python",
                "Chapter_13/ch13_r05.py",
                "--samples",
                str(options.samples),
                "-o",
                options.game_file,
            ]
    
  4. We can also extend the Command superclass to define a subclass, Summarize, to summarize the various simulation processes. In this case, we only implemented os_command(). This implementation provides the arguments for the ch13_r06 command:
    import Chapter_13.ch13_r06 as ch13_r06
    class Summarize(Command):
        def os_command(
                self, 
                options: argparse.Namespace
        ) -> List[str]:
            return [
                "python",
                "Chapter_13/ch13_r06.py",
                "-o",
                options.summary_file,
            ] + options.game_files
    
  5. Given these two commands, the overall main program can follow the design pattern from the Designing scripts for composition recipe. We need to gather the options, and then use these options to execute the two commands:
    from argparse import Namespace

    def demo():
        options = Namespace(
            samples=100,
            game_file="data/x12.yaml",
            game_files=["data/x12.yaml"],
            summary_file="data/y12.yaml",
            seed=42
        )
        step1 = Simulate()
        step2 = Summarize()
        output1 = step1.execute(options)
        print(step1.command, output1)
        output2 = step2.execute(options)
        print(step2.command, output2)
    

This demonstration function, demo(), creates a Namespace instance with the parameters that could have come from the command line. It builds the two processing steps. Finally, it executes each step, displaying the collected output.

This kind of function provides a high-level script for executing a sequence of applications. It's considerably more flexible than the shell, because we can make use of Python's rich collection of data structures. Because we're using Python, we can more easily include unit tests as well.
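
For example, a unit test for the command-building logic needs no subprocess at all. Here's a sketch, assuming the Simulate class shown above:

from argparse import Namespace

def test_simulate_os_command():
    options = Namespace(samples=100, game_file="data/x12.yaml")
    step = Simulate()
    # No process is started; we only check the word list.
    assert step.os_command(options) == [
        "python",
        "Chapter_13/ch13_r05.py",
        "--samples",
        "100",
        "-o",
        "data/x12.yaml",
    ]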

How it works...

There are two interlocking design patterns in this recipe:

  • The Command class hierarchy
  • Wrapping external commands by using the subprocess.run() function

The idea behind a Command class hierarchy is to make each separate step or operation into a subclass of a common superclass. In this case, we've called that superclass Command. The two operations are subclasses of the Command class. This assures that we can provide common features to all of the classes.

Wrapping external commands has several considerations. One primary question is how to build the command-line options that are required. In this case, the run() function will use a list of individual words, making it very easy to combine literal strings, filenames, and numeric values into a valid set of options for a program.

The other primary question is how to handle the OS-defined standard input, standard output, and standard error files. In some cases, these files can be displayed on the console. In other cases, the application might capture those files for further analysis and processing.

The essential idea here is to separate two considerations:

  1. The overview of the commands to be executed. This includes questions about sequences, iteration, conditional processing, and potential changes to the sequence. These are higher-level considerations related to the user stories.
  2. The details of how to execute each command. This includes command-line options, output files used, and other OS-level considerations. These are more technical considerations of the implementation details.

Separating the two makes it easier to implement or modify the user stories. Changes to the OS-level considerations should not alter the user stories; the process might be faster or use less memory, but is otherwise identical. Similarly, changes to the user stories do not need to break the OS-level considerations. The user's interaction with the low-level applications is mediated by a flexible layer of Python.

What's central here is that a user story captures what the user needs to do. We often write them as "As a [persona…], I [want to…], [so that…]." This captures the user's goal in a form that we can use to build application software.

There's more...

A complex sequence of steps can involve iteration of one or more steps. Since the high-level script is written in Python, adding iteration is done with the for statement:

class IterativeSimulate(Command):
    """Iterative simulation of many games."""
    def execute(
            self,
            options: argparse.Namespace
    ) -> str:
        step1 = Simulate()
        options.game_files = []
        output = ""
        for i in range(options.simulations):
            options.game_file = f"data/game_{i}.yaml"
            options.game_files.append(options.game_file)
            output += step1.execute(options)
        step2 = Summarize()
        output += step2.execute(options)
        return output

This IterativeSimulate subclass of Command will process the Simulate step many times. It uses the simulations option to specify how many simulations to run. Each simulation will produce the expected number of samples.

The execute() method sets a distinct value for the game_file option on each iteration. Each of the resulting filenames will be unique, leading to a number of sample files. The list of files is also collected into the game_files option.

When the next step, the Summarize class, is executed, it will have the proper list of files to process. The Namespace object, assigned to the options variable, can be used to track global state changes and provide this information to subsequent processing steps.

Because this is a subclass of Command, we can be sure that it is interchangeable with other commands. We might also use the Sequence subclass of Command from the Combining many applications using the Command Design Pattern recipe to create more complex sequences of commands.

Building conditional processing

Since the high-level programming is written in Python, we can add additional processing that isn't based on the two applications that are wrapped. One feature might be an optional summarization step.

For example, if the options do not have a summary_file option, then any summarization processing can be skipped. This might lead to a subclass of the Command class that looks like this:

class ConditionalSummarize(Command):
    """Conditional Summarization"""
    def execute(
            self,
            options: argparse.Namespace
    ) -> str:
        step1 = Simulate()
        output = step1.execute(options)
        if "summary_file" in options:
            step2 = Summarize()
            output += step2.execute(options)
        return output

This ConditionalSummarize class will process the Summarize step conditionally. It will only create an instance of the Summarize class if there is a summary_file option.

We've used a subclass of Command to promote the idea of composability. We should be able to compose a more complex solution from individual components. In this case, the Python components are classes that wrap external commands and applications.

In this example, as in the previous one, we've used Python programming to augment the two wrapped applications with iteration and conditional features. The concept extends to error handling and recovery as well. We could use Python to clean up incomplete files, or to rename files so that the latest and greatest results are always available after processing. A sketch of such a rename-on-success step follows.
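
Here's a hedged sketch of a rename-on-success step. The class name and the .tmp suffix convention are our assumptions; it reuses the Summarize class from earlier and assumes pathlib.Path and subprocess are imported:

class SafeSummarize(Command):
    """Write to a temporary name; rename only on success."""
    def execute(
            self,
            options: argparse.Namespace
    ) -> str:
        final = Path(options.summary_file)
        temp = final.with_suffix(".tmp")
        options.summary_file = str(temp)
        try:
            output = Summarize().execute(options)
        except subprocess.CalledProcessError:
            temp.unlink(missing_ok=True)  # Discard any partial file (3.8+).
            raise
        temp.replace(final)  # Atomic rename: the last good summary survives.
        return output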

See also

  • Generally, these kinds of processing steps are done for larger or more complex applications. See the Combining two applications into one and Managing arguments and configuration in composite applications recipes from earlier in this chapter for ways to work with larger and more complex composite applications.
  • See Replacing a file while preserving the previous version in Chapter 9, Functional Programming Features, for a way to create output files so that a useful version is always available in spite of problems with unreliable networks or binary applications.
  • See Separating concerns via multiple inheritance in Chapter 7, Basics of Classes and Objects, for some additional ideas for designing a Command class hierarchy to handle complex relationships among applications.
  • When building more complex applications, consider the Using logging for control and audit output recipe in Chapter 13, Application Integration: Configuration, for ways to integrate logging as a consistent aspect of the various applications.
  • For more information on user stories, see https://www.atlassian.com/agile/project-management/user-stories